This is an archive of the discontinued LLVM Phabricator instance.

Differential D142782

[AMDGPU] Add basic support for extended i8 perm matching
ClosedPublic

Authored by jrbyrnes on Jan 27 2023, 1:30 PM.

Details

Reviewers

arsenm
Pierre-vh
foad
jmmartinez

Commits

rGac2d6df2d6a8: [AMDGPU] Add basic support for extended i8 perm matching

Summary

Implement a traversal algorithm to match trees to i8 vperms. For ors that can be combined into perms, we expect to see a pattern that combines four 8 bit operands (actually 16 bit operands, with only 8 nonzero bits) into two 16 bit operands, and then combines these two 16 bit operands via the or (after a zext and an ext-shift). The trees that do this type of combination are one of the two relevant classes of trees, and are matched in calculateByteProvider. The 8 bit operands used in such a tree are typically produced via an AND op or an SRL op, and are the leaves of the trees in calculateByteProvider. The other relevant class of trees are those that map a leaf of calculateByteProvider to an ultimate source. This class of trees is matched in calculateSrcByte.

Through this recursive process, we track an Index (SrcIndex in calculateSrcByte), which is the byte of the current op that maps to the byte of the dest of the or we are currently mapping. For example, the 4th byte of the dest of SHL Src, 16 maps to the 2nd byte of Src. Through basic rules like this we can map src bytes to the dest bytes of the or. Using this mapping we can create perm masks.
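
The byte-index rule can be sketched as follows. This is a minimal illustration and not the code from the patch: the helper names `srcByteForShl` and `srcByteForSrl` are hypothetical, indices are 0-based (so the "4th byte" is index 3), and a `std::nullopt` result stands for both "this byte is zero-filled" and "the shift is not a whole number of bytes", which a real traversal would need to tell apart.

```cpp
#include <optional>

// Hypothetical helper (not from the patch): for Dest = Src << ShiftBits,
// return the 0-based byte of Src that lands in byte DestByte of Dest.
std::optional<unsigned> srcByteForShl(unsigned DestByte, unsigned ShiftBits) {
  if (ShiftBits % 8 != 0)
    return std::nullopt; // not a byte-granular shift
  unsigned ByteShift = ShiftBits / 8;
  if (DestByte < ByteShift)
    return std::nullopt; // this byte is filled with shifted-in zeros
  return DestByte - ByteShift;
}

// Hypothetical helper: the same mapping for Dest = Src >> ShiftBits
// (logical shift) on a value that is WidthInBytes bytes wide.
std::optional<unsigned> srcByteForSrl(unsigned DestByte, unsigned ShiftBits,
                                      unsigned WidthInBytes = 4) {
  if (ShiftBits % 8 != 0)
    return std::nullopt; // not a byte-granular shift
  unsigned SrcByte = DestByte + ShiftBits / 8;
  if (SrcByte >= WidthInBytes)
    return std::nullopt; // this byte is filled with shifted-in zeros
  return SrcByte;
}
```

With these rules, `srcByteForShl(3, 16)` yields 1, which matches the SHL example in the summary.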

Much of the code for calculateByteProvider was borrowed from CodeGen/SelectionDAG/DAGCombiner.cpp (MatchLoadCombine). There are still many candidate trees that could be matched into perms but that this patch does not attempt to match. Those are saved for future iterations.

Depends on: https://reviews.llvm.org/D143018

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

	Time	Test
	540 ms	x64 debian > LLVM.CodeGen/AMDGPU::andorn2.ll

Event Timeline

jrbyrnes created this revision. Jan 27 2023, 1:30 PM

Herald added a project: Restricted Project. Jan 27 2023, 1:30 PM

Herald added subscribers: kosarev, foad, kerbowa and 8 others.

jrbyrnes requested review of this revision. Jan 27 2023, 1:30 PM

Herald added a project: Restricted Project. Jan 27 2023, 1:30 PM

Herald added subscribers: llvm-commits, wdng.

arsenm added inline comments. Jan 27 2023, 2:32 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
9768	Can you keep this as a generic utility?
9883	Needs to move to std::optional
llvm/test/CodeGen/AMDGPU/permute_i8.ll
91	Test needs to use opaque pointers

Harbormaster completed remote builds in B210465: Diff 492893. Jan 27 2023, 4:17 PM

Blacklist or->perm combine for certain users of or. Some ops (e.g. V_CVT_F32_UBYTE) are performed in a bytewise manner. If the or has such a user, it is better to leave the DAG in its uncombined state, since we would otherwise need to extract bytes from the combined result.

Herald added a subscriber: ecnelises. Jan 30 2023, 3:24 PM

Fix typo

Harbormaster completed remote builds in B210874: Diff 493437. Jan 30 2023, 4:41 PM

jrbyrnes updated this revision to Diff 493458. Jan 30 2023, 5:20 PM

jrbyrnes marked 2 inline comments as done.

Resolve remaining regressions + Separate out ByteProvider + std::optional

Harbormaster completed remote builds in B210887: Diff 493458. Jan 30 2023, 7:06 PM

Rebase

Harbormaster completed remote builds in B211076: Diff 493738. Jan 31 2023, 2:33 PM

jrbyrnes retitled this revision from [AMDGPU] WIP: Add basic support for extended i8 perm matching to [AMDGPU] Add basic support for extended i8 perm matching. Jan 31 2023, 2:36 PM

jrbyrnes edited the summary of this revision.

jrbyrnes added a reviewer: arsenm.

The ratio of test to code changes has me worried the test coverage is incomplete

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
9768	By generic utility I mean in generic code and used by the load combine as well
9811	6 is the one true recursion depth limit
10059	Don't need a small vector, can just directly iterate an initializer list?
10062	Is the multiple use case covered in the tests?
10125	StartingIndex=
10127	!P
10165	Don't understand why you need the const_cast. Also, why isn't this an SDValue to begin with? Assuming output 0 can be risky
llvm/test/CodeGen/AMDGPU/permute_i8.ll
3	Drop -opaque-pointers (also direction doesn't make sense with the test contents)
7	Need to use typed pointers, also should prefer global loads to flat

jrbyrnes marked an inline comment as done. Jan 31 2023, 4:29 PM

jrbyrnes added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
9768	Hey Matt -- thanks for comments. I think I don't fully understand this one -- I guess you didn't mean https://reviews.llvm.org/D143018 ? By generic, do you mean templated base class (perhaps in ADT) ?

arsenm added inline comments. Feb 2 2023, 6:26 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
9768	I didn't see that, I mean share code with the MatchLoadCombine that you mentioned. I didn't think about the details of that (something abstracter might be good since the same thing will need to be ported for GlobalISel)

Rebase + review comments.

Herald added a subscriber: pcwang-thead. Feb 3 2023, 11:28 AM

In D142782#4095092, @arsenm wrote:

The ratio of test to code changes has me worried the test coverage is incomplete

This patch is intended to simply bring in the components necessary for i8 perm matching. Fitting / tuning it to be optimally useful in actual workloads is left to a future iteration. As such, the heuristics / conditions we use to apply the combine are very restrictive (e.g. no multi use operands in or, IsCombineVectorized heuristic, no support for 16 bit ors, etc). For this iteration, my primary concerns for testing were: true positives (i.e. testing accurate production of v_perm when we expect it), and false positives (correctness errors / inefficient codegen). False negatives (missed opportunities) are left to a future iteration.

True positive coverage:
There are 4096 4xi8 shuffle_vector permutations. I tested and validated all of them. The tests initially included cover all trees for these permutations.
There are 8192 4xi8 shuffle_vector permutations where 1 operand is undef. This iteration doesn't fully support these. Of these, about 2k are lowered to v_perm by this iteration. I validated all of these.

False positive coverage:
lit tests
CK correctness tests
epsdb (to be run)

Harbormaster completed remote builds in B211772: Diff 494696. Feb 3 2023, 11:33 AM

jrbyrnes mentioned this in D143018: [DAGCombiner][NFC] Factor out ByteProvider. Feb 7 2023, 9:49 AM

passes psdb

Add tests for gating conditions on attempting to v_perm combine + nits

Harbormaster completed remote builds in B212889: Diff 496236. Feb 9 2023, 2:14 PM

Adding reviewers due to size of diff

Herald added a subscriber: StephenFan. Feb 9 2023, 2:20 PM

Add opaque tests

Harbormaster completed remote builds in B215325: Diff 499591. Feb 22 2023, 11:08 AM

arsenm added inline comments. Feb 22 2023, 11:17 AM

llvm/test/CodeGen/AMDGPU/permute_i8.ll
7	Opaque pointer tests are not additional, the tests need to be just converted. There are 0 remaining typed pointer AMDGPU tests

Convert test to opaque pointers

Harbormaster completed remote builds in B215342: Diff 499615. Feb 22 2023, 11:59 AM

foad added inline comments. Feb 23 2023, 2:26 AM

llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll
42	These kinds of changes look like regressions for some combination of code size / latency / sgpr pressure.

arsenm added inline comments. Feb 23 2023, 6:53 AM

llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll
64	Yes, this is worse. Should avoid cases that can use v_lshl_or_b32

Check that we are actually doing 8 bit extraction before lowering into v_perm.

We can determine (based on the potential perm mask and operands) if we need to insert any 8 bit extraction code.

For example, a perm mask of 0x05040100 suggests we will not need to extract any bits from the operands iff they have 16 bits of data (e.g. zext 16 load into 32 bit). In this case, we assume CodeGen will lower it well, and do not combine into v_perm. If, however, the operands are 32 bit, then we will need to insert mask code, so we do lower to v_perm.

As another example, if we have a mask of 0x05040201 then we will lower into v_perm for multiple reasons: 1. the 0x0201 portion of the mask implies a 32 bit operand, 2. the 0x0201 portion of the mask is not well formed, since it requires a shift instruction to address these bits.

Finally, if the mask and operands indicate we are just producing one of the ops, combine the tree into the op.
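
To make the two example masks concrete, here is a small software model of a byte permute. It is a sketch under assumptions, not the hardware definition or the patch's code: the function name `permBytes` is hypothetical, selectors 0-3 are taken to pick bytes of `Lo` and 4-7 bytes of `Hi` (which hardware operand plays which role is assumed), and only the selector values named in the getPermuteMask comment (0-3 source byte, 0x0c zero, 0xff yields 0xff) plus 4-7 are modeled.

```cpp
#include <cstdint>

// Hypothetical model of a two-operand byte permute. Each byte of Mask is a
// selector for the corresponding byte of the result.
uint32_t permBytes(uint32_t Hi, uint32_t Lo, uint32_t Mask) {
  // View the operands as one 64-bit value with Lo in bytes 0-3, Hi in 4-7.
  uint64_t Combined = (uint64_t(Hi) << 32) | Lo;
  uint32_t Result = 0;
  for (unsigned I = 0; I < 4; ++I) {
    uint32_t Sel = (Mask >> (8 * I)) & 0xff;
    uint32_t Byte = 0; // 0x0c, and selectors this sketch does not model
    if (Sel < 8)
      Byte = uint32_t(Combined >> (8 * Sel)) & 0xff;
    else if (Sel == 0xff)
      Byte = 0xff;
    Result |= Byte << (8 * I);
  }
  return Result;
}
```

Under this model, `permBytes(0xAABBCCDD, 0x11223344, 0x05040100)` gives 0xCCDD3344, which is just the low halves of both operands packed together, so no byte extraction is needed when each operand only holds 16 bits of data. With mask 0x05040201 the result is 0xCCDD2233, where the low half comes from bytes 1 and 2 of `Lo` and would otherwise require a shift.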

Feature resulted in changes -- optimally not lowering into v_perm -- in:
CodeGen/AMDGPU/combine-vload-extract.ll
CodeGen/AMDGPU/cvt_f32_ubyte.ll
CodeGen/AMDGPU/ds_read2.ll
CodeGen/AMDGPU/fast-unaligned-load-store.global.ll
CodeGen/AMDGPU/fast-unaligned-load-store.private.ll
CodeGen/AMDGPU/load-hi16.ll
CodeGen/AMDGPU/load-local.128.ll
CodeGen/AMDGPU/load-local.96.ll
CodeGen/AMDGPU/permute.ll

All 4096 permutations of the <4 x i8> shufflevector produced the desired result (including <i32 0, i32 1, i32 2, i32 3> and <i32 4, i32 5, i32 6, i32 7>, which lower into the corresponding 32 bit operand).
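
The count of 4096 follows from each of the four result lanes choosing one of eight source bytes (8^4). A minimal sketch of enumerating these masks is shown here; the names `shuffleToPermMask`, `MaskStats` and `enumerateMasks` are hypothetical, and the direct mapping of shuffle index i to byte selector i is an assumption made for illustration, since the real operand order may differ.

```cpp
#include <array>
#include <cstdint>

// Hypothetical mapping: shuffle index i in [0, 8) becomes byte selector i.
uint32_t shuffleToPermMask(const std::array<unsigned, 4> &Idx) {
  uint32_t Mask = 0;
  for (unsigned Lane = 0; Lane < 4; ++Lane)
    Mask |= uint32_t(Idx[Lane] & 0x7) << (8 * Lane);
  return Mask;
}

struct MaskStats {
  unsigned Total = 0;         // number of shuffle masks visited
  unsigned SingleOperand = 0; // masks that reproduce one operand unchanged
};

// Visit every <4 x i8> shuffle of two 4-byte operands.
MaskStats enumerateMasks() {
  MaskStats S;
  for (unsigned A = 0; A < 8; ++A)
    for (unsigned B = 0; B < 8; ++B)
      for (unsigned C = 0; C < 8; ++C)
        for (unsigned D = 0; D < 8; ++D) {
          uint32_t Mask =
              shuffleToPermMask(std::array<unsigned, 4>{A, B, C, D});
          ++S.Total;
          // <0,1,2,3> and <4,5,6,7> select a whole operand, so no
          // permute instruction is needed for them.
          if (Mask == 0x03020100u || Mask == 0x07060504u)
            ++S.SingleOperand;
        }
  return S;
}
```

Calling `enumerateMasks()` visits 4096 masks, of which exactly two are the whole-operand cases mentioned above.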

Harbormaster completed remote builds in B215597: Diff 499971. Feb 23 2023, 1:48 PM

Ping

Should ByteProvider really be BitProvider?

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
9856	Why 8? Usually 6 is the one true recursion depth limit
9866–9869	Do the RHS calls first and short circuit the second call if the first failed
9955	Do you really need the explicit std::optional?
10068	llvm::any_of?
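
The short-circuit suggestion for lines 9866–9869 might look roughly like the sketch below. It is not the code under review: `Provider`, `combineOr` and the callback parameters are hypothetical stand-ins, and the rule that an or only yields a known byte when one side is constant zero is assumed to carry over from the generic load-combine logic.

```cpp
#include <functional>
#include <optional>

// Hypothetical stand-in for a byte provider: either constant zero or a
// known source byte.
struct Provider {
  unsigned SrcByte;
  bool IsZero;
};

using ProviderFn = std::function<std::optional<Provider>()>;

// Evaluate the RHS first and skip the LHS traversal entirely if the RHS
// could not be resolved.
std::optional<Provider> combineOr(const ProviderFn &CalcRHS,
                                  const ProviderFn &CalcLHS) {
  std::optional<Provider> RHS = CalcRHS();
  if (!RHS)
    return std::nullopt; // short circuit: LHS is never evaluated
  std::optional<Provider> LHS = CalcLHS();
  if (!LHS)
    return std::nullopt;
  // A byte of (LHS | RHS) has a single origin only if one side is zero.
  if (RHS->IsZero)
    return LHS;
  if (LHS->IsZero)
    return RHS;
  return std::nullopt;
}
```

Because the second traversal is skipped on failure, the set of nodes visited can differ from the original ordering, which is consistent with the later note that short-circuiting slightly changes algorithmic behavior.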

arsenm added inline comments. Mar 17 2023, 11:24 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
10024	Need to make sure this is scalar
10031	Need to make sure this is scalar

Thanks @arsenm for taking another look.

Address review comments. Need to rerun psdb since short-circuiting LHS slightly changes algorithmic behavior.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
9856	Based on testing, I can lower depth, but we must accept depth 6 since this is the max tree depth across build vectors that should be lowered into v_perm. Relevant test is already included in permute_i8.ll. This depth may need to change in future iteration.
9955	Yes, it cannot infer std::optional, even after changing order.

Harbormaster completed remote builds in B220498: Diff 506658. Mar 20 2023, 11:13 AM

arsenm added inline comments. Apr 20 2023, 10:54 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
9990	Braces
10031	This isn't checking for a scalar type?
10045	Can this be a separate function? It's a lambda that doesn't capture anything

Rebase + passed PSDB

Herald added a subscriber: bzcheeseman. May 9 2023, 12:33 PM

Sorry @arsenm, I somehow missed your comments -- I'll address those.

Harbormaster completed remote builds in B230942: Diff 520792. May 9 2023, 1:53 PM

Address comments

Harbormaster completed remote builds in B230987: Diff 520852. May 9 2023, 4:13 PM

Cleanup checking in is16BitScalarOp + ping

Harbormaster completed remote builds in B234235: Diff 525221. May 24 2023, 10:08 AM

Thanks for comments thus far -- any other concerns here?

arsenm accepted this revision. Jun 6 2023, 11:23 AM

arsenm added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
9807	No reference
10073	Doesn't feel like this is the right way to express it. Should handle CopyToRegs at least too
10080	Don't need llvm::
10136	Don't these default initialize to nullopt?

This revision is now accepted and ready to land. Jun 6 2023, 11:23 AM

Thanks @arsenm for the review.

I'll take another look at the v_perm candidate whitelist method in a subsequent patch which extends this work to capture more patterns.

Harbormaster completed remote builds in B237036: Diff 528985. Jun 6 2023, 12:15 PM

arsenm accepted this revision. Jun 8 2023, 4:15 PM

jrbyrnes added a parent revision: D143018: [DAGCombiner][NFC] Factor out ByteProvider. Jun 8 2023, 4:29 PM

Rebase

This revision was landed with ongoing or failed builds. Jun 19 2023, 9:54 AM

Closed by commit rGac2d6df2d6a8: [AMDGPU] Add basic support for extended i8 perm matching (authored by jrbyrnes).

This revision was automatically updated to reflect the committed changes.

jrbyrnes added a commit: rGac2d6df2d6a8: [AMDGPU] Add basic support for extended i8 perm matching.

In D142782#4433155, @jrbyrnes wrote:

Rebase

Why rebase a closed revision?

In D142782#4433237, @arsenm wrote:

In D142782#4433155, @jrbyrnes wrote:

Rebase

Why rebase a closed revision?

It was for posterity. Sorry for confusion.

Harbormaster completed remote builds in B239836: Diff 532694. Jun 19 2023, 11:00 AM

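Revision Contents

	Path	Size
	llvm/include/llvm/CodeGen/DAGCombine.h	39 lines
	llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp	59 lines
	llvm/lib/Target/AMDGPU/SIISelLowering.cpp	345 lines
	llvm/test/CodeGen/AMDGPU/combine-vload-extract.ll	16 lines
	llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll	10 lines
	llvm/test/CodeGen/AMDGPU/ds_read2.ll	25 lines
	llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll	7 lines
	llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll	38 lines
	llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2i16.ll	24 lines
	llvm/test/CodeGen/AMDGPU/load-hi16.ll	36 lines
	llvm/test/CodeGen/AMDGPU/load-lo16.ll	56 lines
	llvm/test/CodeGen/AMDGPU/load-local.128.ll	106 lines
	llvm/test/CodeGen/AMDGPU/load-local.96.ll	81 lines
	llvm/test/CodeGen/AMDGPU/pack.v2f16.ll	8 lines
	llvm/test/CodeGen/AMDGPU/pack.v2i16.ll	8 lines
	llvm/test/CodeGen/AMDGPU/permute.ll	25 lines
	llvm/test/CodeGen/AMDGPU/permute_i8.ll	316 lines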

Diff 493458

llvm/include/llvm/CodeGen/DAGCombine.h

	//===-- llvm/CodeGen/DAGCombine.h ------- SelectionDAG Nodes ---- C++ --===//			//===-- llvm/CodeGen/DAGCombine.h ------- SelectionDAG Nodes ---- C++ --===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//

	#ifndef LLVM_CODEGEN_DAGCOMBINE_H			#ifndef LLVM_CODEGEN_DAGCOMBINE_H
	#define LLVM_CODEGEN_DAGCOMBINE_H			#define LLVM_CODEGEN_DAGCOMBINE_H

				#include "llvm/CodeGen/SelectionDAGNodes.h"

	namespace llvm {			namespace llvm {

	enum CombineLevel {			enum CombineLevel {
	BeforeLegalizeTypes,			BeforeLegalizeTypes,
	AfterLegalizeTypes,			AfterLegalizeTypes,
	AfterLegalizeVectorOps,			AfterLegalizeVectorOps,
	AfterLegalizeDAG			AfterLegalizeDAG
	};			};

				/// Represents known origin of an individual byte in combine pattern. The
				/// value of the byte is either constant zero, or comes from memory /
				/// some other productive instruction (e.g. arithmetic instructions).
				/// Bit manipulation instructions like shifts are not ByteProviders, rather
				/// are used to extract Bytes.
				struct ByteProvider {
				// For constant zero providers Src is set to nullptr. For actual providers
				// Stc represents the node which originally produced the relevant bits.
				// ByteOffset is the offset of the byte in the value produced by the load.
				SDNode *Src = nullptr;
				unsigned DestOffset = 0;
				unsigned SrcOffset = 0;

				ByteProvider() = default;

				static ByteProvider getSrc(SDNode *Load, unsigned ByteOffset,
				unsigned VectorOffset) {
				return ByteProvider(Load, ByteOffset, VectorOffset);
				}

				static ByteProvider getConstantZero() { return ByteProvider(nullptr, 0, 0); }
				bool isConstantZero() const { return !Src; }

				bool hasSrc() const { return Src; }

				bool hasSameSrc(const ByteProvider &Other) const { return Other.Src == Src; }

				bool operator==(const ByteProvider &Other) const {
				return Other.Src == Src && Other.DestOffset == DestOffset &&
				Other.SrcOffset == SrcOffset;
				}

				private:
				ByteProvider(SDNode *Src, unsigned DestOffset, unsigned SrcOffset)
				: Src(Src), DestOffset(DestOffset), SrcOffset(SrcOffset) {}
				};

	} // end llvm namespace			} // end llvm namespace

	#endif			#endif

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

Show First 20 Lines • Show All 7,773 Lines • ▼ Show 20 Lines	SDValue TryR =
MatchFunnelPosNeg(LHSShiftArg, RHSShiftArg, RHSShiftAmt, LHSShiftAmt,		MatchFunnelPosNeg(LHSShiftArg, RHSShiftArg, RHSShiftAmt, LHSShiftAmt,
RExtOp0, LExtOp0, HasFSHR, ISD::FSHR, ISD::FSHL, DL);		RExtOp0, LExtOp0, HasFSHR, ISD::FSHR, ISD::FSHL, DL);
if (TryR)		if (TryR)
return TryR;		return TryR;

return SDValue();		return SDValue();
}		}

namespace {

/// Represents known origin of an individual byte in load combine pattern. The
/// value of the byte is either constant zero or comes from memory.
struct ByteProvider {
// For constant zero providers Load is set to nullptr. For memory providers
// Load represents the node which loads the byte from memory.
// ByteOffset is the offset of the byte in the value produced by the load.
LoadSDNode *Load = nullptr;
unsigned ByteOffset = 0;
unsigned VectorOffset = 0;

ByteProvider() = default;

static ByteProvider getMemory(LoadSDNode *Load, unsigned ByteOffset,
unsigned VectorOffset) {
return ByteProvider(Load, ByteOffset, VectorOffset);
}

static ByteProvider getConstantZero() { return ByteProvider(nullptr, 0, 0); }

bool isConstantZero() const { return !Load; }
bool isMemory() const { return Load; }

bool operator==(const ByteProvider &Other) const {
return Other.Load == Load && Other.ByteOffset == ByteOffset &&
Other.VectorOffset == VectorOffset;
}

private:
ByteProvider(LoadSDNode *Load, unsigned ByteOffset, unsigned VectorOffset)
: Load(Load), ByteOffset(ByteOffset), VectorOffset(VectorOffset) {}
};

} // end anonymous namespace

/// Recursively traverses the expression calculating the origin of the requested		/// Recursively traverses the expression calculating the origin of the requested
/// byte of the given value. Returns None if the provider can't be calculated.		/// byte of the given value. Returns None if the provider can't be calculated.
///		///
/// For all the values except the root of the expression, we verify that the		/// For all the values except the root of the expression, we verify that the
/// value has exactly one use and if not then return None. This way if the		/// value has exactly one use and if not then return None. This way if the
/// origin of the byte is returned it's guaranteed that the values which		/// origin of the byte is returned it's guaranteed that the values which
/// contribute to the byte are not used outside of this expression.		/// contribute to the byte are not used outside of this expression.

▲ Show 20 Lines • Show All 152 Lines • ▼ Show 20 Lines	case ISD::LOAD: {
// and it is not a ZEXTLOAD, then the load does not provide for the byte in		// and it is not a ZEXTLOAD, then the load does not provide for the byte in
// question		// question
if (Index >= NarrowByteWidth)		if (Index >= NarrowByteWidth)
return L->getExtensionType() == ISD::ZEXTLOAD		return L->getExtensionType() == ISD::ZEXTLOAD
? Optional<ByteProvider>(ByteProvider::getConstantZero())		? Optional<ByteProvider>(ByteProvider::getConstantZero())
: None;		: None;

unsigned BPVectorIndex = VectorIndex.value_or(0U);		unsigned BPVectorIndex = VectorIndex.value_or(0U);
return ByteProvider::getMemory(L, Index, BPVectorIndex);		return ByteProvider::getSrc(L, Index, BPVectorIndex);
}		}
}		}

return None;		return None;
}		}

static unsigned littleEndianByteAt(unsigned BW, unsigned i) {		static unsigned littleEndianByteAt(unsigned BW, unsigned i) {
return i;		return i;
▲ Show 20 Lines • Show All 274 Lines • ▼ Show 20 Lines	SDValue DAGCombiner::MatchLoadCombine(SDNode *N) {
// Handles simple types only		// Handles simple types only
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
if (VT != MVT::i16 && VT != MVT::i32 && VT != MVT::i64)		if (VT != MVT::i16 && VT != MVT::i32 && VT != MVT::i64)
return SDValue();		return SDValue();
unsigned ByteWidth = VT.getSizeInBits() / 8;		unsigned ByteWidth = VT.getSizeInBits() / 8;

bool IsBigEndianTarget = DAG.getDataLayout().isBigEndian();		bool IsBigEndianTarget = DAG.getDataLayout().isBigEndian();
auto MemoryByteOffset = [&] (ByteProvider P) {		auto MemoryByteOffset = [&] (ByteProvider P) {
assert(P.isMemory() && "Must be a memory byte provider");		assert(P.hasSrc() && "Must be a memory byte provider");
unsigned LoadBitWidth = P.Load->getMemoryVT().getScalarSizeInBits();		LoadSDNode *Load = cast<LoadSDNode>(P.Src);
		assert(Load);

		unsigned LoadBitWidth = Load->getMemoryVT().getScalarSizeInBits();

assert(LoadBitWidth % 8 == 0 &&		assert(LoadBitWidth % 8 == 0 &&
"can only analyze providers for individual bytes not bit");		"can only analyze providers for individual bytes not bit");
unsigned LoadByteWidth = LoadBitWidth / 8;		unsigned LoadByteWidth = LoadBitWidth / 8;
return IsBigEndianTarget		return IsBigEndianTarget
? bigEndianByteAt(LoadByteWidth, P.ByteOffset)		? bigEndianByteAt(LoadByteWidth, P.DestOffset)
: littleEndianByteAt(LoadByteWidth, P.ByteOffset);		: littleEndianByteAt(LoadByteWidth, P.DestOffset);
};		};

Optional<BaseIndexOffset> Base;		Optional<BaseIndexOffset> Base;
SDValue Chain;		SDValue Chain;

SmallPtrSet<LoadSDNode *, 8> Loads;		SmallPtrSet<LoadSDNode *, 8> Loads;
Optional<ByteProvider> FirstByteProvider;		Optional<ByteProvider> FirstByteProvider;
int64_t FirstOffset = INT64_MAX;		int64_t FirstOffset = INT64_MAX;
Show All 10 Lines	for (int i = ByteWidth - 1; i >= 0; --i) {

if (P->isConstantZero()) {		if (P->isConstantZero()) {
// It's OK for the N most significant bytes to be 0, we can just		// It's OK for the N most significant bytes to be 0, we can just
// zero-extend the load.		// zero-extend the load.
if (++ZeroExtendedBytes != (ByteWidth - static_cast<unsigned>(i)))		if (++ZeroExtendedBytes != (ByteWidth - static_cast<unsigned>(i)))
return SDValue();		return SDValue();
continue;		continue;
}		}
assert(P->isMemory() && "provenance should either be memory or zero");		assert(P->hasSrc() && "provenance should either be memory or zero");

LoadSDNode *L = P->Load;		LoadSDNode *L = cast<LoadSDNode>(P->Src);
		assert(L);

// All loads must share the same chain		// All loads must share the same chain
SDValue LChain = L->getChain();		SDValue LChain = L->getChain();
if (!Chain)		if (!Chain)
Chain = LChain;		Chain = LChain;
else if (Chain != LChain)		else if (Chain != LChain)
return SDValue();		return SDValue();

// Loads must share the same base address		// Loads must share the same base address
BaseIndexOffset Ptr = BaseIndexOffset::match(L, DAG);		BaseIndexOffset Ptr = BaseIndexOffset::match(L, DAG);
int64_t ByteOffsetFromBase = 0;		int64_t ByteOffsetFromBase = 0;

// For vector loads, the expected load combine pattern will have an		// For vector loads, the expected load combine pattern will have an
// ExtractElement for each index in the vector. While each of these		// ExtractElement for each index in the vector. While each of these
// ExtractElements will be accessing the same base address as determined		// ExtractElements will be accessing the same base address as determined
// by the load instruction, the actual bytes they interact with will differ		// by the load instruction, the actual bytes they interact with will differ
// due to different ExtractElement indices. To accurately determine the		// due to different ExtractElement indices. To accurately determine the
// byte position of an ExtractElement, we offset the base load ptr with		// byte position of an ExtractElement, we offset the base load ptr with
// the index multiplied by the byte size of each element in the vector.		// the index multiplied by the byte size of each element in the vector.
if (L->getMemoryVT().isVector()) {		if (L->getMemoryVT().isVector()) {
unsigned LoadWidthInBit = L->getMemoryVT().getScalarSizeInBits();		unsigned LoadWidthInBit = L->getMemoryVT().getScalarSizeInBits();
if (LoadWidthInBit % 8 != 0)		if (LoadWidthInBit % 8 != 0)
return SDValue();		return SDValue();
unsigned ByteOffsetFromVector = P->VectorOffset * LoadWidthInBit / 8;		unsigned ByteOffsetFromVector = P->SrcOffset * LoadWidthInBit / 8;
Ptr.addToOffset(ByteOffsetFromVector);		Ptr.addToOffset(ByteOffsetFromVector);
}		}

if (!Base)		if (!Base)
Base = Ptr;		Base = Ptr;

else if (!Base->equalBaseIndex(Ptr, DAG, ByteOffsetFromBase))		else if (!Base->equalBaseIndex(Ptr, DAG, ByteOffsetFromBase))
return SDValue();		return SDValue();
Show All 40 Lines	if (!IsBigEndian)
return SDValue();		return SDValue();

assert(FirstByteProvider && "must be set");		assert(FirstByteProvider && "must be set");

// Ensure that the first byte is loaded from zero offset of the first load.		// Ensure that the first byte is loaded from zero offset of the first load.
// So the combined value can be loaded from the first load address.		// So the combined value can be loaded from the first load address.
if (MemoryByteOffset(*FirstByteProvider) != 0)		if (MemoryByteOffset(*FirstByteProvider) != 0)
return SDValue();		return SDValue();
LoadSDNode *FirstLoad = FirstByteProvider->Load;		LoadSDNode *FirstLoad = cast<LoadSDNode>(FirstByteProvider->Src);
		assert(FirstLoad);

// The node we are looking at matches with the pattern, check if we can		// The node we are looking at matches with the pattern, check if we can
// replace it with a single (possibly zero-extended) load and bswap + shift if		// replace it with a single (possibly zero-extended) load and bswap + shift if
// needed.		// needed.

// If the load needs byte swap check if the target supports it		// If the load needs byte swap check if the target supports it
bool NeedsBswap = IsBigEndianTarget != *IsBigEndian;		bool NeedsBswap = IsBigEndianTarget != *IsBigEndian;

▲ Show 20 Lines • Show All 16,983 Lines • Show Last 20 Lines

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

Show All 17 Lines
#include "SIMachineFunctionInfo.h"		#include "SIMachineFunctionInfo.h"
#include "SIRegisterInfo.h"		#include "SIRegisterInfo.h"
#include "llvm/ADT/FloatingPointMode.h"		#include "llvm/ADT/FloatingPointMode.h"
#include "llvm/ADT/Statistic.h"		#include "llvm/ADT/Statistic.h"
#include "llvm/Analysis/LegacyDivergenceAnalysis.h"		#include "llvm/Analysis/LegacyDivergenceAnalysis.h"
#include "llvm/Analysis/OptimizationRemarkEmitter.h"		#include "llvm/Analysis/OptimizationRemarkEmitter.h"
#include "llvm/BinaryFormat/ELF.h"		#include "llvm/BinaryFormat/ELF.h"
#include "llvm/CodeGen/Analysis.h"		#include "llvm/CodeGen/Analysis.h"
		#include "llvm/CodeGen/DAGCombine.h"
#include "llvm/CodeGen/FunctionLoweringInfo.h"		#include "llvm/CodeGen/FunctionLoweringInfo.h"
#include "llvm/CodeGen/GlobalISel/GISelKnownBits.h"		#include "llvm/CodeGen/GlobalISel/GISelKnownBits.h"
#include "llvm/CodeGen/GlobalISel/MIPatternMatch.h"		#include "llvm/CodeGen/GlobalISel/MIPatternMatch.h"
#include "llvm/CodeGen/MachineFrameInfo.h"		#include "llvm/CodeGen/MachineFrameInfo.h"
#include "llvm/CodeGen/MachineFunction.h"		#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineLoopInfo.h"		#include "llvm/CodeGen/MachineLoopInfo.h"
#include "llvm/IR/DiagnosticInfo.h"		#include "llvm/IR/DiagnosticInfo.h"
#include "llvm/IR/IntrinsicInst.h"		#include "llvm/IR/IntrinsicInst.h"
#include "llvm/IR/IntrinsicsAMDGPU.h"		#include "llvm/IR/IntrinsicsAMDGPU.h"
#include "llvm/IR/IntrinsicsR600.h"		#include "llvm/IR/IntrinsicsR600.h"
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"
#include "llvm/Support/KnownBits.h"		#include "llvm/Support/KnownBits.h"
		#include <optional>

using namespace llvm;		using namespace llvm;

#define DEBUG_TYPE "si-lower"		#define DEBUG_TYPE "si-lower"

STATISTIC(NumTailCalls, "Number of tail calls");		STATISTIC(NumTailCalls, "Number of tail calls");

static cl::opt<bool> DisableLoopAlignment(		static cl::opt<bool> DisableLoopAlignment(
▲ Show 20 Lines • Show All 9,483 Lines • ▼ Show 20 Lines

// Check if a node selects whole bytes from its operand 0 starting at a byte		// Check if a node selects whole bytes from its operand 0 starting at a byte
// boundary while masking the rest. Returns select mask as in the v_perm_b32		// boundary while masking the rest. Returns select mask as in the v_perm_b32
// or -1 if not succeeded.		// or -1 if not succeeded.
// Note byte select encoding:		// Note byte select encoding:
// value 0-3 selects corresponding source byte;		// value 0-3 selects corresponding source byte;
// value 0xc selects zero;		// value 0xc selects zero;
// value 0xff selects 0xff.		// value 0xff selects 0xff.
static uint32_t getPermuteMask(SelectionDAG &DAG, SDValue V) {		static uint32_t getPermuteMask(SDValue V) {
assert(V.getValueSizeInBits() == 32);		assert(V.getValueSizeInBits() == 32);

if (V.getNumOperands() != 2)		if (V.getNumOperands() != 2)
return ~0;		return ~0;

ConstantSDNode *N1 = dyn_cast<ConstantSDNode>(V.getOperand(1));		ConstantSDNode *N1 = dyn_cast<ConstantSDNode>(V.getOperand(1));
if (!N1)		if (!N1)
return ~0;		return ~0;

uint32_t C = N1->getZExtValue();		uint32_t C = N1->getZExtValue();

switch (V.getOpcode()) {		switch (V.getOpcode()) {
default:		default:
break;		break;
case ISD::AND:		case ISD::AND:
if (uint32_t ConstMask = getConstantPermuteMask(C)) {		if (uint32_t ConstMask = getConstantPermuteMask(C))
return (0x03020100 & ConstMask) | (0x0c0c0c0c & ~ConstMask);		return (0x03020100 & ConstMask) | (0x0c0c0c0c & ~ConstMask);
}
break;		break;

case ISD::OR:		case ISD::OR:
if (uint32_t ConstMask = getConstantPermuteMask(C)) {		if (uint32_t ConstMask = getConstantPermuteMask(C))
return (0x03020100 & ~ConstMask) | ConstMask;		return (0x03020100 & ~ConstMask) | ConstMask;
}
break;		break;

case ISD::SHL:		case ISD::SHL:
if (C % 8)		if (C % 8)
return ~0;		return ~0;

return uint32_t((0x030201000c0c0c0cull << C) >> 32);		return uint32_t((0x030201000c0c0c0cull << C) >> 32);

▲ Show 20 Lines • Show All 141 Lines • ▼ Show 20 Lines	if (isBoolSGPR(RHS.getOperand(0)))
return DAG.getSelect(SDLoc(N), MVT::i32, RHS.getOperand(0),		return DAG.getSelect(SDLoc(N), MVT::i32, RHS.getOperand(0),
LHS, DAG.getConstant(0, SDLoc(N), MVT::i32));		LHS, DAG.getConstant(0, SDLoc(N), MVT::i32));
}		}

// and (op x, c1), (op y, c2) -> perm x, y, permute_mask(c1, c2)		// and (op x, c1), (op y, c2) -> perm x, y, permute_mask(c1, c2)
const SIInstrInfo *TII = getSubtarget()->getInstrInfo();		const SIInstrInfo *TII = getSubtarget()->getInstrInfo();
if (VT == MVT::i32 && LHS.hasOneUse() && RHS.hasOneUse() &&		if (VT == MVT::i32 && LHS.hasOneUse() && RHS.hasOneUse() &&
N->isDivergent() && TII->pseudoToMCOpcode(AMDGPU::V_PERM_B32_e64) != -1) {		N->isDivergent() && TII->pseudoToMCOpcode(AMDGPU::V_PERM_B32_e64) != -1) {
uint32_t LHSMask = getPermuteMask(DAG, LHS);		uint32_t LHSMask = getPermuteMask(LHS);
uint32_t RHSMask = getPermuteMask(DAG, RHS);		uint32_t RHSMask = getPermuteMask(RHS);
if (LHSMask != ~0u && RHSMask != ~0u) {		if (LHSMask != ~0u && RHSMask != ~0u) {
// Canonicalize the expression in an attempt to have fewer unique masks		// Canonicalize the expression in an attempt to have fewer unique masks
// and therefore fewer registers used to hold the masks.		// and therefore fewer registers used to hold the masks.
if (LHSMask > RHSMask) {		if (LHSMask > RHSMask) {
std::swap(LHSMask, RHSMask);		std::swap(LHSMask, RHSMask);
std::swap(LHS, RHS);		std::swap(LHS, RHS);
}		}

// Select 0xc for each lane used from source operand. Zero has 0xc mask		// Select 0xc for each lane used from source operand. Zero has 0xc mask
// set, 0xff have 0xff in the mask, actual lanes are in the 0-3 range.		// set, 0xff have 0xff in the mask, actual lanes are in the 0-3 range.
uint32_t LHSUsedLanes = ~(LHSMask & 0x0c0c0c0c) & 0x0c0c0c0c;		uint32_t LHSUsedLanes = ~(LHSMask & 0x0c0c0c0c) & 0x0c0c0c0c;
uint32_t RHSUsedLanes = ~(RHSMask & 0x0c0c0c0c) & 0x0c0c0c0c;		uint32_t RHSUsedLanes = ~(RHSMask & 0x0c0c0c0c) & 0x0c0c0c0c;

// Check of we need to combine values from two sources within a byte.		// Check of we need to combine values from two sources within a byte.
if (!(LHSUsedLanes & RHSUsedLanes) &&		if (!(LHSUsedLanes & RHSUsedLanes) ||
// If we select high and lower word keep it for SDWA.		// If we select high and lower word keep it for SDWA.
// TODO: teach SDWA to work with v_perm_b32 and remove the check.		// TODO: teach SDWA to work with v_perm_b32 and remove the check.
!(LHSUsedLanes == 0x0c0c0000 && RHSUsedLanes == 0x00000c0c)) {		(LHSUsedLanes == 0x0c0c0000 && RHSUsedLanes == 0x00000c0c)) {
// Each byte in each mask is either selector mask 0-3, or has higher		// Each byte in each mask is either selector mask 0-3, or has higher
// bits set in either of masks, which can be 0xff for 0xff or 0x0c for		// bits set in either of masks, which can be 0xff for 0xff or 0x0c for
// zero. If 0x0c is in either mask it shall always be 0x0c. Otherwise		// zero. If 0x0c is in either mask it shall always be 0x0c. Otherwise
// mask which is not 0xff wins. By anding both masks we have a correct		// mask which is not 0xff wins. By anding both masks we have a correct
// result except that 0x0c shall be corrected to give 0x0c only.		// result except that 0x0c shall be corrected to give 0x0c only.
uint32_t Mask = LHSMask & RHSMask;		uint32_t Mask = LHSMask & RHSMask;
for (unsigned I = 0; I < 32; I += 8) {		for (unsigned I = 0; I < 32; I += 8) {
uint32_t ByteSel = 0xff << I;		uint32_t ByteSel = 0xff << I;
[... 11 lines not shown ...]	if (LHSMask != ~0u && RHSMask != ~0u) {
DAG.getConstant(Sel, DL, MVT::i32));		DAG.getConstant(Sel, DL, MVT::i32));
}		}
}		}
}		}

return SDValue();		return SDValue();
}		}
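The mask arithmetic in the hunks above can be tried in isolation. The following is a minimal sketch, not code from the patch: `shlPermuteMask` and `usedLanes` are hypothetical names, and the `ShiftBits >= 32` guard is an added assumption to keep the shift well defined. It copies the `ISD::SHL` expression and the `LHSUsedLanes`/`RHSUsedLanes` expression from the diff, using the convention stated in the comments there (selector bytes 0-3 pick a source byte, 0x0c means zero, 0xff means 0xff).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper copying the ISD::SHL case from the diff: per-byte
// selector for (x << ShiftBits), or ~0u if the shift is not byte aligned.
// The ">= 32" guard is an assumption added for this sketch.
uint32_t shlPermuteMask(unsigned ShiftBits) {
  if (ShiftBits % 8 != 0 || ShiftBits >= 32)
    return ~0u;
  // Low half is "zero" selectors (0x0c); shifting left moves them into
  // the result while the identity selectors 0..3 move up.
  return uint32_t((0x030201000c0c0c0cull << ShiftBits) >> 32);
}

// Hypothetical helper copying the used-lanes expression from the diff:
// yields 0x0c in every byte whose selector is in the 0-3 range, and 0x00
// in every byte whose selector is 0x0c (zero) or 0xff.
uint32_t usedLanes(uint32_t Mask) {
  return ~(Mask & 0x0c0c0c0c) & 0x0c0c0c0c;
}
```

For a shift by 8, the sketch gives the selector 0x0201000c (byte 0 is zero, bytes 1-3 come from source bytes 0-2), and `usedLanes` marks bytes 1-3 as used.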

		// A key component of v_perm is a mapping between byte position of the src
		// operands, and the byte position of the dest. To provide such, we need: 1. the
		// node that provides x byte of the dest of the OR, and 2. the byte of the node
		arsenm: Can you keep this as a generic utility?
		arsenm: By generic utility I mean in generic code and used by the load combine as well
		jrbyrnes: Hey Matt -- thanks for comments. I think I don't fully understand this one -- I guess you didn't mean https://reviews.llvm.org/D143018? By generic, do you mean templated base class (perhaps in ADT)?
		arsenm: I didn't see that, I mean share code with the MatchLoadCombine that you mentioned. I didn't think about the details of that (something more abstract might be good since the same thing will need to be ported for GlobalISel)
		// used to provide that x byte. calculateByteProvider finds which node provides
		// a certain byte of the dest of the OR, and calculateSrcByte takes that node,
		// and finds an ultimate src and byte position For example: The supported
		// LoadCombine pattern for vector loads is as follows
		//                     t1
		//                     or
		//                /          \
		//            t2                 t3
		//           zext                shl
		//            |                  |   \
		//            t4                 t5    16
		//            or                anyext
		//         /      \              |
		//      t6          t7           t8
		//      srl         shl          or
		//    /     |     /    \       /    \
		//   t9    t10   t11   t12   t13     t14
		// trunc*   8  trunc*   8    and     and
		//   |           |          /  |     |  \
		//  t15         t16      t17  t18   t19  t20
		//                    trunc*  255   srl  -256
		//                       |         /   \
		//                      t15      t15    16
		//
		// *In this example, the truncs are from i32->i16
		//
		// calculateByteProvider would find t6, t7, t13, and t14 for bytes 0-3
		// respectively. calculateSrcByte would find (given node) -> ultimate src &
		// byteposition: t6 -> t15 & 1, t7 -> t16 & 0, t13 -> t15 & 0, t14 -> t15 & 3.
		// After finding the mapping, we can combine the tree into vperm t15, t16,
		// 0x05000407

		// Find the source and byte position from a node.
		// \p DestByte is the byte position of the dest of the or that the src
		// ultimately provides. \p SrcIndex is the byte of the src that maps to this
		// dest of the or byte. \p Depth tracks how many recursive iterations we have
		// performed.
		static const std::optional<ByteProvider> calculateSrcByte(const SDValue *Op,
		uint64_t DestByte,
		arsenm: No reference
		uint64_t SrcIndex = 0,
		unsigned Depth = 0) {
		// We may need to recursively traverse a series of SRLs
		if (Depth >= 5)
		arsenm: 6 is the one true recursion depth limit
		return std::nullopt;

		switch (Op->getOpcode()) {
		case ISD::TRUNCATE: {
		if (Op->getOperand(0).getScalarValueSizeInBits() != 32)
		return std::nullopt;
		return calculateSrcByte(&Op->getOperand(0), DestByte, SrcIndex, Depth + 1);
		}

		case ISD::SRL: {
		auto ShiftOp = dyn_cast<ConstantSDNode>(Op->getOperand(1));
		if (!ShiftOp)
		return std::nullopt;

		uint64_t BitShift = ShiftOp->getZExtValue();

		if (BitShift % 8 != 0)
		return std::nullopt;

		SrcIndex += BitShift / 8;

		return calculateSrcByte(&Op->getOperand(0), DestByte, SrcIndex, Depth + 1);
		}

		default: {
		if (Op->getScalarValueSizeInBits() != 32)
		return std::nullopt;

		return ByteProvider::getSrc(Op->getNode(), DestByte, SrcIndex);
		}
		}
		llvm_unreachable("fully handled switch");
		}
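The index tracking in calculateSrcByte can be illustrated outside of SelectionDAG. The sketch below is a toy model under stated assumptions: `Node`, `Kind`, `SrcByte` and `findSrcByte` are hypothetical stand-ins for `SDValue`, the ISD opcodes and `ByteProvider`, and the 32-bit width checks from the diff are omitted. It only shows how each byte-aligned SRL adds `BitShift / 8` to the tracked source index while truncates pass the index through.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Toy expression node (hypothetical; stands in for SDValue in this sketch).
enum class Kind { Leaf32, Trunc, Srl };

struct Node {
  Kind K;
  const Node *Operand = nullptr; // unused for Leaf32
  unsigned ShiftBits = 0;        // only meaningful for Srl
};

struct SrcByte {
  const Node *Src; // ultimate source node
  unsigned Byte;   // byte of Src that feeds the requested byte
};

// Walk through truncates and byte-aligned right shifts, accumulating the
// byte offset. The depth limit of 5 mirrors the value in this revision.
std::optional<SrcByte> findSrcByte(const Node *N, unsigned SrcIndex = 0,
                                   unsigned Depth = 0) {
  if (Depth >= 5)
    return std::nullopt;
  switch (N->K) {
  case Kind::Trunc:
    return findSrcByte(N->Operand, SrcIndex, Depth + 1);
  case Kind::Srl:
    if (N->ShiftBits % 8 != 0)
      return std::nullopt;
    return findSrcByte(N->Operand, SrcIndex + N->ShiftBits / 8, Depth + 1);
  case Kind::Leaf32:
    return SrcByte{N, SrcIndex};
  }
  return std::nullopt;
}
```

Starting at byte 1 of `srl t15, 16`, the walk ends at byte 3 of `t15`, which matches the `t14 -> t15 & 3` entry in the comment above.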

		// For a byte position in the result of an Or, traverse the tree and find the
		// node (and the byte of the node) which ultimately provides this {Or,
		// BytePosition}. \p Op is the operand we are currently examining. \p Index is
		// the byte position of the Op that corresponds with the originally requested
		// byte of the Or \p Depth tracks how many recursive iterations we have
		// performed. \p StartingIndex is the originally requested byte of the Or
		static const std::optional<ByteProvider>
		calculateByteProvider(SDValue Op, unsigned Index, unsigned Depth,
		unsigned StartingIndex = 0) {
		// Finding Src tree of RHS of or typically requires at least 1 additional
		// depth
		arsenm: Why 8? Usually 6 is the one true recursion depth limit
		jrbyrnes: Based on testing, I can lower the depth, but we must accept depth 6 since this is the max tree depth across build vectors that should be lowered into v_perm. The relevant test is already included in permute_i8.ll. This depth may need to change in a future iteration.
		if (Depth >= 8)
		return std::nullopt;

		unsigned BitWidth = Op.getScalarValueSizeInBits();
		if (BitWidth % 8 != 0)
		return std::nullopt;
		assert(Index < BitWidth / 8 && "invalid index requested");

		switch (Op.getOpcode()) {
		case ISD::OR: {
		auto LHS = calculateByteProvider(Op->getOperand(0), Index, Depth + 1,
		StartingIndex);
		auto RHS = calculateByteProvider(Op->getOperand(1), Index, Depth + 1,
		arsenm: Do the RHS calls first and short circuit the second call if the first failed
		StartingIndex);
		// A well formed Or will only have nonzero bytes for one operand
		if (LHS && RHS && !LHS->isConstantZero() && !RHS->isConstantZero())
		return std::nullopt;
		if (!LHS || LHS->isConstantZero())
		return RHS;
		if (!RHS || RHS->isConstantZero())
		return LHS;
		return std::nullopt;
		}

		case ISD::AND: {
		auto BitMaskOp = dyn_cast<ConstantSDNode>(Op->getOperand(1));
		if (!BitMaskOp)
		arsenm: Needs to move to std::optional
		return std::nullopt;

		uint32_t BitMask = BitMaskOp->getZExtValue();
		// Bits we expect for our StartingIndex
		uint32_t IndexMask = 0xFF << (Index * 8);

		if ((IndexMask & BitMask) != IndexMask) {
		// If the result of the and partially provides the byte, then it
		// is not well formatted
		if (IndexMask & BitMask)
		return std::nullopt;
		return ByteProvider::getConstantZero();
		}

		return calculateSrcByte(&Op->getOperand(0), StartingIndex, Index);
		}

		case ISD::SRL: {
		auto ShiftOp = dyn_cast<ConstantSDNode>(Op->getOperand(1));
		if (!ShiftOp)
		return std::nullopt;

		uint64_t BitShift = ShiftOp->getZExtValue();
		if (BitShift % 8)
		return std::nullopt;

		auto BitsProvided = Op.getScalarValueSizeInBits();
		if (BitsProvided % 8 != 0)
		return std::nullopt;

		uint64_t BytesProvided = BitsProvided / 8;
		uint64_t ByteShift = BitShift / 8;
		// The dest of shift will have good [0 : (BytesProvided - ByteShift)] bytes.
		// If the byte we are trying to provide (as tracked by index) falls in this
		// range, then the SRL provides the byte. The byte of interest of the src of
		// the SRL is Index + ByteShift
		return BytesProvided - ByteShift > Index
		? calculateSrcByte(&Op->getOperand(0), StartingIndex,
		Index + ByteShift)
		: ByteProvider::getConstantZero();
		}

		case ISD::SHL: {
		auto ShiftOp = dyn_cast<ConstantSDNode>(Op->getOperand(1));
		if (!ShiftOp)
		return std::nullopt;

		uint64_t BitShift = ShiftOp->getZExtValue();
		if (BitShift % 8 != 0)
		return std::nullopt;
		uint64_t ByteShift = BitShift / 8;

		// If we are shifting by an amount greater than (or equal to)
		// the index we are trying to provide, then it provides 0s. If not,
		// then this bytes are not definitively 0s, and the corresponding byte
		// of interest is Index - ByteShift of the src
		return Index < ByteShift
		? ByteProvider::getConstantZero()
		: calculateByteProvider(Op->getOperand(0), Index - ByteShift,
		Depth + 1, StartingIndex);
		}
		case ISD::ANY_EXTEND:
		case ISD::SIGN_EXTEND:
		case ISD::ZERO_EXTEND: {
		SDValue NarrowOp = Op->getOperand(0);
		unsigned NarrowBitWidth = NarrowOp.getScalarValueSizeInBits();
		if (NarrowBitWidth % 8 != 0)
		return std::nullopt;
		uint64_t NarrowByteWidth = NarrowBitWidth / 8;

		if (Index >= NarrowByteWidth)
		return Op.getOpcode() == ISD::ZERO_EXTEND
		arsenm: Do you really need the explicit std::optional?
		jrbyrnes: Yes, it cannot infer std::optional, even after changing order.
		? std::optional<ByteProvider>(ByteProvider::getConstantZero())
		: std::nullopt;
		return calculateByteProvider(NarrowOp, Index, Depth + 1, StartingIndex);
		}

		case ISD::TRUNCATE: {
		unsigned NarrowBitWidth = Op.getScalarValueSizeInBits();
		if (NarrowBitWidth % 8 != 0)
		return std::nullopt;
		uint64_t NarrowByteWidth = NarrowBitWidth / 8;

		if (NarrowByteWidth >= Index) {
		return calculateByteProvider(Op->getOperand(0), Index, Depth + 1,
		StartingIndex);
		}

		return std::nullopt;
		}

		case ISD::LOAD: {
		auto L = cast<LoadSDNode>(Op.getNode());
		unsigned NarrowBitWidth = L->getMemoryVT().getSizeInBits();
		if (NarrowBitWidth % 8 != 0)
		return std::nullopt;
		uint64_t NarrowByteWidth = NarrowBitWidth / 8;

		// If the width of the load does not reach byte we are trying to provide for
		// and it is not a ZEXTLOAD, then the load does not provide for the byte in
		// question
		if (Index >= NarrowByteWidth)
		return L->getExtensionType() == ISD::ZEXTLOAD
		? std::optional<ByteProvider>(ByteProvider::getConstantZero())
		: std::nullopt;

		if (NarrowByteWidth > Index) {
		arsenm: Braces
		return calculateSrcByte(const_cast<const SDValue *>(&Op), StartingIndex,
		Index);
		}

		return std::nullopt;
		}

		default: {
		return std::nullopt;
		}
		}

		llvm_unreachable("fully handled switch");
		}
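The per-opcode byte rules in calculateByteProvider reduce to small index calculations. The sketch below restates the SHL, SRL and AND cases as standalone functions; `ByteStep`, `ByteKind` and the `stepThrough*` names are hypothetical, and the recursion into operands is left out. The SRL bound is written as `Index + ByteShift >= Bytes`, which is an assumption of this sketch chosen to avoid unsigned underflow; for shifts no larger than the value width it agrees with the `BytesProvided - ByteShift > Index` test in the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

enum class ByteKind { FromOperand, ConstantZero };

// Result of looking through one node: either the byte is a known zero, or
// it comes from byte OperandByte of the node's first operand.
struct ByteStep {
  ByteKind Kind;
  unsigned OperandByte;
};

// shl x, ShiftBits: low bytes become zero, others come from Index - ByteShift.
std::optional<ByteStep> stepThroughShl(unsigned Index, unsigned ShiftBits) {
  if (ShiftBits % 8 != 0)
    return std::nullopt;
  unsigned ByteShift = ShiftBits / 8;
  if (Index < ByteShift)
    return ByteStep{ByteKind::ConstantZero, 0};
  return ByteStep{ByteKind::FromOperand, Index - ByteShift};
}

// srl x, ShiftBits on a WidthBits-wide value: bytes shifted in are zero,
// others come from Index + ByteShift.
std::optional<ByteStep> stepThroughSrl(unsigned Index, unsigned ShiftBits,
                                       unsigned WidthBits) {
  if (ShiftBits % 8 != 0 || WidthBits % 8 != 0)
    return std::nullopt;
  unsigned ByteShift = ShiftBits / 8;
  if (Index + ByteShift >= WidthBits / 8)
    return ByteStep{ByteKind::ConstantZero, 0};
  return ByteStep{ByteKind::FromOperand, Index + ByteShift};
}

// and x, BitMask: the byte passes through if fully kept, is zero if fully
// cleared, and is rejected if only partially kept. Index must be < 4.
std::optional<ByteStep> stepThroughAnd(unsigned Index, uint32_t BitMask) {
  uint32_t IndexMask = 0xFFu << (Index * 8);
  if ((BitMask & IndexMask) == IndexMask)
    return ByteStep{ByteKind::FromOperand, Index};
  if (BitMask & IndexMask)
    return std::nullopt;
  return ByteStep{ByteKind::ConstantZero, 0};
}
```

With these rules, byte 3 of `shl Src, 16` maps to byte 1 of `Src`, which is the example given in the revision summary (the 4th byte of the dest maps to the 2nd byte of Src).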

SDValue SITargetLowering::performOrCombine(SDNode *N,		SDValue SITargetLowering::performOrCombine(SDNode *N,
DAGCombinerInfo &DCI) const {		DAGCombinerInfo &DCI) const {
SelectionDAG &DAG = DCI.DAG;		SelectionDAG &DAG = DCI.DAG;
SDValue LHS = N->getOperand(0);		SDValue LHS = N->getOperand(0);
SDValue RHS = N->getOperand(1);		SDValue RHS = N->getOperand(1);

EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
if (VT == MVT::i1) {		if (VT == MVT::i1) {
// or (fp_class x, c1), (fp_class x, c2) -> fp_class x, (c1 | c2)		// or (fp_class x, c1), (fp_class x, c2) -> fp_class x, (c1 | c2)
if (LHS.getOpcode() == AMDGPUISD::FP_CLASS &&		if (LHS.getOpcode() == AMDGPUISD::FP_CLASS &&
RHS.getOpcode() == AMDGPUISD::FP_CLASS) {		RHS.getOpcode() == AMDGPUISD::FP_CLASS) {
SDValue Src = LHS.getOperand(0);		SDValue Src = LHS.getOperand(0);
if (Src != RHS.getOperand(0))		if (Src != RHS.getOperand(0))
return SDValue();		return SDValue();

const ConstantSDNode *CLHS = dyn_cast<ConstantSDNode>(LHS.getOperand(1));		const ConstantSDNode *CLHS = dyn_cast<ConstantSDNode>(LHS.getOperand(1));
const ConstantSDNode *CRHS = dyn_cast<ConstantSDNode>(RHS.getOperand(1));		const ConstantSDNode *CRHS = dyn_cast<ConstantSDNode>(RHS.getOperand(1));
if (!CLHS || !CRHS)		if (!CLHS || !CRHS)
return SDValue();		return SDValue();
		arsenm: Need to make sure this is scalar

// Only 10 bits are used.		// Only 10 bits are used.
static const uint32_t MaxMask = 0x3ff;		static const uint32_t MaxMask = 0x3ff;

uint32_t NewMask = (CLHS->getZExtValue() | CRHS->getZExtValue()) & MaxMask;		uint32_t NewMask = (CLHS->getZExtValue() | CRHS->getZExtValue()) & MaxMask;
SDLoc DL(N);		SDLoc DL(N);
return DAG.getNode(AMDGPUISD::FP_CLASS, DL, MVT::i1,		return DAG.getNode(AMDGPUISD::FP_CLASS, DL, MVT::i1,
		arsenm: Need to make sure this is scalar
		arsenm: This isn't checking for a scalar type?
Src, DAG.getConstant(NewMask, DL, MVT::i32));		Src, DAG.getConstant(NewMask, DL, MVT::i32));
}		}

return SDValue();		return SDValue();
}		}

// or (perm x, y, c1), c2 -> perm x, y, permute_mask(c1, c2)		// or (perm x, y, c1), c2 -> perm x, y, permute_mask(c1, c2)
if (isa<ConstantSDNode>(RHS) && LHS.hasOneUse() &&		if (isa<ConstantSDNode>(RHS) && LHS.hasOneUse() &&
LHS.getOpcode() == AMDGPUISD::PERM &&		LHS.getOpcode() == AMDGPUISD::PERM &&
isa<ConstantSDNode>(LHS.getOperand(2))) {		isa<ConstantSDNode>(LHS.getOperand(2))) {
uint32_t Sel = getConstantPermuteMask(N->getConstantOperandVal(1));		uint32_t Sel = getConstantPermuteMask(N->getConstantOperandVal(1));
if (!Sel)		if (!Sel)
return SDValue();		return SDValue();

		arsenm: Can this be a separate function? It's a lambda that doesn't capture anything
Sel |= LHS.getConstantOperandVal(2);		Sel |= LHS.getConstantOperandVal(2);
SDLoc DL(N);		SDLoc DL(N);
return DAG.getNode(AMDGPUISD::PERM, DL, MVT::i32, LHS.getOperand(0),		return DAG.getNode(AMDGPUISD::PERM, DL, MVT::i32, LHS.getOperand(0),
LHS.getOperand(1), DAG.getConstant(Sel, DL, MVT::i32));		LHS.getOperand(1), DAG.getConstant(Sel, DL, MVT::i32));
}		}

// or (op x, c1), (op y, c2) -> perm x, y, permute_mask(c1, c2)		// or (op x, c1), (op y, c2) -> perm x, y, permute_mask(c1, c2)
const SIInstrInfo *TII = getSubtarget()->getInstrInfo();		const SIInstrInfo *TII = getSubtarget()->getInstrInfo();
if (VT == MVT::i32 && LHS.hasOneUse() && RHS.hasOneUse() &&		if (VT == MVT::i32 && LHS.hasOneUse() && RHS.hasOneUse() &&
N->isDivergent() && TII->pseudoToMCOpcode(AMDGPU::V_PERM_B32_e64) != -1) {		N->isDivergent() && TII->pseudoToMCOpcode(AMDGPU::V_PERM_B32_e64) != -1) {
uint32_t LHSMask = getPermuteMask(DAG, LHS);
uint32_t RHSMask = getPermuteMask(DAG, RHS);		// If the users of the or are a BytewiseOp, then the result of a combnine will
		// be extracted. We should simply not combine.
		SmallVector<unsigned, 4> BytewiseOps = {ISD::SINT_TO_FP, ISD::UINT_TO_FP};
		arsenm: Don't need a small vector, can just directly iterate an initializer list?

		bool IsCombineExtracted = false;
		for (auto OrUse : N->uses()) {
		arsenm: Is the multiple use case covered in the tests?
		// Only special case bitcast to vectors
		if (OrUse->getOpcode() != ISD::BITCAST || !OrUse->getValueType(0).isVector()) {
		continue;
		}

		if (OrUse->hasOneUse())
		arsenm: llvm::any_of?
		if (OrUse->use_begin()->getOpcode() == ISD::ZERO_EXTEND)
		OrUse = *OrUse->use_begin();

		for (auto VUse : OrUse->uses()) {
		for (auto BytewiseOp : BytewiseOps)
		arsenm: Doesn't feel like this is the right way to express it. Should handle CopyToRegs at least too
		if (VUse->getOpcode() == BytewiseOp) {
		IsCombineExtracted = true;
		break;
		}
		}
		}

		arsenm: Don't need llvm::
		if (IsCombineExtracted) return SDValue();

		uint32_t LHSMask = getPermuteMask(LHS);
		uint32_t RHSMask = getPermuteMask(RHS);

if (LHSMask != ~0u && RHSMask != ~0u) {		if (LHSMask != ~0u && RHSMask != ~0u) {
// Canonicalize the expression in an attempt to have fewer unique masks		// Canonicalize the expression in an attempt to have fewer unique masks
// and therefore fewer registers used to hold the masks.		// and therefore fewer registers used to hold the masks.
if (LHSMask > RHSMask) {		if (LHSMask > RHSMask) {
std::swap(LHSMask, RHSMask);		std::swap(LHSMask, RHSMask);
std::swap(LHS, RHS);		std::swap(LHS, RHS);
}		}

[... 10 lines not shown ...]	if (LHSMask != ~0u && RHSMask != ~0u) {
// Kill zero bytes selected by other mask. Zero value is 0xc.		// Kill zero bytes selected by other mask. Zero value is 0xc.
LHSMask &= ~RHSUsedLanes;		LHSMask &= ~RHSUsedLanes;
RHSMask &= ~LHSUsedLanes;		RHSMask &= ~LHSUsedLanes;
// Add 4 to each active LHS lane		// Add 4 to each active LHS lane
LHSMask |= LHSUsedLanes & 0x04040404;		LHSMask |= LHSUsedLanes & 0x04040404;
// Combine masks		// Combine masks
uint32_t Sel = LHSMask | RHSMask;		uint32_t Sel = LHSMask | RHSMask;
SDLoc DL(N);		SDLoc DL(N);

return DAG.getNode(AMDGPUISD::PERM, DL, MVT::i32,		return DAG.getNode(AMDGPUISD::PERM, DL, MVT::i32,
LHS.getOperand(0), RHS.getOperand(0),		LHS.getOperand(0), RHS.getOperand(0),
DAG.getConstant(Sel, DL, MVT::i32));		DAG.getConstant(Sel, DL, MVT::i32));
}		}
}		}
		if (LHSMask == ~0u || RHSMask == ~0u) {
		SmallVector<ByteProvider, 8> PermNodes;

		// VT is known to be MVT::i32, so we need to provide 4 bytes.
		assert(VT == MVT::i32);
		for (int i = 0; i < 4; i++) {
		// Find the ByteProvider that provides the ith byte of the result of OR
		std::optional<ByteProvider> P =
		calculateByteProvider(SDValue(N, 0), i, 0, /*StartingIndex*/ i);
		arsenm: StartingIndex=
		// TODO support constantZero
		if (!P.has_value() || P->isConstantZero())
		arsenm: !P
		return SDValue();

		PermNodes.push_back(*P);
		}
		if (PermNodes.size() != 4)
		return SDValue();

		int FirstSrc = 0;
		int SecondSrc = -1;
		arsenm: Don't these default initialize to nullopt?
		uint64_t permMask = 0x00000000;
		for (size_t i = 0; i < PermNodes.size(); i++) {
		auto PermOp = PermNodes[i];
		// Since the mask is applied to Src1:Src2, Src1 bytes must be offset
		// by sizeof(Src2) = 4
		int SrcByteAdjust = 4;

		if (!PermOp.hasSameSrc(PermNodes[FirstSrc])) {
		if (SecondSrc != -1) {
		if (!PermOp.hasSameSrc(PermNodes[SecondSrc])) {
		return SDValue();
		}
		}
		// Set the index of the second distinct Src node
		SecondSrc = i;
		assert(PermNodes[SecondSrc].Src->getValueType(0).getSizeInBits() ==
		32);
		SrcByteAdjust = 0;
		}
		assert(PermOp.SrcOffset + SrcByteAdjust < 8);
		// 0th PermNode is MSB in PermMask
		permMask \|= (PermOp.SrcOffset + SrcByteAdjust) << (24 - (i * 8));
		}

		SDLoc DL(N);

		return DAG.getNode(
		AMDGPUISD::PERM, DL, MVT::i32,
		SDValue(const_cast<SDNode *>(PermNodes[FirstSrc].Src), 0),
		arsenm: Don't understand why you need the const_cast. Also, why isn't this an SDValue to begin with? Assuming output 0 can be risky
		SecondSrc == -1
		? DAG.getConstant(0, DL, MVT::i32)
		: SDValue(const_cast<SDNode *>(PermNodes[SecondSrc].Src), 0),
		DAG.getConstant(permMask, DL, MVT::i32));
		}
}		}
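The selector construction in the loop above can be reproduced on plain data. This is a sketch under stated assumptions, not the patch itself: `Provider` and `buildPermMask` are hypothetical, sources are identified by an integer id instead of an `SDNode *`, and `SrcOffset` is assumed to be below 4. The byte layout follows the "0th PermNode is MSB" comment in this revision of the diff; the sketch makes no claim about how the hardware interprets the selector.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for ByteProvider: which source, and which byte of it.
struct Provider {
  int SrcId;          // identifies the source node (assumption: an int id)
  unsigned SrcOffset; // byte within that source, assumed < 4
};

// Build the selector as in this revision's loop: bytes of the first source
// are offset by 4, bytes of the second source are used as-is, and the 0th
// provider lands in the most significant selector byte. More than two
// distinct sources cannot be expressed, so the result is empty.
std::optional<uint32_t> buildPermMask(const std::array<Provider, 4> &P) {
  int FirstSrc = 0;
  int SecondSrc = -1;
  uint32_t Mask = 0;
  for (int I = 0; I < 4; ++I) {
    unsigned SrcByteAdjust = 4;
    if (P[I].SrcId != P[FirstSrc].SrcId) {
      if (SecondSrc != -1 && P[I].SrcId != P[SecondSrc].SrcId)
        return std::nullopt;
      SecondSrc = I;
      SrcByteAdjust = 0;
    }
    Mask |= (P[I].SrcOffset + SrcByteAdjust) << (24 - I * 8);
  }
  return Mask;
}
```

Feeding in the mapping from the comment block earlier in the file (t15 byte 1, t16 byte 0, t15 byte 0, t15 byte 3) yields 0x05000407, the selector quoted there.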

if (VT != MVT::i64 || DCI.isBeforeLegalizeOps())		if (VT != MVT::i64 || DCI.isBeforeLegalizeOps())
return SDValue();		return SDValue();

// TODO: This could be a generic combine with a predicate for extracting the		// TODO: This could be a generic combine with a predicate for extracting the
// high half of an integer being free.		// high half of an integer being free.

[... 3,209 lines not shown ...]

llvm/test/CodeGen/AMDGPU/combine-vload-extract.ll

[... 32 lines not shown ...]	entry:
store i32 %insert3, i32* %out		store i32 %insert3, i32* %out
ret void		ret void
}		}

define amdgpu_kernel void @vectorLoadShuffle(<4 x i8>* %in, i32* %out) {		define amdgpu_kernel void @vectorLoadShuffle(<4 x i8>* %in, i32* %out) {
; GCN-LABEL: vectorLoadShuffle:		; GCN-LABEL: vectorLoadShuffle:
; GCN: ; %bb.0: ; %entry		; GCN: ; %bb.0: ; %entry
; GCN-NEXT: s_load_dwordx4 s[0:3], s[0:1], 0x24		; GCN-NEXT: s_load_dwordx4 s[0:3], s[0:1], 0x24
		; GCN-NEXT: v_mov_b32_e32 v3, 0x4060507
; GCN-NEXT: s_waitcnt lgkmcnt(0)		; GCN-NEXT: s_waitcnt lgkmcnt(0)
; GCN-NEXT: v_mov_b32_e32 v0, s0		; GCN-NEXT: v_mov_b32_e32 v0, s0
; GCN-NEXT: v_mov_b32_e32 v1, s1		; GCN-NEXT: v_mov_b32_e32 v1, s1
; GCN-NEXT: flat_load_dword v2, v[0:1]		; GCN-NEXT: flat_load_dword v2, v[0:1]
; GCN-NEXT: s_mov_b32 s0, 0x6050400
; GCN-NEXT: v_mov_b32_e32 v0, s2		; GCN-NEXT: v_mov_b32_e32 v0, s2
; GCN-NEXT: v_mov_b32_e32 v1, s3		; GCN-NEXT: v_mov_b32_e32 v1, s3
; GCN-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GCN-NEXT: v_bfe_u32 v3, v2, 16, 8		; GCN-NEXT: v_perm_b32 v2, v2, 0, v3
; GCN-NEXT: v_lshlrev_b32_e32 v4, 8, v2
; GCN-NEXT: v_perm_b32 v3, v3, v2, s0
; GCN-NEXT: v_and_b32_e32 v4, 0xff0000, v4
; GCN-NEXT: v_and_b32_e32 v2, 0xff000000, v2
; GCN-NEXT: v_or3_b32 v2, v3, v4, v2
; GCN-NEXT: flat_store_dword v[0:1], v2		; GCN-NEXT: flat_store_dword v[0:1], v2
; GCN-NEXT: s_endpgm		; GCN-NEXT: s_endpgm
entry:		entry:
%0 = load <4 x i8>, <4 x i8>* %in, align 4		%0 = load <4 x i8>, <4 x i8>* %in, align 4
%1 = extractelement <4 x i8> %0, i32 0		%1 = extractelement <4 x i8> %0, i32 0
%2 = extractelement <4 x i8> %0, i32 1		%2 = extractelement <4 x i8> %0, i32 1
%3 = extractelement <4 x i8> %0, i32 2		%3 = extractelement <4 x i8> %0, i32 2
%4 = extractelement <4 x i8> %0, i32 3		%4 = extractelement <4 x i8> %0, i32 3
[... 28 lines not shown ...]
}		}

define i32 @load_2xi16_noncombine(i16 addrspace(1)* %p) #0 {		define i32 @load_2xi16_noncombine(i16 addrspace(1)* %p) #0 {
; GCN-LABEL: load_2xi16_noncombine:		; GCN-LABEL: load_2xi16_noncombine:
; GCN: ; %bb.0:		; GCN: ; %bb.0:
; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN-NEXT: global_load_ushort v2, v[0:1], off		; GCN-NEXT: global_load_ushort v2, v[0:1], off
; GCN-NEXT: global_load_ushort v3, v[0:1], off offset:4		; GCN-NEXT: global_load_ushort v3, v[0:1], off offset:4
		; GCN-NEXT: s_mov_b32 s4, 0x4050001
; GCN-NEXT: s_waitcnt vmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_lshl_or_b32 v0, v3, 16, v2		; GCN-NEXT: v_perm_b32 v0, v2, v3, s4
; GCN-NEXT: s_setpc_b64 s[30:31]		; GCN-NEXT: s_setpc_b64 s[30:31]
%gep.p = getelementptr i16, i16 addrspace(1)* %p, i32 2		%gep.p = getelementptr i16, i16 addrspace(1)* %p, i32 2
%p.0 = load i16, i16 addrspace(1)* %p, align 4		%p.0 = load i16, i16 addrspace(1)* %p, align 4
%p.1 = load i16, i16 addrspace(1)* %gep.p, align 4		%p.1 = load i16, i16 addrspace(1)* %gep.p, align 4
%zext.0 = zext i16 %p.0 to i32		%zext.0 = zext i16 %p.0 to i32
%zext.1 = zext i16 %p.1 to i32		%zext.1 = zext i16 %p.1 to i32
%shl.1 = shl i32 %zext.1, 16		%shl.1 = shl i32 %zext.1, 16
%or = or i32 %zext.0, %shl.1		%or = or i32 %zext.0, %shl.1
[... 124 lines not shown ...]
}		}

define i64 @load_3xi16_noncombine(i16 addrspace(1)* %p) #0 {		define i64 @load_3xi16_noncombine(i16 addrspace(1)* %p) #0 {
; GCN-LABEL: load_3xi16_noncombine:		; GCN-LABEL: load_3xi16_noncombine:
; GCN: ; %bb.0:		; GCN: ; %bb.0:
; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN-NEXT: global_load_ushort v2, v[0:1], off		; GCN-NEXT: global_load_ushort v2, v[0:1], off
; GCN-NEXT: global_load_dword v3, v[0:1], off offset:4		; GCN-NEXT: global_load_dword v3, v[0:1], off offset:4
; GCN-NEXT: s_mov_b32 s4, 0xffff0000		; GCN-NEXT: s_mov_b32 s4, 0x4050203
; GCN-NEXT: s_waitcnt vmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_and_or_b32 v0, v3, s4, v2		; GCN-NEXT: v_perm_b32 v0, v2, v3, s4
; GCN-NEXT: v_and_b32_e32 v1, 0xffff, v3		; GCN-NEXT: v_and_b32_e32 v1, 0xffff, v3
; GCN-NEXT: s_setpc_b64 s[30:31]		; GCN-NEXT: s_setpc_b64 s[30:31]
%gep.p = getelementptr i16, i16 addrspace(1)* %p, i32 3		%gep.p = getelementptr i16, i16 addrspace(1)* %p, i32 3
%gep.2p = getelementptr i16, i16 addrspace(1)* %p, i32 2		%gep.2p = getelementptr i16, i16 addrspace(1)* %p, i32 2
%p.0 = load i16, i16 addrspace(1)* %p, align 4		%p.0 = load i16, i16 addrspace(1)* %p, align 4
%p.1 = load i16, i16 addrspace(1)* %gep.p, align 4		%p.1 = load i16, i16 addrspace(1)* %gep.p, align 4
%p.2 = load i16, i16 addrspace(1)* %gep.2p, align 4		%p.2 = load i16, i16 addrspace(1)* %gep.2p, align 4
%zext.0 = zext i16 %p.0 to i64		%zext.0 = zext i16 %p.0 to i64
[... 9 lines not shown ...]

llvm/test/CodeGen/AMDGPU/cvt_f32_ubyte.ll

	[... 2,655 lines not shown ...]
	; VI-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; VI-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; VI-NEXT: s_mov_b32 s7, 0xf000			; VI-NEXT: s_mov_b32 s7, 0xf000
	; VI-NEXT: s_mov_b32 s6, -1			; VI-NEXT: s_mov_b32 s6, -1
	; VI-NEXT: s_waitcnt lgkmcnt(0)			; VI-NEXT: s_waitcnt lgkmcnt(0)
	; VI-NEXT: v_mov_b32_e32 v1, s1			; VI-NEXT: v_mov_b32_e32 v1, s1
	; VI-NEXT: v_add_u32_e32 v0, vcc, s0, v0			; VI-NEXT: v_add_u32_e32 v0, vcc, s0, v0
	; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; VI-NEXT: flat_load_dword v0, v[0:1]			; VI-NEXT: flat_load_dword v0, v[0:1]
				; VI-NEXT: v_mov_b32_e32 v1, 0x4050607
	; VI-NEXT: s_mov_b32 s4, s2			; VI-NEXT: s_mov_b32 s4, s2
	; VI-NEXT: s_mov_b32 s5, s3			; VI-NEXT: s_mov_b32 s5, s3
	; VI-NEXT: s_waitcnt vmcnt(0)			; VI-NEXT: s_waitcnt vmcnt(0)
	; VI-NEXT: v_or_b32_e32 v0, 0x80000001, v0			; VI-NEXT: v_perm_b32 v0, v0, 0, v1
	; VI-NEXT: v_cvt_f32_ubyte0_e32 v1, v0			; VI-NEXT: v_cvt_f32_ubyte0_e32 v1, v0
	; VI-NEXT: v_add_f32_e32 v0, v0, v1			; VI-NEXT: v_add_f32_e32 v0, v0, v1
	; VI-NEXT: buffer_store_dword v0, off, s[4:7], 0			; VI-NEXT: buffer_store_dword v0, off, s[4:7], 0
	; VI-NEXT: s_endpgm			; VI-NEXT: s_endpgm
	;			;
	; GFX10-LABEL: cvt_ubyte0_or_multiuse:			; GFX10-LABEL: cvt_ubyte0_or_multiuse:
	; GFX10: ; %bb.0: ; %bb			; GFX10: ; %bb.0: ; %bb
	; GFX10-NEXT: s_load_dwordx4 s[0:3], s[0:1], 0x24			; GFX10-NEXT: s_load_dwordx4 s[0:3], s[0:1], 0x24
	; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GFX10-NEXT: v_mov_b32_e32 v2, 0			; GFX10-NEXT: v_mov_b32_e32 v2, 0
	; GFX10-NEXT: s_waitcnt lgkmcnt(0)			; GFX10-NEXT: s_waitcnt lgkmcnt(0)
	; GFX10-NEXT: global_load_dword v0, v0, s[0:1]			; GFX10-NEXT: global_load_dword v0, v0, s[0:1]
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: v_or_b32_e32 v0, 0x80000001, v0			; GFX10-NEXT: v_perm_b32 v0, v0, 0, 0x4050607
	; GFX10-NEXT: v_cvt_f32_ubyte0_e32 v1, v0			; GFX10-NEXT: v_cvt_f32_ubyte0_e32 v1, v0
	; GFX10-NEXT: v_add_f32_e32 v0, v0, v1			; GFX10-NEXT: v_add_f32_e32 v0, v0, v1
	; GFX10-NEXT: global_store_dword v2, v0, s[2:3]			; GFX10-NEXT: global_store_dword v2, v0, s[2:3]
	; GFX10-NEXT: s_endpgm			; GFX10-NEXT: s_endpgm
	;			;
	; GFX9-LABEL: cvt_ubyte0_or_multiuse:			; GFX9-LABEL: cvt_ubyte0_or_multiuse:
	; GFX9: ; %bb.0: ; %bb			; GFX9: ; %bb.0: ; %bb
	; GFX9-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24			; GFX9-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x24
	; GFX9-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX9-NEXT: v_lshlrev_b32_e32 v0, 2, v0
				; GFX9-NEXT: v_mov_b32_e32 v2, 0x4050607
	; GFX9-NEXT: v_mov_b32_e32 v1, 0			; GFX9-NEXT: v_mov_b32_e32 v1, 0
	; GFX9-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-NEXT: global_load_dword v0, v0, s[0:1]			; GFX9-NEXT: global_load_dword v0, v0, s[0:1]
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: v_or_b32_e32 v0, 0x80000001, v0			; GFX9-NEXT: v_perm_b32 v0, v0, 0, v2
	; GFX9-NEXT: v_cvt_f32_ubyte0_e32 v2, v0			; GFX9-NEXT: v_cvt_f32_ubyte0_e32 v2, v0
	; GFX9-NEXT: v_add_f32_e32 v0, v0, v2			; GFX9-NEXT: v_add_f32_e32 v0, v0, v2
	; GFX9-NEXT: global_store_dword v1, v0, s[2:3]			; GFX9-NEXT: global_store_dword v1, v0, s[2:3]
	; GFX9-NEXT: s_endpgm			; GFX9-NEXT: s_endpgm
	;			;
	; GFX11-LABEL: cvt_ubyte0_or_multiuse:			; GFX11-LABEL: cvt_ubyte0_or_multiuse:
	; GFX11: ; %bb.0: ; %bb			; GFX11: ; %bb.0: ; %bb
	; GFX11-NEXT: s_load_b128 s[0:3], s[0:1], 0x24			; GFX11-NEXT: s_load_b128 s[0:3], s[0:1], 0x24
	; GFX11-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX11-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GFX11-NEXT: v_mov_b32_e32 v2, 0			; GFX11-NEXT: v_mov_b32_e32 v2, 0
	; GFX11-NEXT: s_waitcnt lgkmcnt(0)			; GFX11-NEXT: s_waitcnt lgkmcnt(0)
	; GFX11-NEXT: global_load_b32 v0, v0, s[0:1]			; GFX11-NEXT: global_load_b32 v0, v0, s[0:1]
	; GFX11-NEXT: s_waitcnt vmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0)
	; GFX11-NEXT: v_or_b32_e32 v0, 0x80000001, v0			; GFX11-NEXT: v_perm_b32 v0, v0, 0, 0x4050607
	; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)			; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)
	; GFX11-NEXT: v_cvt_f32_ubyte0_e32 v1, v0			; GFX11-NEXT: v_cvt_f32_ubyte0_e32 v1, v0
	; GFX11-NEXT: v_add_f32_e32 v0, v0, v1			; GFX11-NEXT: v_add_f32_e32 v0, v0, v1
	; GFX11-NEXT: global_store_b32 v2, v0, s[2:3]			; GFX11-NEXT: global_store_b32 v2, v0, s[2:3]
	; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)			; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
	; GFX11-NEXT: s_endpgm			; GFX11-NEXT: s_endpgm
	bb:			bb:
	%lid = tail call i32 @llvm.amdgcn.workitem.id.x()			%lid = tail call i32 @llvm.amdgcn.workitem.id.x()
	[... 166 lines not shown ...]

llvm/test/CodeGen/AMDGPU/ds_read2.ll

	[... 558 lines not shown ...]
	; CI-NEXT: buffer_store_dword v2, v[0:1], s[0:3], 0 addr64			; CI-NEXT: buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
	; CI-NEXT: s_endpgm			; CI-NEXT: s_endpgm
	;			;
	; GFX9-ALIGNED-LABEL: unaligned_read2_f32:			; GFX9-ALIGNED-LABEL: unaligned_read2_f32:
	; GFX9-ALIGNED: ; %bb.0:			; GFX9-ALIGNED: ; %bb.0:
	; GFX9-ALIGNED-NEXT: s_load_dword s4, s[0:1], 0x8			; GFX9-ALIGNED-NEXT: s_load_dword s4, s[0:1], 0x8
	; GFX9-ALIGNED-NEXT: s_load_dwordx2 s[2:3], s[0:1], 0x0			; GFX9-ALIGNED-NEXT: s_load_dwordx2 s[2:3], s[0:1], 0x0
	; GFX9-ALIGNED-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX9-ALIGNED-NEXT: v_lshlrev_b32_e32 v0, 2, v0
				; GFX9-ALIGNED-NEXT: s_mov_b32 s0, 0x4050001
	; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-ALIGNED-NEXT: v_add_u32_e32 v1, s4, v0			; GFX9-ALIGNED-NEXT: v_add_u32_e32 v1, s4, v0
	; GFX9-ALIGNED-NEXT: ds_read_u8 v2, v1			; GFX9-ALIGNED-NEXT: ds_read_u8 v2, v1
	; GFX9-ALIGNED-NEXT: ds_read_u8 v3, v1 offset:1			; GFX9-ALIGNED-NEXT: ds_read_u8 v3, v1 offset:1
	; GFX9-ALIGNED-NEXT: ds_read_u8 v4, v1 offset:2			; GFX9-ALIGNED-NEXT: ds_read_u8 v4, v1 offset:2
	; GFX9-ALIGNED-NEXT: ds_read_u8 v5, v1 offset:3			; GFX9-ALIGNED-NEXT: ds_read_u8 v5, v1 offset:3
	; GFX9-ALIGNED-NEXT: ds_read_u8 v6, v1 offset:32			; GFX9-ALIGNED-NEXT: ds_read_u8 v6, v1 offset:32
	; GFX9-ALIGNED-NEXT: ds_read_u8 v7, v1 offset:33			; GFX9-ALIGNED-NEXT: ds_read_u8 v7, v1 offset:33
	; GFX9-ALIGNED-NEXT: ds_read_u8 v8, v1 offset:34			; GFX9-ALIGNED-NEXT: ds_read_u8 v8, v1 offset:34
	; GFX9-ALIGNED-NEXT: ds_read_u8 v1, v1 offset:35			; GFX9-ALIGNED-NEXT: ds_read_u8 v1, v1 offset:35
	; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(6)
	; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v2, v3, 8, v2
	; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(4)			; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(4)
	; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v3, v5, 8, v4			; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v4, v5, 8, v4
	; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v2, v3, 16, v2			; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v2, v3, 8, v2
	; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(2)			; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(2)
	; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v3, v7, 8, v6			; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v3, v7, 8, v6
				; GFX9-ALIGNED-NEXT: v_perm_b32 v2, v2, v4, s0
	; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v1, v1, 8, v8			; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v1, v1, 8, v8
	; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v1, v1, 16, v3			; GFX9-ALIGNED-NEXT: v_perm_b32 v1, v3, v1, s0
	; GFX9-ALIGNED-NEXT: v_add_f32_e32 v1, v2, v1			; GFX9-ALIGNED-NEXT: v_add_f32_e32 v1, v2, v1
	; GFX9-ALIGNED-NEXT: global_store_dword v0, v1, s[2:3]			; GFX9-ALIGNED-NEXT: global_store_dword v0, v1, s[2:3]
	; GFX9-ALIGNED-NEXT: s_endpgm			; GFX9-ALIGNED-NEXT: s_endpgm
	;			;
	; GFX9-UNALIGNED-LABEL: unaligned_read2_f32:			; GFX9-UNALIGNED-LABEL: unaligned_read2_f32:
	; GFX9-UNALIGNED: ; %bb.0:			; GFX9-UNALIGNED: ; %bb.0:
	; GFX9-UNALIGNED-NEXT: s_load_dword s2, s[0:1], 0x8			; GFX9-UNALIGNED-NEXT: s_load_dword s2, s[0:1], 0x8
	; GFX9-UNALIGNED-NEXT: v_lshlrev_b32_e32 v2, 2, v0			; GFX9-UNALIGNED-NEXT: v_lshlrev_b32_e32 v2, 2, v0
	[... 57 lines not shown ...]
	; CI-NEXT: buffer_store_dword v2, v[0:1], s[0:3], 0 addr64			; CI-NEXT: buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
	; CI-NEXT: s_endpgm			; CI-NEXT: s_endpgm
	;			;
	; GFX9-ALIGNED-LABEL: unaligned_offset_read2_f32:			; GFX9-ALIGNED-LABEL: unaligned_offset_read2_f32:
	; GFX9-ALIGNED: ; %bb.0:			; GFX9-ALIGNED: ; %bb.0:
	; GFX9-ALIGNED-NEXT: s_load_dword s4, s[0:1], 0x8			; GFX9-ALIGNED-NEXT: s_load_dword s4, s[0:1], 0x8
	; GFX9-ALIGNED-NEXT: s_load_dwordx2 s[2:3], s[0:1], 0x0			; GFX9-ALIGNED-NEXT: s_load_dwordx2 s[2:3], s[0:1], 0x0
	; GFX9-ALIGNED-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX9-ALIGNED-NEXT: v_lshlrev_b32_e32 v0, 2, v0
				; GFX9-ALIGNED-NEXT: s_mov_b32 s0, 0x4050001
	; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-ALIGNED-NEXT: v_add_u32_e32 v1, s4, v0			; GFX9-ALIGNED-NEXT: v_add_u32_e32 v1, s4, v0
	; GFX9-ALIGNED-NEXT: ds_read_u8 v2, v1 offset:5			; GFX9-ALIGNED-NEXT: ds_read_u8 v2, v1 offset:5
	; GFX9-ALIGNED-NEXT: ds_read_u8 v3, v1 offset:6			; GFX9-ALIGNED-NEXT: ds_read_u8 v3, v1 offset:6
	; GFX9-ALIGNED-NEXT: ds_read_u8 v4, v1 offset:7			; GFX9-ALIGNED-NEXT: ds_read_u8 v4, v1 offset:7
	; GFX9-ALIGNED-NEXT: ds_read_u8 v5, v1 offset:8			; GFX9-ALIGNED-NEXT: ds_read_u8 v5, v1 offset:8
	; GFX9-ALIGNED-NEXT: ds_read_u8 v6, v1 offset:9			; GFX9-ALIGNED-NEXT: ds_read_u8 v6, v1 offset:9
	; GFX9-ALIGNED-NEXT: ds_read_u8 v7, v1 offset:10			; GFX9-ALIGNED-NEXT: ds_read_u8 v7, v1 offset:10
	; GFX9-ALIGNED-NEXT: ds_read_u8 v8, v1 offset:11			; GFX9-ALIGNED-NEXT: ds_read_u8 v8, v1 offset:11
	; GFX9-ALIGNED-NEXT: ds_read_u8 v1, v1 offset:12			; GFX9-ALIGNED-NEXT: ds_read_u8 v1, v1 offset:12
	; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(6)
	; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v2, v3, 8, v2
	; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(4)			; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(4)
	; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v3, v5, 8, v4			; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v4, v5, 8, v4
	; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v2, v3, 16, v2			; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v2, v3, 8, v2
	; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(2)			; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(2)
	; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v3, v7, 8, v6			; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v3, v7, 8, v6
				; GFX9-ALIGNED-NEXT: v_perm_b32 v2, v2, v4, s0
	; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v1, v1, 8, v8			; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v1, v1, 8, v8
	; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v1, v1, 16, v3			; GFX9-ALIGNED-NEXT: v_perm_b32 v1, v3, v1, s0
	; GFX9-ALIGNED-NEXT: v_add_f32_e32 v1, v2, v1			; GFX9-ALIGNED-NEXT: v_add_f32_e32 v1, v2, v1
	; GFX9-ALIGNED-NEXT: global_store_dword v0, v1, s[2:3]			; GFX9-ALIGNED-NEXT: global_store_dword v0, v1, s[2:3]
	; GFX9-ALIGNED-NEXT: s_endpgm			; GFX9-ALIGNED-NEXT: s_endpgm
	;			;
	; GFX9-UNALIGNED-LABEL: unaligned_offset_read2_f32:			; GFX9-UNALIGNED-LABEL: unaligned_offset_read2_f32:
	; GFX9-UNALIGNED: ; %bb.0:			; GFX9-UNALIGNED: ; %bb.0:
	; GFX9-UNALIGNED-NEXT: s_load_dword s2, s[0:1], 0x8			; GFX9-UNALIGNED-NEXT: s_load_dword s2, s[0:1], 0x8
	; GFX9-UNALIGNED-NEXT: v_lshlrev_b32_e32 v2, 2, v0			; GFX9-UNALIGNED-NEXT: v_lshlrev_b32_e32 v2, 2, v0
	[... 53 lines not shown ...]
	; GFX9-ALIGNED-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX9-ALIGNED-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GFX9-ALIGNED-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x0			; GFX9-ALIGNED-NEXT: s_load_dwordx2 s[0:1], s[0:1], 0x0
	; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-ALIGNED-NEXT: v_add_u32_e32 v1, s2, v0			; GFX9-ALIGNED-NEXT: v_add_u32_e32 v1, s2, v0
	; GFX9-ALIGNED-NEXT: ds_read_u16 v2, v1			; GFX9-ALIGNED-NEXT: ds_read_u16 v2, v1
	; GFX9-ALIGNED-NEXT: ds_read_u16 v3, v1 offset:2			; GFX9-ALIGNED-NEXT: ds_read_u16 v3, v1 offset:2
	; GFX9-ALIGNED-NEXT: ds_read_u16 v4, v1 offset:32			; GFX9-ALIGNED-NEXT: ds_read_u16 v4, v1 offset:32
	; GFX9-ALIGNED-NEXT: ds_read_u16 v1, v1 offset:34			; GFX9-ALIGNED-NEXT: ds_read_u16 v1, v1 offset:34
				; GFX9-ALIGNED-NEXT: s_mov_b32 s2, 0x4050001
	; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(2)			; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(2)
	; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v2, v3, 16, v2			; GFX9-ALIGNED-NEXT: v_perm_b32 v2, v2, v3, s2
	; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-ALIGNED-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-ALIGNED-NEXT: v_lshl_or_b32 v1, v1, 16, v4			; GFX9-ALIGNED-NEXT: v_perm_b32 v1, v4, v1, s2
	; GFX9-ALIGNED-NEXT: v_add_f32_e32 v1, v2, v1			; GFX9-ALIGNED-NEXT: v_add_f32_e32 v1, v2, v1
	; GFX9-ALIGNED-NEXT: global_store_dword v0, v1, s[0:1]			; GFX9-ALIGNED-NEXT: global_store_dword v0, v1, s[0:1]
	; GFX9-ALIGNED-NEXT: s_endpgm			; GFX9-ALIGNED-NEXT: s_endpgm
	;			;
	; GFX9-UNALIGNED-LABEL: misaligned_2_simple_read2_f32:			; GFX9-UNALIGNED-LABEL: misaligned_2_simple_read2_f32:
	; GFX9-UNALIGNED: ; %bb.0:			; GFX9-UNALIGNED: ; %bb.0:
	; GFX9-UNALIGNED-NEXT: s_load_dword s2, s[0:1], 0x8			; GFX9-UNALIGNED-NEXT: s_load_dword s2, s[0:1], 0x8
	; GFX9-UNALIGNED-NEXT: v_lshlrev_b32_e32 v2, 2, v0			; GFX9-UNALIGNED-NEXT: v_lshlrev_b32_e32 v2, 2, v0
	[... 795 lines not shown ...]

llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll

	[... 32 lines not shown ...]
	; GFX7-UNALIGNED-NEXT: v_or_b32_e32 v0, v0, v1			; GFX7-UNALIGNED-NEXT: v_or_b32_e32 v0, v0, v1
	; GFX7-UNALIGNED-NEXT: s_setpc_b64 s[30:31]			; GFX7-UNALIGNED-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX9-LABEL: global_load_2xi16_align2:			; GFX9-LABEL: global_load_2xi16_align2:
	; GFX9: ; %bb.0:			; GFX9: ; %bb.0:
	; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX9-NEXT: global_load_ushort v2, v[0:1], off			; GFX9-NEXT: global_load_ushort v2, v[0:1], off
	; GFX9-NEXT: global_load_ushort v3, v[0:1], off offset:2			; GFX9-NEXT: global_load_ushort v3, v[0:1], off offset:2
				; GFX9-NEXT: s_mov_b32 s4, 0x4050001
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: v_lshl_or_b32 v0, v3, 16, v2			; GFX9-NEXT: v_perm_b32 v0, v2, v3, s4
	foad (inline comment): These kinds of changes look like regressions for some combination of code size / latency / SGPR pressure.
	; GFX9-NEXT: s_setpc_b64 s[30:31]			; GFX9-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX10-LABEL: global_load_2xi16_align2:			; GFX10-LABEL: global_load_2xi16_align2:
	; GFX10: ; %bb.0:			; GFX10: ; %bb.0:
	; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX10-NEXT: s_waitcnt_vscnt null, 0x0			; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX10-NEXT: s_clause 0x1			; GFX10-NEXT: s_clause 0x1
	; GFX10-NEXT: global_load_ushort v2, v[0:1], off			; GFX10-NEXT: global_load_ushort v2, v[0:1], off
	; GFX10-NEXT: global_load_ushort v3, v[0:1], off offset:2			; GFX10-NEXT: global_load_ushort v3, v[0:1], off offset:2
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: v_lshl_or_b32 v0, v3, 16, v2			; GFX10-NEXT: v_perm_b32 v0, v2, v3, 0x4050001
	; GFX10-NEXT: s_setpc_b64 s[30:31]			; GFX10-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX11-LABEL: global_load_2xi16_align2:			; GFX11-LABEL: global_load_2xi16_align2:
	; GFX11: ; %bb.0:			; GFX11: ; %bb.0:
	; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX11-NEXT: s_waitcnt_vscnt null, 0x0			; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX11-NEXT: s_clause 0x1			; GFX11-NEXT: s_clause 0x1
	; GFX11-NEXT: global_load_u16 v2, v[0:1], off			; GFX11-NEXT: global_load_u16 v2, v[0:1], off
	; GFX11-NEXT: global_load_u16 v0, v[0:1], off offset:2			; GFX11-NEXT: global_load_u16 v0, v[0:1], off offset:2
	; GFX11-NEXT: s_waitcnt vmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0)
	; GFX11-NEXT: v_lshl_or_b32 v0, v0, 16, v2			; GFX11-NEXT: v_perm_b32 v0, v2, v0, 0x4050001
	arsenm (inline comment): Yes, this is worse. We should avoid matching cases that can use v_lshl_or_b32.
	; GFX11-NEXT: s_setpc_b64 s[30:31]			; GFX11-NEXT: s_setpc_b64 s[30:31]
	%gep.p = getelementptr i16, i16 addrspace(1)* %p, i64 1			%gep.p = getelementptr i16, i16 addrspace(1)* %p, i64 1
	%p.0 = load i16, i16 addrspace(1)* %p, align 2			%p.0 = load i16, i16 addrspace(1)* %p, align 2
	%p.1 = load i16, i16 addrspace(1)* %gep.p, align 2			%p.1 = load i16, i16 addrspace(1)* %gep.p, align 2
	%zext.0 = zext i16 %p.0 to i32			%zext.0 = zext i16 %p.0 to i32
	%zext.1 = zext i16 %p.1 to i32			%zext.1 = zext i16 %p.1 to i32
	%shl.1 = shl i32 %zext.1, 16			%shl.1 = shl i32 %zext.1, 16
	%or = or i32 %zext.0, %shl.1			%or = or i32 %zext.0, %shl.1
	[... 357 lines not shown ...]

llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll

	[... 33 lines not shown ...]
	; GFX7-UNALIGNED-NEXT: v_or_b32_e32 v0, v0, v1			; GFX7-UNALIGNED-NEXT: v_or_b32_e32 v0, v0, v1
	; GFX7-UNALIGNED-NEXT: s_setpc_b64 s[30:31]			; GFX7-UNALIGNED-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX9-LABEL: private_load_2xi16_align2:			; GFX9-LABEL: private_load_2xi16_align2:
	; GFX9: ; %bb.0:			; GFX9: ; %bb.0:
	; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX9-NEXT: buffer_load_ushort v1, v0, s[0:3], 0 offen			; GFX9-NEXT: buffer_load_ushort v1, v0, s[0:3], 0 offen
	; GFX9-NEXT: buffer_load_ushort v2, v0, s[0:3], 0 offen offset:2			; GFX9-NEXT: buffer_load_ushort v2, v0, s[0:3], 0 offen offset:2
				; GFX9-NEXT: s_mov_b32 s4, 0x4050001
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: v_lshl_or_b32 v0, v2, 16, v1			; GFX9-NEXT: v_perm_b32 v0, v1, v2, s4
	; GFX9-NEXT: s_setpc_b64 s[30:31]			; GFX9-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX9-FLASTSCR-LABEL: private_load_2xi16_align2:			; GFX9-FLASTSCR-LABEL: private_load_2xi16_align2:
	; GFX9-FLASTSCR: ; %bb.0:			; GFX9-FLASTSCR: ; %bb.0:
	; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX9-FLASTSCR-NEXT: scratch_load_ushort v1, v0, off			; GFX9-FLASTSCR-NEXT: scratch_load_ushort v1, v0, off
	; GFX9-FLASTSCR-NEXT: scratch_load_ushort v2, v0, off offset:2			; GFX9-FLASTSCR-NEXT: scratch_load_ushort v2, v0, off offset:2
				; GFX9-FLASTSCR-NEXT: s_mov_b32 s0, 0x4050001
	; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0)			; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0)
	; GFX9-FLASTSCR-NEXT: v_lshl_or_b32 v0, v2, 16, v1			; GFX9-FLASTSCR-NEXT: v_perm_b32 v0, v1, v2, s0
	; GFX9-FLASTSCR-NEXT: s_setpc_b64 s[30:31]			; GFX9-FLASTSCR-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX10-LABEL: private_load_2xi16_align2:			; GFX10-LABEL: private_load_2xi16_align2:
	; GFX10: ; %bb.0:			; GFX10: ; %bb.0:
	; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX10-NEXT: s_waitcnt_vscnt null, 0x0			; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX10-NEXT: s_clause 0x1			; GFX10-NEXT: s_clause 0x1
	; GFX10-NEXT: buffer_load_ushort v1, v0, s[0:3], 0 offen			; GFX10-NEXT: buffer_load_ushort v1, v0, s[0:3], 0 offen
	; GFX10-NEXT: buffer_load_ushort v2, v0, s[0:3], 0 offen offset:2			; GFX10-NEXT: buffer_load_ushort v2, v0, s[0:3], 0 offen offset:2
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: v_lshl_or_b32 v0, v2, 16, v1			; GFX10-NEXT: v_perm_b32 v0, v1, v2, 0x4050001
	; GFX10-NEXT: s_setpc_b64 s[30:31]			; GFX10-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX10-FLASTSCR-LABEL: private_load_2xi16_align2:			; GFX10-FLASTSCR-LABEL: private_load_2xi16_align2:
	; GFX10-FLASTSCR: ; %bb.0:			; GFX10-FLASTSCR: ; %bb.0:
	; GFX10-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX10-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX10-FLASTSCR-NEXT: s_waitcnt_vscnt null, 0x0			; GFX10-FLASTSCR-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX10-FLASTSCR-NEXT: s_clause 0x1			; GFX10-FLASTSCR-NEXT: s_clause 0x1
	; GFX10-FLASTSCR-NEXT: scratch_load_ushort v1, v0, off			; GFX10-FLASTSCR-NEXT: scratch_load_ushort v1, v0, off
	; GFX10-FLASTSCR-NEXT: scratch_load_ushort v2, v0, off offset:2			; GFX10-FLASTSCR-NEXT: scratch_load_ushort v2, v0, off offset:2
	; GFX10-FLASTSCR-NEXT: s_waitcnt vmcnt(0)			; GFX10-FLASTSCR-NEXT: s_waitcnt vmcnt(0)
	; GFX10-FLASTSCR-NEXT: v_lshl_or_b32 v0, v2, 16, v1			; GFX10-FLASTSCR-NEXT: v_perm_b32 v0, v1, v2, 0x4050001
	; GFX10-FLASTSCR-NEXT: s_setpc_b64 s[30:31]			; GFX10-FLASTSCR-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX11-LABEL: private_load_2xi16_align2:			; GFX11-LABEL: private_load_2xi16_align2:
	; GFX11: ; %bb.0:			; GFX11: ; %bb.0:
	; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX11-NEXT: s_waitcnt_vscnt null, 0x0			; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX11-NEXT: s_clause 0x1			; GFX11-NEXT: s_clause 0x1
	; GFX11-NEXT: scratch_load_u16 v1, v0, off			; GFX11-NEXT: scratch_load_u16 v1, v0, off
	; GFX11-NEXT: scratch_load_u16 v0, v0, off offset:2			; GFX11-NEXT: scratch_load_u16 v0, v0, off offset:2
	; GFX11-NEXT: s_waitcnt vmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0)
	; GFX11-NEXT: v_lshl_or_b32 v0, v0, 16, v1			; GFX11-NEXT: v_perm_b32 v0, v1, v0, 0x4050001
	; GFX11-NEXT: s_setpc_b64 s[30:31]			; GFX11-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX11-FLASTSCR-LABEL: private_load_2xi16_align2:			; GFX11-FLASTSCR-LABEL: private_load_2xi16_align2:
	; GFX11-FLASTSCR: ; %bb.0:			; GFX11-FLASTSCR: ; %bb.0:
	; GFX11-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX11-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX11-FLASTSCR-NEXT: s_waitcnt_vscnt null, 0x0			; GFX11-FLASTSCR-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX11-FLASTSCR-NEXT: s_clause 0x1			; GFX11-FLASTSCR-NEXT: s_clause 0x1
	; GFX11-FLASTSCR-NEXT: scratch_load_u16 v1, v0, off			; GFX11-FLASTSCR-NEXT: scratch_load_u16 v1, v0, off
	; GFX11-FLASTSCR-NEXT: scratch_load_u16 v0, v0, off offset:2			; GFX11-FLASTSCR-NEXT: scratch_load_u16 v0, v0, off offset:2
	; GFX11-FLASTSCR-NEXT: s_waitcnt vmcnt(0)			; GFX11-FLASTSCR-NEXT: s_waitcnt vmcnt(0)
	; GFX11-FLASTSCR-NEXT: v_lshl_or_b32 v0, v0, 16, v1			; GFX11-FLASTSCR-NEXT: v_perm_b32 v0, v1, v0, 0x4050001
	; GFX11-FLASTSCR-NEXT: s_setpc_b64 s[30:31]			; GFX11-FLASTSCR-NEXT: s_setpc_b64 s[30:31]
	%gep.p = getelementptr i16, i16 addrspace(5)* %p, i64 1			%gep.p = getelementptr i16, i16 addrspace(5)* %p, i64 1
	%p.0 = load i16, i16 addrspace(5)* %p, align 2			%p.0 = load i16, i16 addrspace(5)* %p, align 2
	%p.1 = load i16, i16 addrspace(5)* %gep.p, align 2			%p.1 = load i16, i16 addrspace(5)* %gep.p, align 2
	%zext.0 = zext i16 %p.0 to i32			%zext.0 = zext i16 %p.0 to i32
	%zext.1 = zext i16 %p.1 to i32			%zext.1 = zext i16 %p.1 to i32
	%shl.1 = shl i32 %zext.1, 16			%shl.1 = shl i32 %zext.1, 16
	%or = or i32 %zext.0, %shl.1			%or = or i32 %zext.0, %shl.1
	[... 125 lines not shown ...]
	; GFX7-UNALIGNED-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen			; GFX7-UNALIGNED-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen
	; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0)			; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0)
	; GFX7-UNALIGNED-NEXT: s_setpc_b64 s[30:31]			; GFX7-UNALIGNED-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX9-LABEL: private_load_2xi16_align1:			; GFX9-LABEL: private_load_2xi16_align1:
	; GFX9: ; %bb.0:			; GFX9: ; %bb.0:
	; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX9-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen			; GFX9-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen
	; GFX9-NEXT: s_mov_b32 s4, 0xffff			; GFX9-NEXT: v_mov_b32_e32 v1, 0x4050607
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: v_and_b32_e32 v1, 0xffff0000, v0			; GFX9-NEXT: v_perm_b32 v0, v0, 0, v1
	; GFX9-NEXT: v_and_or_b32 v0, v0, s4, v1
	; GFX9-NEXT: s_setpc_b64 s[30:31]			; GFX9-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX9-FLASTSCR-LABEL: private_load_2xi16_align1:			; GFX9-FLASTSCR-LABEL: private_load_2xi16_align1:
	; GFX9-FLASTSCR: ; %bb.0:			; GFX9-FLASTSCR: ; %bb.0:
	; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX9-FLASTSCR-NEXT: scratch_load_dword v0, v0, off			; GFX9-FLASTSCR-NEXT: scratch_load_dword v0, v0, off
	; GFX9-FLASTSCR-NEXT: s_mov_b32 s0, 0xffff			; GFX9-FLASTSCR-NEXT: v_mov_b32_e32 v1, 0x4050607
	; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0)			; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0)
	; GFX9-FLASTSCR-NEXT: v_and_b32_e32 v1, 0xffff0000, v0			; GFX9-FLASTSCR-NEXT: v_perm_b32 v0, v0, 0, v1
	; GFX9-FLASTSCR-NEXT: v_and_or_b32 v0, v0, s0, v1
	; GFX9-FLASTSCR-NEXT: s_setpc_b64 s[30:31]			; GFX9-FLASTSCR-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX10-LABEL: private_load_2xi16_align1:			; GFX10-LABEL: private_load_2xi16_align1:
	; GFX10: ; %bb.0:			; GFX10: ; %bb.0:
	; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX10-NEXT: s_waitcnt_vscnt null, 0x0			; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX10-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen			; GFX10-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: v_and_b32_e32 v1, 0xffff0000, v0			; GFX10-NEXT: v_perm_b32 v0, v0, 0, 0x4050607
	; GFX10-NEXT: v_and_or_b32 v0, 0xffff, v0, v1
	; GFX10-NEXT: s_setpc_b64 s[30:31]			; GFX10-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX10-FLASTSCR-LABEL: private_load_2xi16_align1:			; GFX10-FLASTSCR-LABEL: private_load_2xi16_align1:
	; GFX10-FLASTSCR: ; %bb.0:			; GFX10-FLASTSCR: ; %bb.0:
	; GFX10-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX10-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX10-FLASTSCR-NEXT: s_waitcnt_vscnt null, 0x0			; GFX10-FLASTSCR-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX10-FLASTSCR-NEXT: scratch_load_dword v0, v0, off			; GFX10-FLASTSCR-NEXT: scratch_load_dword v0, v0, off
	; GFX10-FLASTSCR-NEXT: s_waitcnt vmcnt(0)			; GFX10-FLASTSCR-NEXT: s_waitcnt vmcnt(0)
	; GFX10-FLASTSCR-NEXT: v_and_b32_e32 v1, 0xffff0000, v0			; GFX10-FLASTSCR-NEXT: v_perm_b32 v0, v0, 0, 0x4050607
	; GFX10-FLASTSCR-NEXT: v_and_or_b32 v0, 0xffff, v0, v1
	; GFX10-FLASTSCR-NEXT: s_setpc_b64 s[30:31]			; GFX10-FLASTSCR-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX11-LABEL: private_load_2xi16_align1:			; GFX11-LABEL: private_load_2xi16_align1:
	; GFX11: ; %bb.0:			; GFX11: ; %bb.0:
	; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX11-NEXT: s_waitcnt_vscnt null, 0x0			; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX11-NEXT: scratch_load_b32 v0, v0, off			; GFX11-NEXT: scratch_load_b32 v0, v0, off
	; GFX11-NEXT: s_waitcnt vmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0)
	; GFX11-NEXT: v_and_b32_e32 v1, 0xffff0000, v0			; GFX11-NEXT: v_perm_b32 v0, v0, 0, 0x4050607
	; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1)
	; GFX11-NEXT: v_and_or_b32 v0, 0xffff, v0, v1
	; GFX11-NEXT: s_setpc_b64 s[30:31]			; GFX11-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX11-FLASTSCR-LABEL: private_load_2xi16_align1:			; GFX11-FLASTSCR-LABEL: private_load_2xi16_align1:
	; GFX11-FLASTSCR: ; %bb.0:			; GFX11-FLASTSCR: ; %bb.0:
	; GFX11-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX11-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX11-FLASTSCR-NEXT: s_waitcnt_vscnt null, 0x0			; GFX11-FLASTSCR-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX11-FLASTSCR-NEXT: scratch_load_b32 v0, v0, off			; GFX11-FLASTSCR-NEXT: scratch_load_b32 v0, v0, off
	; GFX11-FLASTSCR-NEXT: s_waitcnt vmcnt(0)			; GFX11-FLASTSCR-NEXT: s_waitcnt vmcnt(0)
	; GFX11-FLASTSCR-NEXT: v_and_b32_e32 v1, 0xffff0000, v0			; GFX11-FLASTSCR-NEXT: v_perm_b32 v0, v0, 0, 0x4050607
	; GFX11-FLASTSCR-NEXT: s_delay_alu instid0(VALU_DEP_1)
	; GFX11-FLASTSCR-NEXT: v_and_or_b32 v0, 0xffff, v0, v1
	; GFX11-FLASTSCR-NEXT: s_setpc_b64 s[30:31]			; GFX11-FLASTSCR-NEXT: s_setpc_b64 s[30:31]
	%gep.p = getelementptr i16, i16 addrspace(5)* %p, i64 1			%gep.p = getelementptr i16, i16 addrspace(5)* %p, i64 1
	%p.0 = load i16, i16 addrspace(5)* %p, align 1			%p.0 = load i16, i16 addrspace(5)* %p, align 1
	%p.1 = load i16, i16 addrspace(5)* %gep.p, align 1			%p.1 = load i16, i16 addrspace(5)* %gep.p, align 1
	%zext.0 = zext i16 %p.0 to i32			%zext.0 = zext i16 %p.0 to i32
	%zext.1 = zext i16 %p.1 to i32			%zext.1 = zext i16 %p.1 to i32
	%shl.1 = shl i32 %zext.1, 16			%shl.1 = shl i32 %zext.1, 16
	%or = or i32 %zext.0, %shl.1			%or = or i32 %zext.0, %shl.1
	[... 241 lines not shown ...]

llvm/test/CodeGen/AMDGPU/insert_vector_elt.v2i16.ll

	[... 749 lines not shown ...]
	; VI-NEXT: v_lshlrev_b32_e32 v2, 2, v0			; VI-NEXT: v_lshlrev_b32_e32 v2, 2, v0
	; VI-NEXT: s_waitcnt lgkmcnt(0)			; VI-NEXT: s_waitcnt lgkmcnt(0)
	; VI-NEXT: v_mov_b32_e32 v1, s3			; VI-NEXT: v_mov_b32_e32 v1, s3
	; VI-NEXT: v_add_u32_e32 v0, vcc, s2, v2			; VI-NEXT: v_add_u32_e32 v0, vcc, s2, v2
	; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; VI-NEXT: flat_load_dword v3, v[0:1]			; VI-NEXT: flat_load_dword v3, v[0:1]
	; VI-NEXT: v_mov_b32_e32 v1, s1			; VI-NEXT: v_mov_b32_e32 v1, s1
	; VI-NEXT: v_add_u32_e32 v0, vcc, s0, v2			; VI-NEXT: v_add_u32_e32 v0, vcc, s0, v2
				; VI-NEXT: v_mov_b32_e32 v2, 0x6070203
	; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; VI-NEXT: s_waitcnt vmcnt(0)			; VI-NEXT: s_waitcnt vmcnt(0)
	; VI-NEXT: v_lshrrev_b32_e32 v2, 16, v3			; VI-NEXT: v_perm_b32 v2, s4, v3, v2
	; VI-NEXT: v_alignbit_b32 v2, v2, s4, 16
	; VI-NEXT: flat_store_dword v[0:1], v2			; VI-NEXT: flat_store_dword v[0:1], v2
	; VI-NEXT: s_endpgm			; VI-NEXT: s_endpgm
	;			;
	; CI-LABEL: v_insertelement_v2i16_0_reghi:			; CI-LABEL: v_insertelement_v2i16_0_reghi:
	; CI: ; %bb.0:			; CI: ; %bb.0:
	; CI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0			; CI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
	; CI-NEXT: s_load_dword s4, s[4:5], 0x4			; CI-NEXT: s_load_dword s4, s[4:5], 0x4
	; CI-NEXT: v_lshlrev_b32_e32 v2, 2, v0			; CI-NEXT: v_lshlrev_b32_e32 v2, 2, v0
	[... 819 lines not shown ...]
	; GFX9-NEXT: global_store_dwordx2 v2, v[0:1], s[0:1]			; GFX9-NEXT: global_store_dwordx2 v2, v[0:1], s[0:1]
	; GFX9-NEXT: s_endpgm			; GFX9-NEXT: s_endpgm
	;			;
	; VI-LABEL: v_insertelement_v4f16_0:			; VI-LABEL: v_insertelement_v4f16_0:
	; VI: ; %bb.0:			; VI: ; %bb.0:
	; VI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0			; VI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
	; VI-NEXT: s_load_dword s4, s[4:5], 0x30			; VI-NEXT: s_load_dword s4, s[4:5], 0x30
	; VI-NEXT: v_lshlrev_b32_e32 v2, 3, v0			; VI-NEXT: v_lshlrev_b32_e32 v2, 3, v0
				; VI-NEXT: v_mov_b32_e32 v4, 0x4050203
	; VI-NEXT: s_waitcnt lgkmcnt(0)			; VI-NEXT: s_waitcnt lgkmcnt(0)
	; VI-NEXT: v_mov_b32_e32 v1, s3			; VI-NEXT: v_mov_b32_e32 v1, s3
	; VI-NEXT: v_add_u32_e32 v0, vcc, s2, v2			; VI-NEXT: v_add_u32_e32 v0, vcc, s2, v2
	; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; VI-NEXT: flat_load_dwordx2 v[0:1], v[0:1]			; VI-NEXT: flat_load_dwordx2 v[0:1], v[0:1]
	; VI-NEXT: v_mov_b32_e32 v3, s1			; VI-NEXT: v_mov_b32_e32 v3, s1
	; VI-NEXT: v_add_u32_e32 v2, vcc, s0, v2			; VI-NEXT: v_add_u32_e32 v2, vcc, s0, v2
	; VI-NEXT: s_mov_b32 s0, 0xffff
	; VI-NEXT: v_mov_b32_e32 v4, s4
	; VI-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc			; VI-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc
	; VI-NEXT: s_waitcnt vmcnt(0)			; VI-NEXT: s_waitcnt vmcnt(0)
	; VI-NEXT: v_bfi_b32 v0, s0, v4, v0			; VI-NEXT: v_perm_b32 v0, s4, v0, v4
	; VI-NEXT: flat_store_dwordx2 v[2:3], v[0:1]			; VI-NEXT: flat_store_dwordx2 v[2:3], v[0:1]
	; VI-NEXT: s_endpgm			; VI-NEXT: s_endpgm
	;			;
	; CI-LABEL: v_insertelement_v4f16_0:			; CI-LABEL: v_insertelement_v4f16_0:
	; CI: ; %bb.0:			; CI: ; %bb.0:
	; CI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0			; CI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
	; CI-NEXT: s_load_dword s4, s[4:5], 0xc			; CI-NEXT: s_load_dword s4, s[4:5], 0xc
	; CI-NEXT: v_lshlrev_b32_e32 v2, 3, v0			; CI-NEXT: v_lshlrev_b32_e32 v2, 3, v0
	[... 129 lines not shown ...]
	; GFX9-NEXT: global_store_dwordx2 v2, v[0:1], s[0:1]			; GFX9-NEXT: global_store_dwordx2 v2, v[0:1], s[0:1]
	; GFX9-NEXT: s_endpgm			; GFX9-NEXT: s_endpgm
	;			;
	; VI-LABEL: v_insertelement_v4f16_2:			; VI-LABEL: v_insertelement_v4f16_2:
	; VI: ; %bb.0:			; VI: ; %bb.0:
	; VI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0			; VI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
	; VI-NEXT: s_load_dword s4, s[4:5], 0x30			; VI-NEXT: s_load_dword s4, s[4:5], 0x30
	; VI-NEXT: v_lshlrev_b32_e32 v2, 3, v0			; VI-NEXT: v_lshlrev_b32_e32 v2, 3, v0
				; VI-NEXT: v_mov_b32_e32 v4, 0x4050203
	; VI-NEXT: s_waitcnt lgkmcnt(0)			; VI-NEXT: s_waitcnt lgkmcnt(0)
	; VI-NEXT: v_mov_b32_e32 v1, s3			; VI-NEXT: v_mov_b32_e32 v1, s3
	; VI-NEXT: v_add_u32_e32 v0, vcc, s2, v2			; VI-NEXT: v_add_u32_e32 v0, vcc, s2, v2
	; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; VI-NEXT: flat_load_dwordx2 v[0:1], v[0:1]			; VI-NEXT: flat_load_dwordx2 v[0:1], v[0:1]
	; VI-NEXT: v_mov_b32_e32 v3, s1			; VI-NEXT: v_mov_b32_e32 v3, s1
	; VI-NEXT: v_add_u32_e32 v2, vcc, s0, v2			; VI-NEXT: v_add_u32_e32 v2, vcc, s0, v2
	; VI-NEXT: s_mov_b32 s0, 0xffff
	; VI-NEXT: v_mov_b32_e32 v4, s4
	; VI-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc			; VI-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc
	; VI-NEXT: s_waitcnt vmcnt(0)			; VI-NEXT: s_waitcnt vmcnt(0)
	; VI-NEXT: v_bfi_b32 v1, s0, v4, v1			; VI-NEXT: v_perm_b32 v1, s4, v1, v4
	; VI-NEXT: flat_store_dwordx2 v[2:3], v[0:1]			; VI-NEXT: flat_store_dwordx2 v[2:3], v[0:1]
	; VI-NEXT: s_endpgm			; VI-NEXT: s_endpgm
	;			;
	; CI-LABEL: v_insertelement_v4f16_2:			; CI-LABEL: v_insertelement_v4f16_2:
	; CI: ; %bb.0:			; CI: ; %bb.0:
	; CI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0			; CI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
	; CI-NEXT: s_load_dword s4, s[4:5], 0xc			; CI-NEXT: s_load_dword s4, s[4:5], 0xc
	; CI-NEXT: v_lshlrev_b32_e32 v2, 3, v0			; CI-NEXT: v_lshlrev_b32_e32 v2, 3, v0
	[... 129 lines not shown ...]
	; GFX9-NEXT: global_store_dwordx2 v2, v[0:1], s[0:1]			; GFX9-NEXT: global_store_dwordx2 v2, v[0:1], s[0:1]
	; GFX9-NEXT: s_endpgm			; GFX9-NEXT: s_endpgm
	;			;
	; VI-LABEL: v_insertelement_v4i16_2:			; VI-LABEL: v_insertelement_v4i16_2:
	; VI: ; %bb.0:			; VI: ; %bb.0:
	; VI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0			; VI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
	; VI-NEXT: s_load_dword s4, s[4:5], 0x10			; VI-NEXT: s_load_dword s4, s[4:5], 0x10
	; VI-NEXT: v_lshlrev_b32_e32 v2, 3, v0			; VI-NEXT: v_lshlrev_b32_e32 v2, 3, v0
				; VI-NEXT: v_mov_b32_e32 v4, 0x4050203
	; VI-NEXT: s_waitcnt lgkmcnt(0)			; VI-NEXT: s_waitcnt lgkmcnt(0)
	; VI-NEXT: v_mov_b32_e32 v1, s3			; VI-NEXT: v_mov_b32_e32 v1, s3
	; VI-NEXT: v_add_u32_e32 v0, vcc, s2, v2			; VI-NEXT: v_add_u32_e32 v0, vcc, s2, v2
	; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; VI-NEXT: flat_load_dwordx2 v[0:1], v[0:1]			; VI-NEXT: flat_load_dwordx2 v[0:1], v[0:1]
	; VI-NEXT: v_mov_b32_e32 v3, s1			; VI-NEXT: v_mov_b32_e32 v3, s1
	; VI-NEXT: v_add_u32_e32 v2, vcc, s0, v2			; VI-NEXT: v_add_u32_e32 v2, vcc, s0, v2
	; VI-NEXT: s_mov_b32 s0, 0xffff
	; VI-NEXT: v_mov_b32_e32 v4, s4
	; VI-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc			; VI-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc
	; VI-NEXT: s_waitcnt vmcnt(0)			; VI-NEXT: s_waitcnt vmcnt(0)
	; VI-NEXT: v_bfi_b32 v1, s0, v4, v1			; VI-NEXT: v_perm_b32 v1, s4, v1, v4
	; VI-NEXT: flat_store_dwordx2 v[2:3], v[0:1]			; VI-NEXT: flat_store_dwordx2 v[2:3], v[0:1]
	; VI-NEXT: s_endpgm			; VI-NEXT: s_endpgm
	;			;
	; CI-LABEL: v_insertelement_v4i16_2:			; CI-LABEL: v_insertelement_v4i16_2:
	; CI: ; %bb.0:			; CI: ; %bb.0:
	; CI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0			; CI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
	; CI-NEXT: s_load_dword s4, s[4:5], 0x4			; CI-NEXT: s_load_dword s4, s[4:5], 0x4
	; CI-NEXT: v_lshlrev_b32_e32 v2, 3, v0			; CI-NEXT: v_lshlrev_b32_e32 v2, 3, v0
	[... 753 lines not shown ...]
	; GFX9-NEXT: global_store_dwordx4 v8, v[0:3], s[0:1]			; GFX9-NEXT: global_store_dwordx4 v8, v[0:3], s[0:1]
	; GFX9-NEXT: s_endpgm			; GFX9-NEXT: s_endpgm
	;			;
	; VI-LABEL: v_insertelement_v16i16_6:			; VI-LABEL: v_insertelement_v16i16_6:
	; VI: ; %bb.0:			; VI: ; %bb.0:
	; VI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0			; VI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
	; VI-NEXT: s_load_dword s4, s[4:5], 0x10			; VI-NEXT: s_load_dword s4, s[4:5], 0x10
	; VI-NEXT: v_lshlrev_b32_e32 v8, 5, v0			; VI-NEXT: v_lshlrev_b32_e32 v8, 5, v0
				; VI-NEXT: v_mov_b32_e32 v12, 0x4050203
	; VI-NEXT: s_waitcnt lgkmcnt(0)			; VI-NEXT: s_waitcnt lgkmcnt(0)
	; VI-NEXT: v_mov_b32_e32 v1, s3			; VI-NEXT: v_mov_b32_e32 v1, s3
	; VI-NEXT: v_add_u32_e32 v0, vcc, s2, v8			; VI-NEXT: v_add_u32_e32 v0, vcc, s2, v8
	; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; VI-NEXT: v_add_u32_e32 v4, vcc, 16, v0			; VI-NEXT: v_add_u32_e32 v4, vcc, 16, v0
	; VI-NEXT: v_addc_u32_e32 v5, vcc, 0, v1, vcc			; VI-NEXT: v_addc_u32_e32 v5, vcc, 0, v1, vcc
	; VI-NEXT: flat_load_dwordx4 v[0:3], v[0:1]			; VI-NEXT: flat_load_dwordx4 v[0:3], v[0:1]
	; VI-NEXT: flat_load_dwordx4 v[4:7], v[4:5]			; VI-NEXT: flat_load_dwordx4 v[4:7], v[4:5]
	; VI-NEXT: v_mov_b32_e32 v9, s1			; VI-NEXT: v_mov_b32_e32 v9, s1
	; VI-NEXT: v_add_u32_e32 v8, vcc, s0, v8			; VI-NEXT: v_add_u32_e32 v8, vcc, s0, v8
	; VI-NEXT: v_addc_u32_e32 v9, vcc, 0, v9, vcc			; VI-NEXT: v_addc_u32_e32 v9, vcc, 0, v9, vcc
	; VI-NEXT: v_add_u32_e32 v10, vcc, 16, v8			; VI-NEXT: v_add_u32_e32 v10, vcc, 16, v8
	; VI-NEXT: s_mov_b32 s2, 0xffff
	; VI-NEXT: v_mov_b32_e32 v12, s4
	; VI-NEXT: v_addc_u32_e32 v11, vcc, 0, v9, vcc			; VI-NEXT: v_addc_u32_e32 v11, vcc, 0, v9, vcc
	; VI-NEXT: s_waitcnt vmcnt(1)			; VI-NEXT: s_waitcnt vmcnt(1)
	; VI-NEXT: v_bfi_b32 v3, s2, v12, v3			; VI-NEXT: v_perm_b32 v3, s4, v3, v12
	; VI-NEXT: s_waitcnt vmcnt(0)			; VI-NEXT: s_waitcnt vmcnt(0)
	; VI-NEXT: flat_store_dwordx4 v[10:11], v[4:7]			; VI-NEXT: flat_store_dwordx4 v[10:11], v[4:7]
	; VI-NEXT: flat_store_dwordx4 v[8:9], v[0:3]			; VI-NEXT: flat_store_dwordx4 v[8:9], v[0:3]
	; VI-NEXT: s_endpgm			; VI-NEXT: s_endpgm
	;			;
	; CI-LABEL: v_insertelement_v16i16_6:			; CI-LABEL: v_insertelement_v16i16_6:
	; CI: ; %bb.0:			; CI: ; %bb.0:
	; CI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0			; CI-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
	[458 lines not shown]

llvm/test/CodeGen/AMDGPU/load-hi16.ll

	[2,252 lines not shown]
	; GFX906-NEXT: s_setpc_b64 s[30:31]			; GFX906-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX803-LABEL: load_local_v2i16_split_multi_chain:			; GFX803-LABEL: load_local_v2i16_split_multi_chain:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: s_mov_b32 m0, -1			; GFX803-NEXT: s_mov_b32 m0, -1
	; GFX803-NEXT: ds_read_u16 v1, v0			; GFX803-NEXT: ds_read_u16 v1, v0
	; GFX803-NEXT: ds_read_u16 v0, v0 offset:2			; GFX803-NEXT: ds_read_u16 v0, v0 offset:2
				; GFX803-NEXT: s_mov_b32 s4, 0x4050001
	; GFX803-NEXT: s_waitcnt lgkmcnt(0)			; GFX803-NEXT: s_waitcnt lgkmcnt(0)
	; GFX803-NEXT: v_lshlrev_b32_e32 v0, 16, v0			; GFX803-NEXT: v_perm_b32 v0, v1, v0, s4
	; GFX803-NEXT: v_or_b32_e32 v0, v1, v0
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX900-FLATSCR-LABEL: load_local_v2i16_split_multi_chain:			; GFX900-FLATSCR-LABEL: load_local_v2i16_split_multi_chain:
	; GFX900-FLATSCR: ; %bb.0: ; %entry			; GFX900-FLATSCR: ; %bb.0: ; %entry
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX900-FLATSCR-NEXT: ds_read_u16 v1, v0			; GFX900-FLATSCR-NEXT: ds_read_u16 v1, v0
	; GFX900-FLATSCR-NEXT: s_waitcnt lgkmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt lgkmcnt(0)
	; GFX900-FLATSCR-NEXT: ds_read_u16_d16_hi v1, v0 offset:2			; GFX900-FLATSCR-NEXT: ds_read_u16_d16_hi v1, v0 offset:2
	[29 lines not shown]
	; GFX906-NEXT: s_waitcnt lgkmcnt(0)			; GFX906-NEXT: s_waitcnt lgkmcnt(0)
	; GFX906-NEXT: v_perm_b32 v0, v0, v1, s4			; GFX906-NEXT: v_perm_b32 v0, v0, v1, s4
	; GFX906-NEXT: s_setpc_b64 s[30:31]			; GFX906-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX803-LABEL: load_local_lo_hi_v2i16_samechain:			; GFX803-LABEL: load_local_lo_hi_v2i16_samechain:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: s_mov_b32 m0, -1			; GFX803-NEXT: s_mov_b32 m0, -1
	; GFX803-NEXT: ds_read_u16 v1, v0 offset:16			; GFX803-NEXT: ds_read_u16 v1, v0
	; GFX803-NEXT: ds_read_u16 v0, v0			; GFX803-NEXT: ds_read_u16 v0, v0 offset:16
	; GFX803-NEXT: s_waitcnt lgkmcnt(1)			; GFX803-NEXT: s_mov_b32 s4, 0x4050001
	; GFX803-NEXT: v_lshlrev_b32_e32 v1, 16, v1
	; GFX803-NEXT: s_waitcnt lgkmcnt(0)			; GFX803-NEXT: s_waitcnt lgkmcnt(0)
	; GFX803-NEXT: v_or_b32_e32 v0, v0, v1			; GFX803-NEXT: v_perm_b32 v0, v1, v0, s4
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX900-FLATSCR-LABEL: load_local_lo_hi_v2i16_samechain:			; GFX900-FLATSCR-LABEL: load_local_lo_hi_v2i16_samechain:
	; GFX900-FLATSCR: ; %bb.0: ; %entry			; GFX900-FLATSCR: ; %bb.0: ; %entry
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX900-FLATSCR-NEXT: ds_read_u16 v1, v0			; GFX900-FLATSCR-NEXT: ds_read_u16 v1, v0
	; GFX900-FLATSCR-NEXT: s_waitcnt lgkmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt lgkmcnt(0)
	; GFX900-FLATSCR-NEXT: ds_read_u16_d16_hi v1, v0 offset:16			; GFX900-FLATSCR-NEXT: ds_read_u16_d16_hi v1, v0 offset:16
	[83 lines not shown]
	; GFX803-LABEL: load_local_lo_hi_v2i16_side_effect:			; GFX803-LABEL: load_local_lo_hi_v2i16_side_effect:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: s_mov_b32 m0, -1			; GFX803-NEXT: s_mov_b32 m0, -1
	; GFX803-NEXT: v_mov_b32_e32 v3, 0x7b			; GFX803-NEXT: v_mov_b32_e32 v3, 0x7b
	; GFX803-NEXT: ds_read_u16 v2, v0			; GFX803-NEXT: ds_read_u16 v2, v0
	; GFX803-NEXT: ds_write_b16 v1, v3			; GFX803-NEXT: ds_write_b16 v1, v3
	; GFX803-NEXT: ds_read_u16 v0, v0 offset:16			; GFX803-NEXT: ds_read_u16 v0, v0 offset:16
				; GFX803-NEXT: s_mov_b32 s4, 0x4050001
	; GFX803-NEXT: s_waitcnt lgkmcnt(0)			; GFX803-NEXT: s_waitcnt lgkmcnt(0)
	; GFX803-NEXT: v_lshlrev_b32_e32 v0, 16, v0			; GFX803-NEXT: v_perm_b32 v0, v2, v0, s4
	; GFX803-NEXT: v_or_b32_e32 v0, v2, v0
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX900-FLATSCR-LABEL: load_local_lo_hi_v2i16_side_effect:			; GFX900-FLATSCR-LABEL: load_local_lo_hi_v2i16_side_effect:
	; GFX900-FLATSCR: ; %bb.0: ; %entry			; GFX900-FLATSCR: ; %bb.0: ; %entry
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX900-FLATSCR-NEXT: ds_read_u16 v2, v0			; GFX900-FLATSCR-NEXT: ds_read_u16 v2, v0
	; GFX900-FLATSCR-NEXT: v_mov_b32_e32 v3, 0x7b			; GFX900-FLATSCR-NEXT: v_mov_b32_e32 v3, 0x7b
	; GFX900-FLATSCR-NEXT: ds_write_b16 v1, v3			; GFX900-FLATSCR-NEXT: ds_write_b16 v1, v3
	[39 lines not shown]
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: v_add_u32_e32 v2, vcc, 2, v0			; GFX803-NEXT: v_add_u32_e32 v2, vcc, 2, v0
	; GFX803-NEXT: v_addc_u32_e32 v3, vcc, 0, v1, vcc			; GFX803-NEXT: v_addc_u32_e32 v3, vcc, 0, v1, vcc
	; GFX803-NEXT: flat_load_ushort v0, v[0:1] glc			; GFX803-NEXT: flat_load_ushort v0, v[0:1] glc
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: flat_load_ushort v1, v[2:3] glc			; GFX803-NEXT: flat_load_ushort v1, v[2:3] glc
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: v_lshlrev_b32_e32 v1, 16, v1			; GFX803-NEXT: s_mov_b32 s4, 0x4050001
	; GFX803-NEXT: v_or_b32_e32 v0, v0, v1			; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX900-FLATSCR-LABEL: load_global_v2i16_split:			; GFX900-FLATSCR-LABEL: load_global_v2i16_split:
	; GFX900-FLATSCR: ; %bb.0: ; %entry			; GFX900-FLATSCR: ; %bb.0: ; %entry
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX900-FLATSCR-NEXT: global_load_ushort v2, v[0:1], off glc			; GFX900-FLATSCR-NEXT: global_load_ushort v2, v[0:1], off glc
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
	; GFX900-FLATSCR-NEXT: global_load_short_d16_hi v2, v[0:1], off offset:2 glc			; GFX900-FLATSCR-NEXT: global_load_short_d16_hi v2, v[0:1], off offset:2 glc
	[36 lines not shown]
	; GFX803-LABEL: load_flat_v2i16_split:			; GFX803-LABEL: load_flat_v2i16_split:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: v_add_u32_e32 v2, vcc, 2, v0			; GFX803-NEXT: v_add_u32_e32 v2, vcc, 2, v0
	; GFX803-NEXT: v_addc_u32_e32 v3, vcc, 0, v1, vcc			; GFX803-NEXT: v_addc_u32_e32 v3, vcc, 0, v1, vcc
	; GFX803-NEXT: flat_load_ushort v0, v[0:1] glc			; GFX803-NEXT: flat_load_ushort v0, v[0:1] glc
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: flat_load_ushort v1, v[2:3] glc			; GFX803-NEXT: flat_load_ushort v1, v[2:3] glc
	; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: v_lshlrev_b32_e32 v1, 16, v1			; GFX803-NEXT: s_mov_b32 s4, 0x4050001
	; GFX803-NEXT: v_or_b32_e32 v0, v0, v1			; GFX803-NEXT: s_waitcnt lgkmcnt(0)
				; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX900-FLATSCR-LABEL: load_flat_v2i16_split:			; GFX900-FLATSCR-LABEL: load_flat_v2i16_split:
	; GFX900-FLATSCR: ; %bb.0: ; %entry			; GFX900-FLATSCR: ; %bb.0: ; %entry
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX900-FLATSCR-NEXT: flat_load_ushort v2, v[0:1] glc			; GFX900-FLATSCR-NEXT: flat_load_ushort v2, v[0:1] glc
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
	; GFX900-FLATSCR-NEXT: flat_load_short_d16_hi v2, v[0:1] offset:2 glc			; GFX900-FLATSCR-NEXT: flat_load_short_d16_hi v2, v[0:1] offset:2 glc
	[33 lines not shown]
	;			;
	; GFX803-LABEL: load_constant_v2i16_split:			; GFX803-LABEL: load_constant_v2i16_split:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: v_add_u32_e32 v2, vcc, 2, v0			; GFX803-NEXT: v_add_u32_e32 v2, vcc, 2, v0
	; GFX803-NEXT: v_addc_u32_e32 v3, vcc, 0, v1, vcc			; GFX803-NEXT: v_addc_u32_e32 v3, vcc, 0, v1, vcc
	; GFX803-NEXT: flat_load_ushort v0, v[0:1] glc			; GFX803-NEXT: flat_load_ushort v0, v[0:1] glc
	; GFX803-NEXT: flat_load_ushort v1, v[2:3] glc			; GFX803-NEXT: flat_load_ushort v1, v[2:3] glc
				; GFX803-NEXT: s_mov_b32 s4, 0x4050001
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: v_lshlrev_b32_e32 v1, 16, v1			; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4
	; GFX803-NEXT: v_or_b32_e32 v0, v0, v1
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX900-FLATSCR-LABEL: load_constant_v2i16_split:			; GFX900-FLATSCR-LABEL: load_constant_v2i16_split:
	; GFX900-FLATSCR: ; %bb.0: ; %entry			; GFX900-FLATSCR: ; %bb.0: ; %entry
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX900-FLATSCR-NEXT: global_load_ushort v2, v[0:1], off glc			; GFX900-FLATSCR-NEXT: global_load_ushort v2, v[0:1], off glc
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
	; GFX900-FLATSCR-NEXT: global_load_short_d16_hi v2, v[0:1], off offset:2 glc			; GFX900-FLATSCR-NEXT: global_load_short_d16_hi v2, v[0:1], off offset:2 glc
	[34 lines not shown]
	;			;
	; GFX803-LABEL: load_private_v2i16_split:			; GFX803-LABEL: load_private_v2i16_split:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: buffer_load_ushort v0, off, s[0:3], s32 glc			; GFX803-NEXT: buffer_load_ushort v0, off, s[0:3], s32 glc
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:2 glc			; GFX803-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:2 glc
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: v_lshlrev_b32_e32 v1, 16, v1			; GFX803-NEXT: s_mov_b32 s4, 0x4050001
	; GFX803-NEXT: v_or_b32_e32 v0, v0, v1			; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX900-FLATSCR-LABEL: load_private_v2i16_split:			; GFX900-FLATSCR-LABEL: load_private_v2i16_split:
	; GFX900-FLATSCR: ; %bb.0: ; %entry			; GFX900-FLATSCR: ; %bb.0: ; %entry
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX900-FLATSCR-NEXT: scratch_load_ushort v0, off, s32 glc			; GFX900-FLATSCR-NEXT: scratch_load_ushort v0, off, s32 glc
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0)
	; GFX900-FLATSCR-NEXT: scratch_load_short_d16_hi v0, off, s32 offset:2 glc			; GFX900-FLATSCR-NEXT: scratch_load_short_d16_hi v0, off, s32 offset:2 glc
	[71 lines not shown]

llvm/test/CodeGen/AMDGPU/load-lo16.ll

	[215 lines not shown]
	; GFX906-NEXT: s_waitcnt vmcnt(0)			; GFX906-NEXT: s_waitcnt vmcnt(0)
	; GFX906-NEXT: s_setpc_b64 s[30:31]			; GFX906-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX803-LABEL: load_local_lo_v2f16_reghi_vreg:			; GFX803-LABEL: load_local_lo_v2f16_reghi_vreg:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: s_mov_b32 m0, -1			; GFX803-NEXT: s_mov_b32 m0, -1
	; GFX803-NEXT: ds_read_u16 v0, v0			; GFX803-NEXT: ds_read_u16 v0, v0
	; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v1			; GFX803-NEXT: s_mov_b32 s4, 0x4050203
	; GFX803-NEXT: s_waitcnt lgkmcnt(0)			; GFX803-NEXT: s_waitcnt lgkmcnt(0)
	; GFX803-NEXT: v_or_b32_e32 v0, v0, v1			; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4
	; GFX803-NEXT: flat_store_dword v[0:1], v0			; GFX803-NEXT: flat_store_dword v[0:1], v0
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	entry:			entry:
	%reg.bc = bitcast i32 %reg to <2 x half>			%reg.bc = bitcast i32 %reg to <2 x half>
	%load = load half, half addrspace(3)* %in			%load = load half, half addrspace(3)* %in
	%build1 = insertelement <2 x half> %reg.bc, half %load, i32 0			%build1 = insertelement <2 x half> %reg.bc, half %load, i32 0
	store <2 x half> %build1, <2 x half> addrspace(1)* undef			store <2 x half> %build1, <2 x half> addrspace(1)* undef
	[442 lines not shown]
	; GFX906-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)			; GFX906-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
	; GFX906-NEXT: s_setpc_b64 s[30:31]			; GFX906-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX803-LABEL: load_local_lo_v2i16_reghi_vreg_multi_use_hi:			; GFX803-LABEL: load_local_lo_v2i16_reghi_vreg_multi_use_hi:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: s_mov_b32 m0, -1			; GFX803-NEXT: s_mov_b32 m0, -1
	; GFX803-NEXT: ds_read_u16 v0, v0			; GFX803-NEXT: ds_read_u16 v0, v0
				; GFX803-NEXT: s_mov_b32 s4, 0x4050203
	; GFX803-NEXT: v_lshrrev_b32_e32 v2, 16, v1			; GFX803-NEXT: v_lshrrev_b32_e32 v2, 16, v1
	; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v1
	; GFX803-NEXT: v_mov_b32_e32 v3, 0			; GFX803-NEXT: v_mov_b32_e32 v3, 0
	; GFX803-NEXT: ds_write_b16 v3, v2			; GFX803-NEXT: ds_write_b16 v3, v2
	; GFX803-NEXT: s_waitcnt lgkmcnt(1)			; GFX803-NEXT: s_waitcnt lgkmcnt(1)
	; GFX803-NEXT: v_or_b32_e32 v0, v0, v1			; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4
	; GFX803-NEXT: flat_store_dword v[0:1], v0			; GFX803-NEXT: flat_store_dword v[0:1], v0
	; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	entry:			entry:
	%load = load i16, i16 addrspace(3)* %in			%load = load i16, i16 addrspace(3)* %in
	%elt1 = extractelement <2 x i16> %reg, i32 1			%elt1 = extractelement <2 x i16> %reg, i32 1
	store i16 %elt1, i16 addrspace(3)* null			store i16 %elt1, i16 addrspace(3)* null
	%build1 = insertelement <2 x i16> %reg, i16 %load, i32 0			%build1 = insertelement <2 x i16> %reg, i16 %load, i32 0
	[90 lines not shown]
	; GFX906-NEXT: s_setpc_b64 s[30:31]			; GFX906-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX803-LABEL: load_global_lo_v2i16_reglo_vreg:			; GFX803-LABEL: load_global_lo_v2i16_reglo_vreg:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: v_add_u32_e32 v0, vcc, 0xfffff002, v0			; GFX803-NEXT: v_add_u32_e32 v0, vcc, 0xfffff002, v0
	; GFX803-NEXT: v_addc_u32_e32 v1, vcc, -1, v1, vcc			; GFX803-NEXT: v_addc_u32_e32 v1, vcc, -1, v1, vcc
	; GFX803-NEXT: flat_load_ushort v0, v[0:1]			; GFX803-NEXT: flat_load_ushort v0, v[0:1]
	; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v2			; GFX803-NEXT: s_mov_b32 s4, 0x4050203
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: v_or_b32_e32 v0, v0, v1			; GFX803-NEXT: v_perm_b32 v0, v0, v2, s4
	; GFX803-NEXT: flat_store_dword v[0:1], v0			; GFX803-NEXT: flat_store_dword v[0:1], v0
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	entry:			entry:
	%reg.bc = bitcast i32 %reg to <2 x i16>			%reg.bc = bitcast i32 %reg to <2 x i16>
	%gep = getelementptr inbounds i16, i16 addrspace(1)* %in, i64 -2047			%gep = getelementptr inbounds i16, i16 addrspace(1)* %in, i64 -2047
	%load = load i16, i16 addrspace(1)* %gep			%load = load i16, i16 addrspace(1)* %gep
	%build1 = insertelement <2 x i16> %reg.bc, i16 %load, i32 0			%build1 = insertelement <2 x i16> %reg.bc, i16 %load, i32 0
	[23 lines not shown]
	; GFX906-NEXT: s_setpc_b64 s[30:31]			; GFX906-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX803-LABEL: load_global_lo_v2f16_reglo_vreg:			; GFX803-LABEL: load_global_lo_v2f16_reglo_vreg:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: v_add_u32_e32 v0, vcc, 0xfffff002, v0			; GFX803-NEXT: v_add_u32_e32 v0, vcc, 0xfffff002, v0
	; GFX803-NEXT: v_addc_u32_e32 v1, vcc, -1, v1, vcc			; GFX803-NEXT: v_addc_u32_e32 v1, vcc, -1, v1, vcc
	; GFX803-NEXT: flat_load_ushort v0, v[0:1]			; GFX803-NEXT: flat_load_ushort v0, v[0:1]
	; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v2			; GFX803-NEXT: s_mov_b32 s4, 0x4050203
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: v_or_b32_e32 v0, v0, v1			; GFX803-NEXT: v_perm_b32 v0, v0, v2, s4
	; GFX803-NEXT: flat_store_dword v[0:1], v0			; GFX803-NEXT: flat_store_dword v[0:1], v0
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	entry:			entry:
	%reg.bc = bitcast i32 %reg to <2 x half>			%reg.bc = bitcast i32 %reg to <2 x half>
	%gep = getelementptr inbounds half, half addrspace(1)* %in, i64 -2047			%gep = getelementptr inbounds half, half addrspace(1)* %in, i64 -2047
	%load = load half, half addrspace(1)* %gep			%load = load half, half addrspace(1)* %gep
	%build1 = insertelement <2 x half> %reg.bc, half %load, i32 0			%build1 = insertelement <2 x half> %reg.bc, half %load, i32 0
	[195 lines not shown]
	; GFX906-NEXT: global_store_dword v[0:1], v0, off			; GFX906-NEXT: global_store_dword v[0:1], v0, off
	; GFX906-NEXT: s_waitcnt vmcnt(0)			; GFX906-NEXT: s_waitcnt vmcnt(0)
	; GFX906-NEXT: s_setpc_b64 s[30:31]			; GFX906-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX803-LABEL: load_flat_lo_v2i16_reghi_vreg:			; GFX803-LABEL: load_flat_lo_v2i16_reghi_vreg:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: flat_load_ushort v0, v[0:1]			; GFX803-NEXT: flat_load_ushort v0, v[0:1]
	; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v2			; GFX803-NEXT: s_mov_b32 s4, 0x4050203
	; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
	; GFX803-NEXT: v_or_b32_e32 v0, v0, v1			; GFX803-NEXT: v_perm_b32 v0, v0, v2, s4
	; GFX803-NEXT: flat_store_dword v[0:1], v0			; GFX803-NEXT: flat_store_dword v[0:1], v0
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	entry:			entry:
	%reg.bc = bitcast i32 %reg to <2 x i16>			%reg.bc = bitcast i32 %reg to <2 x i16>
	%load = load i16, i16* %in			%load = load i16, i16* %in
	%build1 = insertelement <2 x i16> %reg.bc, i16 %load, i32 0			%build1 = insertelement <2 x i16> %reg.bc, i16 %load, i32 0
	store <2 x i16> %build1, <2 x i16> addrspace(1)* undef			store <2 x i16> %build1, <2 x i16> addrspace(1)* undef
	[20 lines not shown]
	; GFX906-NEXT: global_store_dword v[0:1], v0, off			; GFX906-NEXT: global_store_dword v[0:1], v0, off
	; GFX906-NEXT: s_waitcnt vmcnt(0)			; GFX906-NEXT: s_waitcnt vmcnt(0)
	; GFX906-NEXT: s_setpc_b64 s[30:31]			; GFX906-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX803-LABEL: load_flat_lo_v2f16_reghi_vreg:			; GFX803-LABEL: load_flat_lo_v2f16_reghi_vreg:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: flat_load_ushort v0, v[0:1]			; GFX803-NEXT: flat_load_ushort v0, v[0:1]
	; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v2			; GFX803-NEXT: s_mov_b32 s4, 0x4050203
	; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
	; GFX803-NEXT: v_or_b32_e32 v0, v0, v1			; GFX803-NEXT: v_perm_b32 v0, v0, v2, s4
	; GFX803-NEXT: flat_store_dword v[0:1], v0			; GFX803-NEXT: flat_store_dword v[0:1], v0
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]

	; FIXME: the and above should be removable			; FIXME: the and above should be removable
	entry:			entry:
	%reg.bc = bitcast i32 %reg to <2 x half>			%reg.bc = bitcast i32 %reg to <2 x half>
	%load = load half, half* %in			%load = load half, half* %in
	[184 lines not shown]
	; GFX906-NEXT: global_store_dword v[0:1], v0, off			; GFX906-NEXT: global_store_dword v[0:1], v0, off
	; GFX906-NEXT: s_waitcnt vmcnt(0)			; GFX906-NEXT: s_waitcnt vmcnt(0)
	; GFX906-NEXT: s_setpc_b64 s[30:31]			; GFX906-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX803-LABEL: load_private_lo_v2i16_reglo_vreg:			; GFX803-LABEL: load_private_lo_v2i16_reglo_vreg:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094			; GFX803-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094
	; GFX803-NEXT: v_and_b32_e32 v0, 0xffff0000, v0			; GFX803-NEXT: s_mov_b32 s4, 0x4050203
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: v_or_b32_e32 v0, v1, v0			; GFX803-NEXT: v_perm_b32 v0, v1, v0, s4
	; GFX803-NEXT: flat_store_dword v[0:1], v0			; GFX803-NEXT: flat_store_dword v[0:1], v0
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reglo_vreg:			; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reglo_vreg:
	; GFX900-FLATSCR: ; %bb.0: ; %entry			; GFX900-FLATSCR: ; %bb.0: ; %entry
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX900-FLATSCR-NEXT: scratch_load_short_d16 v0, off, s32 offset:4094			; GFX900-FLATSCR-NEXT: scratch_load_short_d16 v0, off, s32 offset:4094
	[83 lines not shown]
	; GFX906-NEXT: global_store_dword v[0:1], v0, off			; GFX906-NEXT: global_store_dword v[0:1], v0, off
	; GFX906-NEXT: s_waitcnt vmcnt(0)			; GFX906-NEXT: s_waitcnt vmcnt(0)
	; GFX906-NEXT: s_setpc_b64 s[30:31]			; GFX906-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX803-LABEL: load_private_lo_v2f16_reglo_vreg:			; GFX803-LABEL: load_private_lo_v2f16_reglo_vreg:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094			; GFX803-NEXT: buffer_load_ushort v1, off, s[0:3], s32 offset:4094
	; GFX803-NEXT: v_and_b32_e32 v0, 0xffff0000, v0			; GFX803-NEXT: s_mov_b32 s4, 0x4050203
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: v_or_b32_e32 v0, v1, v0			; GFX803-NEXT: v_perm_b32 v0, v1, v0, s4
	; GFX803-NEXT: flat_store_dword v[0:1], v0			; GFX803-NEXT: flat_store_dword v[0:1], v0
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX900-FLATSCR-LABEL: load_private_lo_v2f16_reglo_vreg:			; GFX900-FLATSCR-LABEL: load_private_lo_v2f16_reglo_vreg:
	; GFX900-FLATSCR: ; %bb.0: ; %entry			; GFX900-FLATSCR: ; %bb.0: ; %entry
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX900-FLATSCR-NEXT: scratch_load_short_d16 v0, off, s32 offset:4094			; GFX900-FLATSCR-NEXT: scratch_load_short_d16 v0, off, s32 offset:4094
	[31 lines not shown]
	; GFX906-NEXT: s_waitcnt vmcnt(0)			; GFX906-NEXT: s_waitcnt vmcnt(0)
	; GFX906-NEXT: s_setpc_b64 s[30:31]			; GFX906-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX803-LABEL: load_private_lo_v2i16_reglo_vreg_nooff:			; GFX803-LABEL: load_private_lo_v2i16_reglo_vreg_nooff:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094 glc			; GFX803-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094 glc
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v1			; GFX803-NEXT: s_mov_b32 s4, 0x4050203
	; GFX803-NEXT: v_or_b32_e32 v0, v0, v1			; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4
	; GFX803-NEXT: flat_store_dword v[0:1], v0			; GFX803-NEXT: flat_store_dword v[0:1], v0
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reglo_vreg_nooff:			; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reglo_vreg_nooff:
	; GFX900-FLATSCR: ; %bb.0: ; %entry			; GFX900-FLATSCR: ; %bb.0: ; %entry
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX900-FLATSCR-NEXT: s_movk_i32 s0, 0xffe			; GFX900-FLATSCR-NEXT: s_movk_i32 s0, 0xffe
	[31 lines not shown]
	; GFX906-NEXT: s_waitcnt vmcnt(0)			; GFX906-NEXT: s_waitcnt vmcnt(0)
	; GFX906-NEXT: s_setpc_b64 s[30:31]			; GFX906-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX803-LABEL: load_private_lo_v2i16_reghi_vreg_nooff:			; GFX803-LABEL: load_private_lo_v2i16_reghi_vreg_nooff:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094 glc			; GFX803-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094 glc
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v1			; GFX803-NEXT: s_mov_b32 s4, 0x4050203
	; GFX803-NEXT: v_or_b32_e32 v0, v0, v1			; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4
	; GFX803-NEXT: flat_store_dword v[0:1], v0			; GFX803-NEXT: flat_store_dword v[0:1], v0
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reghi_vreg_nooff:			; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reghi_vreg_nooff:
	; GFX900-FLATSCR: ; %bb.0: ; %entry			; GFX900-FLATSCR: ; %bb.0: ; %entry
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX900-FLATSCR-NEXT: s_movk_i32 s0, 0xffe			; GFX900-FLATSCR-NEXT: s_movk_i32 s0, 0xffe
	[31 lines not shown]
	; GFX906-NEXT: s_waitcnt vmcnt(0)			; GFX906-NEXT: s_waitcnt vmcnt(0)
	; GFX906-NEXT: s_setpc_b64 s[30:31]			; GFX906-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX803-LABEL: load_private_lo_v2f16_reglo_vreg_nooff:			; GFX803-LABEL: load_private_lo_v2f16_reglo_vreg_nooff:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094 glc			; GFX803-NEXT: buffer_load_ushort v0, off, s[0:3], 0 offset:4094 glc
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v1			; GFX803-NEXT: s_mov_b32 s4, 0x4050203
	; GFX803-NEXT: v_or_b32_e32 v0, v0, v1			; GFX803-NEXT: v_perm_b32 v0, v0, v1, s4
	; GFX803-NEXT: flat_store_dword v[0:1], v0			; GFX803-NEXT: flat_store_dword v[0:1], v0
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX900-FLATSCR-LABEL: load_private_lo_v2f16_reglo_vreg_nooff:			; GFX900-FLATSCR-LABEL: load_private_lo_v2f16_reglo_vreg_nooff:
	; GFX900-FLATSCR: ; %bb.0: ; %entry			; GFX900-FLATSCR: ; %bb.0: ; %entry
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX900-FLATSCR-NEXT: s_movk_i32 s0, 0xffe			; GFX900-FLATSCR-NEXT: s_movk_i32 s0, 0xffe
	[283 lines not shown]
	; GFX906-NEXT: s_setpc_b64 s[30:31]			; GFX906-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX803-LABEL: load_constant_lo_v2i16_reglo_vreg:			; GFX803-LABEL: load_constant_lo_v2i16_reglo_vreg:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: v_add_u32_e32 v0, vcc, 0xfffff002, v0			; GFX803-NEXT: v_add_u32_e32 v0, vcc, 0xfffff002, v0
	; GFX803-NEXT: v_addc_u32_e32 v1, vcc, -1, v1, vcc			; GFX803-NEXT: v_addc_u32_e32 v1, vcc, -1, v1, vcc
	; GFX803-NEXT: flat_load_ushort v0, v[0:1]			; GFX803-NEXT: flat_load_ushort v0, v[0:1]
	; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v2			; GFX803-NEXT: s_mov_b32 s4, 0x4050203
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: v_or_b32_e32 v0, v0, v1			; GFX803-NEXT: v_perm_b32 v0, v0, v2, s4
	; GFX803-NEXT: flat_store_dword v[0:1], v0			; GFX803-NEXT: flat_store_dword v[0:1], v0
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	entry:			entry:
	%reg.bc = bitcast i32 %reg to <2 x i16>			%reg.bc = bitcast i32 %reg to <2 x i16>
	%gep = getelementptr inbounds i16, i16 addrspace(4)* %in, i64 -2047			%gep = getelementptr inbounds i16, i16 addrspace(4)* %in, i64 -2047
	%load = load i16, i16 addrspace(4)* %gep			%load = load i16, i16 addrspace(4)* %gep
	%build1 = insertelement <2 x i16> %reg.bc, i16 %load, i32 0			%build1 = insertelement <2 x i16> %reg.bc, i16 %load, i32 0
	[23 lines not shown]
	; GFX906-NEXT: s_setpc_b64 s[30:31]			; GFX906-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX803-LABEL: load_constant_lo_v2f16_reglo_vreg:			; GFX803-LABEL: load_constant_lo_v2f16_reglo_vreg:
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: v_add_u32_e32 v0, vcc, 0xfffff002, v0			; GFX803-NEXT: v_add_u32_e32 v0, vcc, 0xfffff002, v0
	; GFX803-NEXT: v_addc_u32_e32 v1, vcc, -1, v1, vcc			; GFX803-NEXT: v_addc_u32_e32 v1, vcc, -1, v1, vcc
	; GFX803-NEXT: flat_load_ushort v0, v[0:1]			; GFX803-NEXT: flat_load_ushort v0, v[0:1]
	; GFX803-NEXT: v_and_b32_e32 v1, 0xffff0000, v2			; GFX803-NEXT: s_mov_b32 s4, 0x4050203
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: v_or_b32_e32 v0, v0, v1			; GFX803-NEXT: v_perm_b32 v0, v0, v2, s4
	; GFX803-NEXT: flat_store_dword v[0:1], v0			; GFX803-NEXT: flat_store_dword v[0:1], v0
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	entry:			entry:
	%reg.bc = bitcast i32 %reg to <2 x half>			%reg.bc = bitcast i32 %reg to <2 x half>
	%gep = getelementptr inbounds half, half addrspace(4)* %in, i64 -2047			%gep = getelementptr inbounds half, half addrspace(4)* %in, i64 -2047
	%load = load half, half addrspace(4)* %gep			%load = load half, half addrspace(4)* %gep
	%build1 = insertelement <2 x half> %reg.bc, half %load, i32 0			%build1 = insertelement <2 x half> %reg.bc, half %load, i32 0
	[122 lines not shown]
	; GFX803: ; %bb.0: ; %entry			; GFX803: ; %bb.0: ; %entry
	; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX803-NEXT: v_mov_b32_e32 v1, 0x7b			; GFX803-NEXT: v_mov_b32_e32 v1, 0x7b
	; GFX803-NEXT: buffer_store_dword v1, off, s[0:3], s32 offset:4			; GFX803-NEXT: buffer_store_dword v1, off, s[0:3], s32 offset:4
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: v_mov_b32_e32 v2, 44			; GFX803-NEXT: v_mov_b32_e32 v2, 44
	; GFX803-NEXT: buffer_load_ushort v1, v2, s[0:3], s32 offen offset:4054 glc			; GFX803-NEXT: buffer_load_ushort v1, v2, s[0:3], s32 offen offset:4054 glc
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: v_and_b32_e32 v0, 0xffff0000, v0			; GFX803-NEXT: s_mov_b32 s4, 0x4050203
	; GFX803-NEXT: v_or_b32_e32 v0, v1, v0			; GFX803-NEXT: v_perm_b32 v0, v1, v0, s4
	; GFX803-NEXT: flat_store_dword v[0:1], v0			; GFX803-NEXT: flat_store_dword v[0:1], v0
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: s_setpc_b64 s[30:31]			; GFX803-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reglo_vreg_to_offset:			; GFX900-FLATSCR-LABEL: load_private_lo_v2i16_reglo_vreg_to_offset:
	; GFX900-FLATSCR: ; %bb.0: ; %entry			; GFX900-FLATSCR: ; %bb.0: ; %entry
	; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX900-FLATSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX900-FLATSCR-NEXT: v_mov_b32_e32 v1, 0x7b			; GFX900-FLATSCR-NEXT: v_mov_b32_e32 v1, 0x7b
	[304 lines not shown]

llvm/test/CodeGen/AMDGPU/load-local.128.ll

	[65 lines not shown]
	; GFX9-NEXT: ds_read_u8 v9, v0 offset:8			; GFX9-NEXT: ds_read_u8 v9, v0 offset:8
	; GFX9-NEXT: ds_read_u8 v10, v0 offset:9			; GFX9-NEXT: ds_read_u8 v10, v0 offset:9
	; GFX9-NEXT: ds_read_u8 v11, v0 offset:10			; GFX9-NEXT: ds_read_u8 v11, v0 offset:10
	; GFX9-NEXT: ds_read_u8 v12, v0 offset:11			; GFX9-NEXT: ds_read_u8 v12, v0 offset:11
	; GFX9-NEXT: ds_read_u8 v13, v0 offset:12			; GFX9-NEXT: ds_read_u8 v13, v0 offset:12
	; GFX9-NEXT: ds_read_u8 v14, v0 offset:13			; GFX9-NEXT: ds_read_u8 v14, v0 offset:13
	; GFX9-NEXT: ds_read_u8 v15, v0 offset:14			; GFX9-NEXT: ds_read_u8 v15, v0 offset:14
	; GFX9-NEXT: ds_read_u8 v16, v0 offset:15			; GFX9-NEXT: ds_read_u8 v16, v0 offset:15
	; GFX9-NEXT: s_waitcnt lgkmcnt(14)
	; GFX9-NEXT: v_lshl_or_b32 v0, v2, 8, v1
	; GFX9-NEXT: s_waitcnt lgkmcnt(12)			; GFX9-NEXT: s_waitcnt lgkmcnt(12)
	; GFX9-NEXT: v_lshl_or_b32 v1, v4, 8, v3			; GFX9-NEXT: v_lshl_or_b32 v0, v4, 8, v3
	; GFX9-NEXT: v_lshl_or_b32 v0, v1, 16, v0			; GFX9-NEXT: v_lshl_or_b32 v1, v2, 8, v1
	; GFX9-NEXT: s_waitcnt lgkmcnt(10)			; GFX9-NEXT: s_mov_b32 s4, 0x4050001
	; GFX9-NEXT: v_lshl_or_b32 v1, v6, 8, v5			; GFX9-NEXT: v_perm_b32 v0, v1, v0, s4
	; GFX9-NEXT: s_waitcnt lgkmcnt(8)			; GFX9-NEXT: s_waitcnt lgkmcnt(8)
	; GFX9-NEXT: v_lshl_or_b32 v2, v8, 8, v7			; GFX9-NEXT: v_lshl_or_b32 v1, v8, 8, v7
	; GFX9-NEXT: v_lshl_or_b32 v1, v2, 16, v1			; GFX9-NEXT: v_lshl_or_b32 v2, v6, 8, v5
	; GFX9-NEXT: s_waitcnt lgkmcnt(6)			; GFX9-NEXT: v_perm_b32 v1, v2, v1, s4
	; GFX9-NEXT: v_lshl_or_b32 v2, v10, 8, v9
	; GFX9-NEXT: s_waitcnt lgkmcnt(4)			; GFX9-NEXT: s_waitcnt lgkmcnt(4)
	; GFX9-NEXT: v_lshl_or_b32 v3, v12, 8, v11			; GFX9-NEXT: v_lshl_or_b32 v2, v12, 8, v11
	; GFX9-NEXT: v_lshl_or_b32 v2, v3, 16, v2			; GFX9-NEXT: v_lshl_or_b32 v3, v10, 8, v9
	; GFX9-NEXT: s_waitcnt lgkmcnt(2)			; GFX9-NEXT: v_perm_b32 v2, v3, v2, s4
	; GFX9-NEXT: v_lshl_or_b32 v3, v14, 8, v13
	; GFX9-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-NEXT: v_lshl_or_b32 v4, v16, 8, v15			; GFX9-NEXT: v_lshl_or_b32 v3, v16, 8, v15
	; GFX9-NEXT: v_lshl_or_b32 v3, v4, 16, v3			; GFX9-NEXT: v_lshl_or_b32 v4, v14, 8, v13
				; GFX9-NEXT: v_perm_b32 v3, v4, v3, s4
	; GFX9-NEXT: s_setpc_b64 s[30:31]			; GFX9-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX7-LABEL: load_lds_v4i32_align1:			; GFX7-LABEL: load_lds_v4i32_align1:
	; GFX7: ; %bb.0:			; GFX7: ; %bb.0:
	; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX7-NEXT: s_mov_b32 m0, -1			; GFX7-NEXT: s_mov_b32 m0, -1
	; GFX7-NEXT: ds_read_u8 v1, v0 offset:6			; GFX7-NEXT: ds_read_u8 v1, v0 offset:6
	; GFX7-NEXT: ds_read_u8 v2, v0 offset:4			; GFX7-NEXT: ds_read_u8 v2, v0 offset:4
	[130 lines not shown]
	; GFX10-NEXT: ds_read_u8 v5, v0 offset:4			; GFX10-NEXT: ds_read_u8 v5, v0 offset:4
	; GFX10-NEXT: ds_read_u8 v6, v0 offset:5			; GFX10-NEXT: ds_read_u8 v6, v0 offset:5
	; GFX10-NEXT: ds_read_u8 v7, v0 offset:6			; GFX10-NEXT: ds_read_u8 v7, v0 offset:6
	; GFX10-NEXT: ds_read_u8 v8, v0 offset:7			; GFX10-NEXT: ds_read_u8 v8, v0 offset:7
	; GFX10-NEXT: ds_read_u8 v9, v0 offset:8			; GFX10-NEXT: ds_read_u8 v9, v0 offset:8
	; GFX10-NEXT: ds_read_u8 v10, v0 offset:9			; GFX10-NEXT: ds_read_u8 v10, v0 offset:9
	; GFX10-NEXT: ds_read_u8 v11, v0 offset:10			; GFX10-NEXT: ds_read_u8 v11, v0 offset:10
	; GFX10-NEXT: ds_read_u8 v12, v0 offset:11			; GFX10-NEXT: ds_read_u8 v12, v0 offset:11
	; GFX10-NEXT: ds_read_u8 v13, v0 offset:12			; GFX10-NEXT: ds_read_u8 v13, v0 offset:14
	; GFX10-NEXT: ds_read_u8 v14, v0 offset:13			; GFX10-NEXT: ds_read_u8 v14, v0 offset:15
	; GFX10-NEXT: ds_read_u8 v15, v0 offset:14			; GFX10-NEXT: ds_read_u8 v15, v0 offset:12
	; GFX10-NEXT: ds_read_u8 v0, v0 offset:15			; GFX10-NEXT: ds_read_u8 v0, v0 offset:13
	; GFX10-NEXT: s_waitcnt lgkmcnt(14)			; GFX10-NEXT: s_waitcnt lgkmcnt(14)
	; GFX10-NEXT: v_lshl_or_b32 v1, v2, 8, v1			; GFX10-NEXT: v_lshl_or_b32 v1, v2, 8, v1
	; GFX10-NEXT: s_waitcnt lgkmcnt(12)			; GFX10-NEXT: s_waitcnt lgkmcnt(12)
	; GFX10-NEXT: v_lshl_or_b32 v2, v4, 8, v3			; GFX10-NEXT: v_lshl_or_b32 v3, v4, 8, v3
	; GFX10-NEXT: s_waitcnt lgkmcnt(10)			; GFX10-NEXT: s_waitcnt lgkmcnt(10)
	; GFX10-NEXT: v_lshl_or_b32 v3, v6, 8, v5			; GFX10-NEXT: v_lshl_or_b32 v4, v6, 8, v5
	; GFX10-NEXT: s_waitcnt lgkmcnt(8)			; GFX10-NEXT: s_waitcnt lgkmcnt(8)
	; GFX10-NEXT: v_lshl_or_b32 v4, v8, 8, v7			; GFX10-NEXT: v_lshl_or_b32 v2, v8, 8, v7
	; GFX10-NEXT: s_waitcnt lgkmcnt(6)			; GFX10-NEXT: s_waitcnt lgkmcnt(6)
	; GFX10-NEXT: v_lshl_or_b32 v5, v10, 8, v9			; GFX10-NEXT: v_lshl_or_b32 v6, v10, 8, v9
	; GFX10-NEXT: s_waitcnt lgkmcnt(4)			; GFX10-NEXT: s_waitcnt lgkmcnt(4)
	; GFX10-NEXT: v_lshl_or_b32 v6, v12, 8, v11			; GFX10-NEXT: v_lshl_or_b32 v5, v12, 8, v11
	; GFX10-NEXT: s_waitcnt lgkmcnt(2)			; GFX10-NEXT: s_waitcnt lgkmcnt(2)
	; GFX10-NEXT: v_lshl_or_b32 v7, v14, 8, v13			; GFX10-NEXT: v_lshl_or_b32 v7, v14, 8, v13
	; GFX10-NEXT: s_waitcnt lgkmcnt(0)			; GFX10-NEXT: s_waitcnt lgkmcnt(0)
	; GFX10-NEXT: v_lshl_or_b32 v8, v0, 8, v15			; GFX10-NEXT: v_lshl_or_b32 v8, v0, 8, v15
	; GFX10-NEXT: v_lshl_or_b32 v0, v2, 16, v1			; GFX10-NEXT: v_perm_b32 v0, v1, v3, 0x4050001
	; GFX10-NEXT: v_lshl_or_b32 v1, v4, 16, v3			; GFX10-NEXT: v_perm_b32 v1, v4, v2, 0x4050001
	; GFX10-NEXT: v_lshl_or_b32 v2, v6, 16, v5			; GFX10-NEXT: v_perm_b32 v2, v6, v5, 0x4050001
	; GFX10-NEXT: v_lshl_or_b32 v3, v8, 16, v7			; GFX10-NEXT: v_perm_b32 v3, v8, v7, 0x4050001
	; GFX10-NEXT: s_setpc_b64 s[30:31]			; GFX10-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX11-LABEL: load_lds_v4i32_align1:			; GFX11-LABEL: load_lds_v4i32_align1:
	; GFX11: ; %bb.0:			; GFX11: ; %bb.0:
	; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX11-NEXT: s_waitcnt_vscnt null, 0x0			; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX11-NEXT: ds_load_u8 v1, v0			; GFX11-NEXT: ds_load_u8 v1, v0
	; GFX11-NEXT: ds_load_u8 v2, v0 offset:1			; GFX11-NEXT: ds_load_u8 v2, v0 offset:1
	; GFX11-NEXT: ds_load_u8 v3, v0 offset:2			; GFX11-NEXT: ds_load_u8 v3, v0 offset:2
	; GFX11-NEXT: ds_load_u8 v4, v0 offset:3			; GFX11-NEXT: ds_load_u8 v4, v0 offset:3
	; GFX11-NEXT: ds_load_u8 v5, v0 offset:4			; GFX11-NEXT: ds_load_u8 v5, v0 offset:4
	; GFX11-NEXT: ds_load_u8 v6, v0 offset:5			; GFX11-NEXT: ds_load_u8 v6, v0 offset:5
	; GFX11-NEXT: ds_load_u8 v7, v0 offset:6			; GFX11-NEXT: ds_load_u8 v7, v0 offset:6
	; GFX11-NEXT: ds_load_u8 v8, v0 offset:7			; GFX11-NEXT: ds_load_u8 v8, v0 offset:7
	; GFX11-NEXT: ds_load_u8 v9, v0 offset:8			; GFX11-NEXT: ds_load_u8 v9, v0 offset:8
	; GFX11-NEXT: ds_load_u8 v10, v0 offset:9			; GFX11-NEXT: ds_load_u8 v10, v0 offset:9
	; GFX11-NEXT: ds_load_u8 v11, v0 offset:10			; GFX11-NEXT: ds_load_u8 v11, v0 offset:10
	; GFX11-NEXT: ds_load_u8 v12, v0 offset:11			; GFX11-NEXT: ds_load_u8 v12, v0 offset:11
	; GFX11-NEXT: ds_load_u8 v13, v0 offset:12			; GFX11-NEXT: ds_load_u8 v13, v0 offset:14
	; GFX11-NEXT: ds_load_u8 v14, v0 offset:13			; GFX11-NEXT: ds_load_u8 v14, v0 offset:15
	; GFX11-NEXT: ds_load_u8 v15, v0 offset:14			; GFX11-NEXT: ds_load_u8 v15, v0 offset:12
	; GFX11-NEXT: ds_load_u8 v0, v0 offset:15			; GFX11-NEXT: ds_load_u8 v0, v0 offset:13
	; GFX11-NEXT: s_waitcnt lgkmcnt(14)			; GFX11-NEXT: s_waitcnt lgkmcnt(14)
	; GFX11-NEXT: v_lshl_or_b32 v1, v2, 8, v1			; GFX11-NEXT: v_lshl_or_b32 v1, v2, 8, v1
	; GFX11-NEXT: s_waitcnt lgkmcnt(12)			; GFX11-NEXT: s_waitcnt lgkmcnt(12)
	; GFX11-NEXT: v_lshl_or_b32 v2, v4, 8, v3			; GFX11-NEXT: v_lshl_or_b32 v3, v4, 8, v3
	; GFX11-NEXT: s_waitcnt lgkmcnt(10)			; GFX11-NEXT: s_waitcnt lgkmcnt(10)
	; GFX11-NEXT: v_lshl_or_b32 v3, v6, 8, v5			; GFX11-NEXT: v_lshl_or_b32 v4, v6, 8, v5
	; GFX11-NEXT: s_waitcnt lgkmcnt(8)			; GFX11-NEXT: s_waitcnt lgkmcnt(8)
	; GFX11-NEXT: v_lshl_or_b32 v4, v8, 8, v7			; GFX11-NEXT: v_lshl_or_b32 v2, v8, 8, v7
	; GFX11-NEXT: s_waitcnt lgkmcnt(6)			; GFX11-NEXT: s_waitcnt lgkmcnt(6)
	; GFX11-NEXT: v_lshl_or_b32 v5, v10, 8, v9			; GFX11-NEXT: v_lshl_or_b32 v6, v10, 8, v9
	; GFX11-NEXT: s_waitcnt lgkmcnt(4)			; GFX11-NEXT: s_waitcnt lgkmcnt(4)
	; GFX11-NEXT: v_lshl_or_b32 v6, v12, 8, v11			; GFX11-NEXT: v_lshl_or_b32 v5, v12, 8, v11
	; GFX11-NEXT: s_waitcnt lgkmcnt(2)			; GFX11-NEXT: s_waitcnt lgkmcnt(2)
	; GFX11-NEXT: v_lshl_or_b32 v7, v14, 8, v13			; GFX11-NEXT: v_lshl_or_b32 v7, v14, 8, v13
	; GFX11-NEXT: s_waitcnt lgkmcnt(0)			; GFX11-NEXT: s_waitcnt lgkmcnt(0)
	; GFX11-NEXT: v_lshl_or_b32 v8, v0, 8, v15			; GFX11-NEXT: v_lshl_or_b32 v8, v0, 8, v15
	; GFX11-NEXT: v_lshl_or_b32 v0, v2, 16, v1			; GFX11-NEXT: v_perm_b32 v0, v1, v3, 0x4050001
	; GFX11-NEXT: v_lshl_or_b32 v1, v4, 16, v3			; GFX11-NEXT: v_perm_b32 v1, v4, v2, 0x4050001
	; GFX11-NEXT: v_lshl_or_b32 v2, v6, 16, v5			; GFX11-NEXT: v_perm_b32 v2, v6, v5, 0x4050001
	; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_4)			; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_4)
	; GFX11-NEXT: v_lshl_or_b32 v3, v8, 16, v7			; GFX11-NEXT: v_perm_b32 v3, v8, v7, 0x4050001
	; GFX11-NEXT: s_setpc_b64 s[30:31]			; GFX11-NEXT: s_setpc_b64 s[30:31]
	%load = load <4 x i32>, <4 x i32> addrspace(3)* %ptr, align 1			%load = load <4 x i32>, <4 x i32> addrspace(3)* %ptr, align 1
	ret <4 x i32> %load			ret <4 x i32> %load
	}			}

	define <4 x i32> @load_lds_v4i32_align2(<4 x i32> addrspace(3)* %ptr) {			define <4 x i32> @load_lds_v4i32_align2(<4 x i32> addrspace(3)* %ptr) {
	; GFX9-LABEL: load_lds_v4i32_align2:			; GFX9-LABEL: load_lds_v4i32_align2:
	; GFX9: ; %bb.0:			; GFX9: ; %bb.0:
	; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX9-NEXT: ds_read_u16 v1, v0			; GFX9-NEXT: ds_read_u16 v1, v0
	; GFX9-NEXT: ds_read_u16 v2, v0 offset:2			; GFX9-NEXT: ds_read_u16 v2, v0 offset:2
	; GFX9-NEXT: ds_read_u16 v3, v0 offset:4			; GFX9-NEXT: ds_read_u16 v3, v0 offset:4
	; GFX9-NEXT: ds_read_u16 v4, v0 offset:6			; GFX9-NEXT: ds_read_u16 v4, v0 offset:6
	; GFX9-NEXT: ds_read_u16 v5, v0 offset:8			; GFX9-NEXT: ds_read_u16 v5, v0 offset:8
	; GFX9-NEXT: ds_read_u16 v6, v0 offset:10			; GFX9-NEXT: ds_read_u16 v6, v0 offset:10
	; GFX9-NEXT: ds_read_u16 v7, v0 offset:12			; GFX9-NEXT: ds_read_u16 v7, v0 offset:12
	; GFX9-NEXT: ds_read_u16 v8, v0 offset:14			; GFX9-NEXT: ds_read_u16 v8, v0 offset:14
				; GFX9-NEXT: s_mov_b32 s4, 0x4050001
	; GFX9-NEXT: s_waitcnt lgkmcnt(6)			; GFX9-NEXT: s_waitcnt lgkmcnt(6)
	; GFX9-NEXT: v_lshl_or_b32 v0, v2, 16, v1			; GFX9-NEXT: v_perm_b32 v0, v1, v2, s4
	; GFX9-NEXT: s_waitcnt lgkmcnt(4)			; GFX9-NEXT: s_waitcnt lgkmcnt(4)
	; GFX9-NEXT: v_lshl_or_b32 v1, v4, 16, v3			; GFX9-NEXT: v_perm_b32 v1, v3, v4, s4
	; GFX9-NEXT: s_waitcnt lgkmcnt(2)			; GFX9-NEXT: s_waitcnt lgkmcnt(2)
	; GFX9-NEXT: v_lshl_or_b32 v2, v6, 16, v5			; GFX9-NEXT: v_perm_b32 v2, v5, v6, s4
	; GFX9-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-NEXT: v_lshl_or_b32 v3, v8, 16, v7			; GFX9-NEXT: v_perm_b32 v3, v7, v8, s4
	; GFX9-NEXT: s_setpc_b64 s[30:31]			; GFX9-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX7-LABEL: load_lds_v4i32_align2:			; GFX7-LABEL: load_lds_v4i32_align2:
	; GFX7: ; %bb.0:			; GFX7: ; %bb.0:
	; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX7-NEXT: s_mov_b32 m0, -1			; GFX7-NEXT: s_mov_b32 m0, -1
	; GFX7-NEXT: ds_read_u16 v3, v0 offset:12			; GFX7-NEXT: ds_read_u16 v3, v0 offset:12
	; GFX7-NEXT: ds_read_u16 v2, v0 offset:8			; GFX7-NEXT: ds_read_u16 v2, v0 offset:8
	[62 lines not shown]
	; GFX10-NEXT: ds_read_u16 v2, v0 offset:2			; GFX10-NEXT: ds_read_u16 v2, v0 offset:2
	; GFX10-NEXT: ds_read_u16 v3, v0 offset:4			; GFX10-NEXT: ds_read_u16 v3, v0 offset:4
	; GFX10-NEXT: ds_read_u16 v4, v0 offset:6			; GFX10-NEXT: ds_read_u16 v4, v0 offset:6
	; GFX10-NEXT: ds_read_u16 v5, v0 offset:8			; GFX10-NEXT: ds_read_u16 v5, v0 offset:8
	; GFX10-NEXT: ds_read_u16 v6, v0 offset:10			; GFX10-NEXT: ds_read_u16 v6, v0 offset:10
	; GFX10-NEXT: ds_read_u16 v7, v0 offset:12			; GFX10-NEXT: ds_read_u16 v7, v0 offset:12
	; GFX10-NEXT: ds_read_u16 v8, v0 offset:14			; GFX10-NEXT: ds_read_u16 v8, v0 offset:14
	; GFX10-NEXT: s_waitcnt lgkmcnt(6)			; GFX10-NEXT: s_waitcnt lgkmcnt(6)
	; GFX10-NEXT: v_lshl_or_b32 v0, v2, 16, v1			; GFX10-NEXT: v_perm_b32 v0, v1, v2, 0x4050001
	; GFX10-NEXT: s_waitcnt lgkmcnt(4)			; GFX10-NEXT: s_waitcnt lgkmcnt(4)
	; GFX10-NEXT: v_lshl_or_b32 v1, v4, 16, v3			; GFX10-NEXT: v_perm_b32 v1, v3, v4, 0x4050001
	; GFX10-NEXT: s_waitcnt lgkmcnt(2)			; GFX10-NEXT: s_waitcnt lgkmcnt(2)
	; GFX10-NEXT: v_lshl_or_b32 v2, v6, 16, v5			; GFX10-NEXT: v_perm_b32 v2, v5, v6, 0x4050001
	; GFX10-NEXT: s_waitcnt lgkmcnt(0)			; GFX10-NEXT: s_waitcnt lgkmcnt(0)
	; GFX10-NEXT: v_lshl_or_b32 v3, v8, 16, v7			; GFX10-NEXT: v_perm_b32 v3, v7, v8, 0x4050001
	; GFX10-NEXT: s_setpc_b64 s[30:31]			; GFX10-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX11-LABEL: load_lds_v4i32_align2:			; GFX11-LABEL: load_lds_v4i32_align2:
	; GFX11: ; %bb.0:			; GFX11: ; %bb.0:
	; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX11-NEXT: s_waitcnt_vscnt null, 0x0			; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX11-NEXT: ds_load_u16 v1, v0			; GFX11-NEXT: ds_load_u16 v1, v0
	; GFX11-NEXT: ds_load_u16 v2, v0 offset:2			; GFX11-NEXT: ds_load_u16 v2, v0 offset:2
	; GFX11-NEXT: ds_load_u16 v3, v0 offset:4			; GFX11-NEXT: ds_load_u16 v3, v0 offset:4
	; GFX11-NEXT: ds_load_u16 v4, v0 offset:6			; GFX11-NEXT: ds_load_u16 v4, v0 offset:6
	; GFX11-NEXT: ds_load_u16 v5, v0 offset:8			; GFX11-NEXT: ds_load_u16 v5, v0 offset:8
	; GFX11-NEXT: ds_load_u16 v6, v0 offset:10			; GFX11-NEXT: ds_load_u16 v6, v0 offset:10
	; GFX11-NEXT: ds_load_u16 v7, v0 offset:12			; GFX11-NEXT: ds_load_u16 v7, v0 offset:12
	; GFX11-NEXT: ds_load_u16 v8, v0 offset:14			; GFX11-NEXT: ds_load_u16 v8, v0 offset:14
	; GFX11-NEXT: s_waitcnt lgkmcnt(6)			; GFX11-NEXT: s_waitcnt lgkmcnt(6)
	; GFX11-NEXT: v_lshl_or_b32 v0, v2, 16, v1			; GFX11-NEXT: v_perm_b32 v0, v1, v2, 0x4050001
	; GFX11-NEXT: s_waitcnt lgkmcnt(4)			; GFX11-NEXT: s_waitcnt lgkmcnt(4)
	; GFX11-NEXT: v_lshl_or_b32 v1, v4, 16, v3			; GFX11-NEXT: v_perm_b32 v1, v3, v4, 0x4050001
	; GFX11-NEXT: s_waitcnt lgkmcnt(2)			; GFX11-NEXT: s_waitcnt lgkmcnt(2)
	; GFX11-NEXT: v_lshl_or_b32 v2, v6, 16, v5			; GFX11-NEXT: v_perm_b32 v2, v5, v6, 0x4050001
	; GFX11-NEXT: s_waitcnt lgkmcnt(0)			; GFX11-NEXT: s_waitcnt lgkmcnt(0)
	; GFX11-NEXT: v_lshl_or_b32 v3, v8, 16, v7			; GFX11-NEXT: v_perm_b32 v3, v7, v8, 0x4050001
	; GFX11-NEXT: s_setpc_b64 s[30:31]			; GFX11-NEXT: s_setpc_b64 s[30:31]
	%load = load <4 x i32>, <4 x i32> addrspace(3)* %ptr, align 2			%load = load <4 x i32>, <4 x i32> addrspace(3)* %ptr, align 2
	ret <4 x i32> %load			ret <4 x i32> %load
	}			}
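
The align2 CHECK lines above replace each `v_lshl_or_b32 vD, vHi, 16, vLo` with a single `v_perm_b32`. The following is a minimal sketch of why that is possible: OR-ing a zero-extended 16-bit value with another one shifted left by 16 only moves whole bytes, so it can be written as a byte selection over the two sources. The helper names (`selectBytes`, `packWithShiftOr`, `packWithSelect`) and the selector layout are hypothetical and chosen for readability; they are not the hardware encoding of the `0x4050001` constant in the test output and are not taken from the patch.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical model of a byte select (NOT the v_perm_b32 encoding):
// the two 32-bit sources form an 8-byte pool, with loWord supplying pool
// bytes 0..3 and hiWord supplying pool bytes 4..7. Result byte i is pool
// byte sel[i].
inline uint32_t selectBytes(uint32_t hiWord, uint32_t loWord,
                            const std::array<unsigned, 4> &sel) {
  const uint64_t pool = (static_cast<uint64_t>(hiWord) << 32) | loWord;
  uint32_t out = 0;
  for (unsigned i = 0; i < 4; ++i) {
    const uint32_t byte = static_cast<uint32_t>((pool >> (8 * sel[i])) & 0xFF);
    out |= byte << (8 * i);
  }
  return out;
}

// Shift-and-or form, as in the old CHECK lines: (hi16 << 16) | lo16.
// Both inputs are assumed to be zero-extended 16-bit loads.
inline uint32_t packWithShiftOr(uint32_t lo16, uint32_t hi16) {
  return (hi16 << 16) | lo16;
}

// The same value expressed as one byte selection:
// result bytes 0,1 come from lo16 (pool bytes 0,1),
// result bytes 2,3 come from hi16 (pool bytes 4,5).
inline uint32_t packWithSelect(uint32_t lo16, uint32_t hi16) {
  return selectBytes(hi16, lo16, {0, 1, 4, 5});
}

// Small self-check over a few zero-extended 16-bit inputs.
inline void checkPackEquivalence() {
  const uint32_t samples[] = {0x0000u, 0x0001u, 0x1234u, 0xABCDu, 0xFFFFu};
  for (uint32_t lo : samples)
    for (uint32_t hi : samples)
      assert(packWithShiftOr(lo, hi) == packWithSelect(lo, hi));
}
```

The two forms agree only because the low operand has no bits set above bit 15, which the `ds_read_u16` loads guarantee. With arbitrary 32-bit inputs the OR would mix bits within a byte and no pure byte selection could reproduce it.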

	define <4 x i32> @load_lds_v4i32_align4(<4 x i32> addrspace(3)* %ptr) {			define <4 x i32> @load_lds_v4i32_align4(<4 x i32> addrspace(3)* %ptr) {
	; GFX9-LABEL: load_lds_v4i32_align4:			; GFX9-LABEL: load_lds_v4i32_align4:
	; GFX9: ; %bb.0:			; GFX9: ; %bb.0:
	[145 lines not shown]

llvm/test/CodeGen/AMDGPU/load-local.96.ll

	[61 lines not shown]
	; GFX9-NEXT: ds_read_u8 v5, v0 offset:4			; GFX9-NEXT: ds_read_u8 v5, v0 offset:4
	; GFX9-NEXT: ds_read_u8 v6, v0 offset:5			; GFX9-NEXT: ds_read_u8 v6, v0 offset:5
	; GFX9-NEXT: ds_read_u8 v7, v0 offset:6			; GFX9-NEXT: ds_read_u8 v7, v0 offset:6
	; GFX9-NEXT: ds_read_u8 v8, v0 offset:7			; GFX9-NEXT: ds_read_u8 v8, v0 offset:7
	; GFX9-NEXT: ds_read_u8 v9, v0 offset:8			; GFX9-NEXT: ds_read_u8 v9, v0 offset:8
	; GFX9-NEXT: ds_read_u8 v10, v0 offset:9			; GFX9-NEXT: ds_read_u8 v10, v0 offset:9
	; GFX9-NEXT: ds_read_u8 v11, v0 offset:10			; GFX9-NEXT: ds_read_u8 v11, v0 offset:10
	; GFX9-NEXT: ds_read_u8 v12, v0 offset:11			; GFX9-NEXT: ds_read_u8 v12, v0 offset:11
	; GFX9-NEXT: s_waitcnt lgkmcnt(10)
	; GFX9-NEXT: v_lshl_or_b32 v0, v2, 8, v1
	; GFX9-NEXT: s_waitcnt lgkmcnt(8)			; GFX9-NEXT: s_waitcnt lgkmcnt(8)
	; GFX9-NEXT: v_lshl_or_b32 v1, v4, 8, v3			; GFX9-NEXT: v_lshl_or_b32 v0, v4, 8, v3
	; GFX9-NEXT: v_lshl_or_b32 v0, v1, 16, v0			; GFX9-NEXT: v_lshl_or_b32 v1, v2, 8, v1
	; GFX9-NEXT: s_waitcnt lgkmcnt(6)			; GFX9-NEXT: s_mov_b32 s4, 0x4050001
	; GFX9-NEXT: v_lshl_or_b32 v1, v6, 8, v5			; GFX9-NEXT: v_perm_b32 v0, v1, v0, s4
	; GFX9-NEXT: s_waitcnt lgkmcnt(4)			; GFX9-NEXT: s_waitcnt lgkmcnt(4)
	; GFX9-NEXT: v_lshl_or_b32 v2, v8, 8, v7			; GFX9-NEXT: v_lshl_or_b32 v1, v8, 8, v7
	; GFX9-NEXT: v_lshl_or_b32 v1, v2, 16, v1			; GFX9-NEXT: v_lshl_or_b32 v2, v6, 8, v5
	; GFX9-NEXT: s_waitcnt lgkmcnt(2)			; GFX9-NEXT: v_perm_b32 v1, v2, v1, s4
	; GFX9-NEXT: v_lshl_or_b32 v2, v10, 8, v9
	; GFX9-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-NEXT: v_lshl_or_b32 v3, v12, 8, v11			; GFX9-NEXT: v_lshl_or_b32 v2, v12, 8, v11
	; GFX9-NEXT: v_lshl_or_b32 v2, v3, 16, v2			; GFX9-NEXT: v_lshl_or_b32 v3, v10, 8, v9
				; GFX9-NEXT: v_perm_b32 v2, v3, v2, s4
	; GFX9-NEXT: s_setpc_b64 s[30:31]			; GFX9-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX7-LABEL: load_lds_v3i32_align1:			; GFX7-LABEL: load_lds_v3i32_align1:
	; GFX7: ; %bb.0:			; GFX7: ; %bb.0:
	; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX7-NEXT: s_mov_b32 m0, -1			; GFX7-NEXT: s_mov_b32 m0, -1
	; GFX7-NEXT: ds_read_u8 v1, v0 offset:6			; GFX7-NEXT: ds_read_u8 v1, v0 offset:6
	; GFX7-NEXT: ds_read_u8 v2, v0 offset:4			; GFX7-NEXT: ds_read_u8 v2, v0 offset:4
	[100 lines not shown]
	; GFX10-NEXT: ds_read_u8 v1, v0			; GFX10-NEXT: ds_read_u8 v1, v0
	; GFX10-NEXT: ds_read_u8 v2, v0 offset:1			; GFX10-NEXT: ds_read_u8 v2, v0 offset:1
	; GFX10-NEXT: ds_read_u8 v3, v0 offset:2			; GFX10-NEXT: ds_read_u8 v3, v0 offset:2
	; GFX10-NEXT: ds_read_u8 v4, v0 offset:3			; GFX10-NEXT: ds_read_u8 v4, v0 offset:3
	; GFX10-NEXT: ds_read_u8 v5, v0 offset:4			; GFX10-NEXT: ds_read_u8 v5, v0 offset:4
	; GFX10-NEXT: ds_read_u8 v6, v0 offset:5			; GFX10-NEXT: ds_read_u8 v6, v0 offset:5
	; GFX10-NEXT: ds_read_u8 v7, v0 offset:6			; GFX10-NEXT: ds_read_u8 v7, v0 offset:6
	; GFX10-NEXT: ds_read_u8 v8, v0 offset:7			; GFX10-NEXT: ds_read_u8 v8, v0 offset:7
	; GFX10-NEXT: ds_read_u8 v9, v0 offset:8			; GFX10-NEXT: ds_read_u8 v9, v0 offset:10
	; GFX10-NEXT: ds_read_u8 v10, v0 offset:9			; GFX10-NEXT: ds_read_u8 v10, v0 offset:11
	; GFX10-NEXT: ds_read_u8 v11, v0 offset:10			; GFX10-NEXT: ds_read_u8 v11, v0 offset:8
	; GFX10-NEXT: ds_read_u8 v0, v0 offset:11			; GFX10-NEXT: ds_read_u8 v0, v0 offset:9
	; GFX10-NEXT: s_waitcnt lgkmcnt(10)			; GFX10-NEXT: s_waitcnt lgkmcnt(10)
	; GFX10-NEXT: v_lshl_or_b32 v1, v2, 8, v1			; GFX10-NEXT: v_lshl_or_b32 v1, v2, 8, v1
	; GFX10-NEXT: s_waitcnt lgkmcnt(8)			; GFX10-NEXT: s_waitcnt lgkmcnt(8)
	; GFX10-NEXT: v_lshl_or_b32 v2, v4, 8, v3			; GFX10-NEXT: v_lshl_or_b32 v3, v4, 8, v3
	; GFX10-NEXT: s_waitcnt lgkmcnt(6)			; GFX10-NEXT: s_waitcnt lgkmcnt(6)
	; GFX10-NEXT: v_lshl_or_b32 v3, v6, 8, v5			; GFX10-NEXT: v_lshl_or_b32 v4, v6, 8, v5
	; GFX10-NEXT: s_waitcnt lgkmcnt(4)			; GFX10-NEXT: s_waitcnt lgkmcnt(4)
	; GFX10-NEXT: v_lshl_or_b32 v4, v8, 8, v7			; GFX10-NEXT: v_lshl_or_b32 v2, v8, 8, v7
	; GFX10-NEXT: s_waitcnt lgkmcnt(2)			; GFX10-NEXT: s_waitcnt lgkmcnt(2)
	; GFX10-NEXT: v_lshl_or_b32 v5, v10, 8, v9			; GFX10-NEXT: v_lshl_or_b32 v5, v10, 8, v9
	; GFX10-NEXT: s_waitcnt lgkmcnt(0)			; GFX10-NEXT: s_waitcnt lgkmcnt(0)
	; GFX10-NEXT: v_lshl_or_b32 v6, v0, 8, v11			; GFX10-NEXT: v_lshl_or_b32 v6, v0, 8, v11
	; GFX10-NEXT: v_lshl_or_b32 v0, v2, 16, v1			; GFX10-NEXT: v_perm_b32 v0, v1, v3, 0x4050001
	; GFX10-NEXT: v_lshl_or_b32 v1, v4, 16, v3			; GFX10-NEXT: v_perm_b32 v1, v4, v2, 0x4050001
	; GFX10-NEXT: v_lshl_or_b32 v2, v6, 16, v5			; GFX10-NEXT: v_perm_b32 v2, v6, v5, 0x4050001
	; GFX10-NEXT: s_setpc_b64 s[30:31]			; GFX10-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX11-LABEL: load_lds_v3i32_align1:			; GFX11-LABEL: load_lds_v3i32_align1:
	; GFX11: ; %bb.0:			; GFX11: ; %bb.0:
	; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX11-NEXT: s_waitcnt_vscnt null, 0x0			; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX11-NEXT: ds_load_u8 v1, v0			; GFX11-NEXT: ds_load_u8 v1, v0
	; GFX11-NEXT: ds_load_u8 v2, v0 offset:1			; GFX11-NEXT: ds_load_u8 v2, v0 offset:1
	; GFX11-NEXT: ds_load_u8 v3, v0 offset:2			; GFX11-NEXT: ds_load_u8 v3, v0 offset:2
	; GFX11-NEXT: ds_load_u8 v4, v0 offset:3			; GFX11-NEXT: ds_load_u8 v4, v0 offset:3
	; GFX11-NEXT: ds_load_u8 v5, v0 offset:4			; GFX11-NEXT: ds_load_u8 v5, v0 offset:4
	; GFX11-NEXT: ds_load_u8 v6, v0 offset:5			; GFX11-NEXT: ds_load_u8 v6, v0 offset:5
	; GFX11-NEXT: ds_load_u8 v7, v0 offset:6			; GFX11-NEXT: ds_load_u8 v7, v0 offset:6
	; GFX11-NEXT: ds_load_u8 v8, v0 offset:7			; GFX11-NEXT: ds_load_u8 v8, v0 offset:7
	; GFX11-NEXT: ds_load_u8 v9, v0 offset:8			; GFX11-NEXT: ds_load_u8 v9, v0 offset:10
	; GFX11-NEXT: ds_load_u8 v10, v0 offset:9			; GFX11-NEXT: ds_load_u8 v10, v0 offset:11
	; GFX11-NEXT: ds_load_u8 v11, v0 offset:10			; GFX11-NEXT: ds_load_u8 v11, v0 offset:8
	; GFX11-NEXT: ds_load_u8 v0, v0 offset:11			; GFX11-NEXT: ds_load_u8 v0, v0 offset:9
	; GFX11-NEXT: s_waitcnt lgkmcnt(10)			; GFX11-NEXT: s_waitcnt lgkmcnt(10)
	; GFX11-NEXT: v_lshl_or_b32 v1, v2, 8, v1			; GFX11-NEXT: v_lshl_or_b32 v1, v2, 8, v1
	; GFX11-NEXT: s_waitcnt lgkmcnt(8)			; GFX11-NEXT: s_waitcnt lgkmcnt(8)
	; GFX11-NEXT: v_lshl_or_b32 v2, v4, 8, v3			; GFX11-NEXT: v_lshl_or_b32 v3, v4, 8, v3
	; GFX11-NEXT: s_waitcnt lgkmcnt(6)			; GFX11-NEXT: s_waitcnt lgkmcnt(6)
	; GFX11-NEXT: v_lshl_or_b32 v3, v6, 8, v5			; GFX11-NEXT: v_lshl_or_b32 v4, v6, 8, v5
	; GFX11-NEXT: s_waitcnt lgkmcnt(4)			; GFX11-NEXT: s_waitcnt lgkmcnt(4)
	; GFX11-NEXT: v_lshl_or_b32 v4, v8, 8, v7			; GFX11-NEXT: v_lshl_or_b32 v2, v8, 8, v7
	; GFX11-NEXT: s_waitcnt lgkmcnt(2)			; GFX11-NEXT: s_waitcnt lgkmcnt(2)
	; GFX11-NEXT: v_lshl_or_b32 v5, v10, 8, v9			; GFX11-NEXT: v_lshl_or_b32 v5, v10, 8, v9
	; GFX11-NEXT: s_waitcnt lgkmcnt(0)			; GFX11-NEXT: s_waitcnt lgkmcnt(0)
	; GFX11-NEXT: v_lshl_or_b32 v6, v0, 8, v11			; GFX11-NEXT: v_lshl_or_b32 v6, v0, 8, v11
	; GFX11-NEXT: v_lshl_or_b32 v0, v2, 16, v1			; GFX11-NEXT: v_perm_b32 v0, v1, v3, 0x4050001
	; GFX11-NEXT: v_lshl_or_b32 v1, v4, 16, v3			; GFX11-NEXT: v_perm_b32 v1, v4, v2, 0x4050001
	; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_3)			; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_3)
	; GFX11-NEXT: v_lshl_or_b32 v2, v6, 16, v5			; GFX11-NEXT: v_perm_b32 v2, v6, v5, 0x4050001
	; GFX11-NEXT: s_setpc_b64 s[30:31]			; GFX11-NEXT: s_setpc_b64 s[30:31]
	%load = load <3 x i32>, <3 x i32> addrspace(3)* %ptr, align 1			%load = load <3 x i32>, <3 x i32> addrspace(3)* %ptr, align 1
	ret <3 x i32> %load			ret <3 x i32> %load
	}			}

	define <3 x i32> @load_lds_v3i32_align2(<3 x i32> addrspace(3)* %ptr) {			define <3 x i32> @load_lds_v3i32_align2(<3 x i32> addrspace(3)* %ptr) {
	; GFX9-LABEL: load_lds_v3i32_align2:			; GFX9-LABEL: load_lds_v3i32_align2:
	; GFX9: ; %bb.0:			; GFX9: ; %bb.0:
	; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX9-NEXT: ds_read_u16 v1, v0			; GFX9-NEXT: ds_read_u16 v1, v0
	; GFX9-NEXT: ds_read_u16 v2, v0 offset:2			; GFX9-NEXT: ds_read_u16 v2, v0 offset:2
	; GFX9-NEXT: ds_read_u16 v3, v0 offset:4			; GFX9-NEXT: ds_read_u16 v3, v0 offset:4
	; GFX9-NEXT: ds_read_u16 v4, v0 offset:6			; GFX9-NEXT: ds_read_u16 v4, v0 offset:6
	; GFX9-NEXT: ds_read_u16 v5, v0 offset:8			; GFX9-NEXT: ds_read_u16 v5, v0 offset:8
	; GFX9-NEXT: ds_read_u16 v6, v0 offset:10			; GFX9-NEXT: ds_read_u16 v6, v0 offset:10
				; GFX9-NEXT: s_mov_b32 s4, 0x4050001
	; GFX9-NEXT: s_waitcnt lgkmcnt(4)			; GFX9-NEXT: s_waitcnt lgkmcnt(4)
	; GFX9-NEXT: v_lshl_or_b32 v0, v2, 16, v1			; GFX9-NEXT: v_perm_b32 v0, v1, v2, s4
	; GFX9-NEXT: s_waitcnt lgkmcnt(2)			; GFX9-NEXT: s_waitcnt lgkmcnt(2)
	; GFX9-NEXT: v_lshl_or_b32 v1, v4, 16, v3			; GFX9-NEXT: v_perm_b32 v1, v3, v4, s4
	; GFX9-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-NEXT: v_lshl_or_b32 v2, v6, 16, v5			; GFX9-NEXT: v_perm_b32 v2, v5, v6, s4
	; GFX9-NEXT: s_setpc_b64 s[30:31]			; GFX9-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX7-LABEL: load_lds_v3i32_align2:			; GFX7-LABEL: load_lds_v3i32_align2:
	; GFX7: ; %bb.0:			; GFX7: ; %bb.0:
	; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX7-NEXT: s_mov_b32 m0, -1			; GFX7-NEXT: s_mov_b32 m0, -1
	; GFX7-NEXT: ds_read_u16 v2, v0 offset:8			; GFX7-NEXT: ds_read_u16 v2, v0 offset:8
	; GFX7-NEXT: ds_read_u16 v1, v0 offset:4			; GFX7-NEXT: ds_read_u16 v1, v0 offset:4
	[47 lines not shown]
	; GFX10-NEXT: s_waitcnt_vscnt null, 0x0			; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX10-NEXT: ds_read_u16 v1, v0			; GFX10-NEXT: ds_read_u16 v1, v0
	; GFX10-NEXT: ds_read_u16 v2, v0 offset:2			; GFX10-NEXT: ds_read_u16 v2, v0 offset:2
	; GFX10-NEXT: ds_read_u16 v3, v0 offset:4			; GFX10-NEXT: ds_read_u16 v3, v0 offset:4
	; GFX10-NEXT: ds_read_u16 v4, v0 offset:6			; GFX10-NEXT: ds_read_u16 v4, v0 offset:6
	; GFX10-NEXT: ds_read_u16 v5, v0 offset:8			; GFX10-NEXT: ds_read_u16 v5, v0 offset:8
	; GFX10-NEXT: ds_read_u16 v6, v0 offset:10			; GFX10-NEXT: ds_read_u16 v6, v0 offset:10
	; GFX10-NEXT: s_waitcnt lgkmcnt(4)			; GFX10-NEXT: s_waitcnt lgkmcnt(4)
	; GFX10-NEXT: v_lshl_or_b32 v0, v2, 16, v1			; GFX10-NEXT: v_perm_b32 v0, v1, v2, 0x4050001
	; GFX10-NEXT: s_waitcnt lgkmcnt(2)			; GFX10-NEXT: s_waitcnt lgkmcnt(2)
	; GFX10-NEXT: v_lshl_or_b32 v1, v4, 16, v3			; GFX10-NEXT: v_perm_b32 v1, v3, v4, 0x4050001
	; GFX10-NEXT: s_waitcnt lgkmcnt(0)			; GFX10-NEXT: s_waitcnt lgkmcnt(0)
	; GFX10-NEXT: v_lshl_or_b32 v2, v6, 16, v5			; GFX10-NEXT: v_perm_b32 v2, v5, v6, 0x4050001
	; GFX10-NEXT: s_setpc_b64 s[30:31]			; GFX10-NEXT: s_setpc_b64 s[30:31]
	;			;
	; GFX11-LABEL: load_lds_v3i32_align2:			; GFX11-LABEL: load_lds_v3i32_align2:
	; GFX11: ; %bb.0:			; GFX11: ; %bb.0:
	; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	; GFX11-NEXT: s_waitcnt_vscnt null, 0x0			; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
	; GFX11-NEXT: ds_load_u16 v1, v0			; GFX11-NEXT: ds_load_u16 v1, v0
	; GFX11-NEXT: ds_load_u16 v2, v0 offset:2			; GFX11-NEXT: ds_load_u16 v2, v0 offset:2
	; GFX11-NEXT: ds_load_u16 v3, v0 offset:4			; GFX11-NEXT: ds_load_u16 v3, v0 offset:4
	; GFX11-NEXT: ds_load_u16 v4, v0 offset:6			; GFX11-NEXT: ds_load_u16 v4, v0 offset:6
	; GFX11-NEXT: ds_load_u16 v5, v0 offset:8			; GFX11-NEXT: ds_load_u16 v5, v0 offset:8
	; GFX11-NEXT: ds_load_u16 v6, v0 offset:10			; GFX11-NEXT: ds_load_u16 v6, v0 offset:10
	; GFX11-NEXT: s_waitcnt lgkmcnt(4)			; GFX11-NEXT: s_waitcnt lgkmcnt(4)
	; GFX11-NEXT: v_lshl_or_b32 v0, v2, 16, v1			; GFX11-NEXT: v_perm_b32 v0, v1, v2, 0x4050001
	; GFX11-NEXT: s_waitcnt lgkmcnt(2)			; GFX11-NEXT: s_waitcnt lgkmcnt(2)
	; GFX11-NEXT: v_lshl_or_b32 v1, v4, 16, v3			; GFX11-NEXT: v_perm_b32 v1, v3, v4, 0x4050001
	; GFX11-NEXT: s_waitcnt lgkmcnt(0)			; GFX11-NEXT: s_waitcnt lgkmcnt(0)
	; GFX11-NEXT: v_lshl_or_b32 v2, v6, 16, v5			; GFX11-NEXT: v_perm_b32 v2, v5, v6, 0x4050001
	; GFX11-NEXT: s_setpc_b64 s[30:31]			; GFX11-NEXT: s_setpc_b64 s[30:31]
	%load = load <3 x i32>, <3 x i32> addrspace(3)* %ptr, align 2			%load = load <3 x i32>, <3 x i32> addrspace(3)* %ptr, align 2
	ret <3 x i32> %load			ret <3 x i32> %load
	}			}

	define <3 x i32> @load_lds_v3i32_align4(<3 x i32> addrspace(3)* %ptr) {			define <3 x i32> @load_lds_v3i32_align4(<3 x i32> addrspace(3)* %ptr) {
	; GFX9-LABEL: load_lds_v3i32_align4:			; GFX9-LABEL: load_lds_v3i32_align4:
	; GFX9: ; %bb.0:			; GFX9: ; %bb.0:
	[...]

llvm/test/CodeGen/AMDGPU/pack.v2f16.ll

	[...]
	; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; GFX8-NEXT: v_mov_b32_e32 v3, s3			; GFX8-NEXT: v_mov_b32_e32 v3, s3
	; GFX8-NEXT: v_add_u32_e32 v2, vcc, s2, v2			; GFX8-NEXT: v_add_u32_e32 v2, vcc, s2, v2
	; GFX8-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc			; GFX8-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc
	; GFX8-NEXT: flat_load_dword v0, v[0:1] glc			; GFX8-NEXT: flat_load_dword v0, v[0:1] glc
	; GFX8-NEXT: s_waitcnt vmcnt(0)			; GFX8-NEXT: s_waitcnt vmcnt(0)
	; GFX8-NEXT: flat_load_dword v1, v[2:3] glc			; GFX8-NEXT: flat_load_dword v1, v[2:3] glc
	; GFX8-NEXT: s_waitcnt vmcnt(0)			; GFX8-NEXT: s_waitcnt vmcnt(0)
	; GFX8-NEXT: v_lshlrev_b32_e32 v1, 16, v1			; GFX8-NEXT: s_mov_b32 s0, 0x4050001
	; GFX8-NEXT: v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD			; GFX8-NEXT: v_perm_b32 v0, v0, v1, s0
	; GFX8-NEXT: ;;#ASMSTART			; GFX8-NEXT: ;;#ASMSTART
	; GFX8-NEXT: ; use v0			; GFX8-NEXT: ; use v0
	; GFX8-NEXT: ;;#ASMEND			; GFX8-NEXT: ;;#ASMEND
	; GFX8-NEXT: s_endpgm			; GFX8-NEXT: s_endpgm
	;			;
	; GFX7-LABEL: v_pack_v2f16:			; GFX7-LABEL: v_pack_v2f16:
	; GFX7: ; %bb.0:			; GFX7: ; %bb.0:
	; GFX7-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0			; GFX7-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
	[...]
	; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; GFX8-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; GFX8-NEXT: v_mov_b32_e32 v3, s3			; GFX8-NEXT: v_mov_b32_e32 v3, s3
	; GFX8-NEXT: v_add_u32_e32 v2, vcc, s2, v2			; GFX8-NEXT: v_add_u32_e32 v2, vcc, s2, v2
	; GFX8-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc			; GFX8-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc
	; GFX8-NEXT: flat_load_dword v0, v[0:1] glc			; GFX8-NEXT: flat_load_dword v0, v[0:1] glc
	; GFX8-NEXT: s_waitcnt vmcnt(0)			; GFX8-NEXT: s_waitcnt vmcnt(0)
	; GFX8-NEXT: flat_load_dword v1, v[2:3] glc			; GFX8-NEXT: flat_load_dword v1, v[2:3] glc
	; GFX8-NEXT: s_waitcnt vmcnt(0)			; GFX8-NEXT: s_waitcnt vmcnt(0)
				; GFX8-NEXT: s_mov_b32 s0, 0x4050001
	; GFX8-NEXT: s_mov_b32 s3, 0x1100f000			; GFX8-NEXT: s_mov_b32 s3, 0x1100f000
	; GFX8-NEXT: s_mov_b32 s2, -1			; GFX8-NEXT: s_mov_b32 s2, -1
	; GFX8-NEXT: v_lshlrev_b32_e32 v1, 16, v1			; GFX8-NEXT: v_perm_b32 v0, v0, v1, s0
	; GFX8-NEXT: v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
	; GFX8-NEXT: v_add_u32_e32 v0, vcc, 9, v0			; GFX8-NEXT: v_add_u32_e32 v0, vcc, 9, v0
	; GFX8-NEXT: buffer_store_dword v0, off, s[0:3], 0			; GFX8-NEXT: buffer_store_dword v0, off, s[0:3], 0
	; GFX8-NEXT: s_waitcnt vmcnt(0)			; GFX8-NEXT: s_waitcnt vmcnt(0)
	; GFX8-NEXT: s_endpgm			; GFX8-NEXT: s_endpgm
	;			;
	; GFX7-LABEL: v_pack_v2f16_user:			; GFX7-LABEL: v_pack_v2f16_user:
	; GFX7: ; %bb.0:			; GFX7: ; %bb.0:
	; GFX7-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0			; GFX7-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
	[...]

llvm/test/CodeGen/AMDGPU/pack.v2i16.ll

	[...]
	; GFX803-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; GFX803-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; GFX803-NEXT: v_mov_b32_e32 v3, s3			; GFX803-NEXT: v_mov_b32_e32 v3, s3
	; GFX803-NEXT: v_add_u32_e32 v2, vcc, s2, v2			; GFX803-NEXT: v_add_u32_e32 v2, vcc, s2, v2
	; GFX803-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc			; GFX803-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc
	; GFX803-NEXT: flat_load_dword v0, v[0:1] glc			; GFX803-NEXT: flat_load_dword v0, v[0:1] glc
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: flat_load_dword v1, v[2:3] glc			; GFX803-NEXT: flat_load_dword v1, v[2:3] glc
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: v_lshlrev_b32_e32 v1, 16, v1			; GFX803-NEXT: s_mov_b32 s0, 0x4050001
	; GFX803-NEXT: v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD			; GFX803-NEXT: v_perm_b32 v0, v0, v1, s0
	; GFX803-NEXT: ;;#ASMSTART			; GFX803-NEXT: ;;#ASMSTART
	; GFX803-NEXT: ; use v0			; GFX803-NEXT: ; use v0
	; GFX803-NEXT: ;;#ASMEND			; GFX803-NEXT: ;;#ASMEND
	; GFX803-NEXT: s_endpgm			; GFX803-NEXT: s_endpgm
	;			;
	; GFX7-LABEL: v_pack_v2i16:			; GFX7-LABEL: v_pack_v2i16:
	; GFX7: ; %bb.0:			; GFX7: ; %bb.0:
	; GFX7-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0			; GFX7-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
	[...]
	; GFX803-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; GFX803-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; GFX803-NEXT: v_mov_b32_e32 v3, s3			; GFX803-NEXT: v_mov_b32_e32 v3, s3
	; GFX803-NEXT: v_add_u32_e32 v2, vcc, s2, v2			; GFX803-NEXT: v_add_u32_e32 v2, vcc, s2, v2
	; GFX803-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc			; GFX803-NEXT: v_addc_u32_e32 v3, vcc, 0, v3, vcc
	; GFX803-NEXT: flat_load_dword v0, v[0:1] glc			; GFX803-NEXT: flat_load_dword v0, v[0:1] glc
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: flat_load_dword v1, v[2:3] glc			; GFX803-NEXT: flat_load_dword v1, v[2:3] glc
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
				; GFX803-NEXT: s_mov_b32 s0, 0x4050001
	; GFX803-NEXT: s_mov_b32 s3, 0x1100f000			; GFX803-NEXT: s_mov_b32 s3, 0x1100f000
	; GFX803-NEXT: s_mov_b32 s2, -1			; GFX803-NEXT: s_mov_b32 s2, -1
	; GFX803-NEXT: v_lshlrev_b32_e32 v1, 16, v1			; GFX803-NEXT: v_perm_b32 v0, v0, v1, s0
	; GFX803-NEXT: v_or_b32_sdwa v0, v0, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
	; GFX803-NEXT: v_add_u32_e32 v0, vcc, 9, v0			; GFX803-NEXT: v_add_u32_e32 v0, vcc, 9, v0
	; GFX803-NEXT: buffer_store_dword v0, off, s[0:3], 0			; GFX803-NEXT: buffer_store_dword v0, off, s[0:3], 0
	; GFX803-NEXT: s_waitcnt vmcnt(0)			; GFX803-NEXT: s_waitcnt vmcnt(0)
	; GFX803-NEXT: s_endpgm			; GFX803-NEXT: s_endpgm
	;			;
	; GFX7-LABEL: v_pack_v2i16_user:			; GFX7-LABEL: v_pack_v2i16_user:
	; GFX7: ; %bb.0:			; GFX7: ; %bb.0:
	; GFX7-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0			; GFX7-NEXT: s_load_dwordx4 s[0:3], s[4:5], 0x0
	[...]

llvm/test/CodeGen/AMDGPU/permute.ll

	[...]
	}			}

	define amdgpu_kernel void @or_and_or(i32 addrspace(1)* nocapture %arg, i32 %arg1) {			define amdgpu_kernel void @or_and_or(i32 addrspace(1)* nocapture %arg, i32 %arg1) {
	; GCN-LABEL: or_and_or:			; GCN-LABEL: or_and_or:
	; GCN: ; %bb.0: ; %bb			; GCN: ; %bb.0: ; %bb
	; GCN-NEXT: s_load_dwordx2 s[2:3], s[0:1], 0x24			; GCN-NEXT: s_load_dwordx2 s[2:3], s[0:1], 0x24
	; GCN-NEXT: s_load_dword s0, s[0:1], 0x2c			; GCN-NEXT: s_load_dword s0, s[0:1], 0x2c
	; GCN-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GCN-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GCN-NEXT: v_mov_b32_e32 v3, 0x7020104			; GCN-NEXT: v_mov_b32_e32 v3, 0x4050607
	; GCN-NEXT: s_waitcnt lgkmcnt(0)			; GCN-NEXT: s_waitcnt lgkmcnt(0)
	; GCN-NEXT: v_mov_b32_e32 v1, s3			; GCN-NEXT: v_mov_b32_e32 v1, s3
	; GCN-NEXT: v_add_u32_e32 v0, vcc, s2, v0			; GCN-NEXT: v_add_u32_e32 v0, vcc, s2, v0
	; GCN-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; GCN-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; GCN-NEXT: flat_load_dword v2, v[0:1]			; GCN-NEXT: flat_load_dword v2, v[0:1]
				; GCN-NEXT: s_or_b32 s0, s0, 0xff0000ff
	; GCN-NEXT: s_waitcnt vmcnt(0)			; GCN-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: v_perm_b32 v2, v2, s0, v3			; GCN-NEXT: v_perm_b32 v2, v2, 0, v3
				; GCN-NEXT: v_and_b32_e32 v2, s0, v2
	; GCN-NEXT: flat_store_dword v[0:1], v2			; GCN-NEXT: flat_store_dword v[0:1], v2
	; GCN-NEXT: s_endpgm			; GCN-NEXT: s_endpgm
	bb:			bb:
	%id = tail call i32 @llvm.amdgcn.workitem.id.x()			%id = tail call i32 @llvm.amdgcn.workitem.id.x()
	%gep = getelementptr i32, i32 addrspace(1)* %arg, i32 %id			%gep = getelementptr i32, i32 addrspace(1)* %arg, i32 %id
	%tmp = load i32, i32 addrspace(1)* %gep, align 4			%tmp = load i32, i32 addrspace(1)* %gep, align 4
	%or1 = or i32 %tmp, 16776960 ; 0x00ffff00			%or1 = or i32 %tmp, 16776960 ; 0x00ffff00
	%or2 = or i32 %arg1, 4278190335 ; 0xff0000ff			%or2 = or i32 %arg1, 4278190335 ; 0xff0000ff
	%and = and i32 %or1, %or2			%and = and i32 %or1, %or2
	store i32 %and, i32 addrspace(1)* %gep, align 4			store i32 %and, i32 addrspace(1)* %gep, align 4
	ret void			ret void
	}			}

	; FIXME here should have been "v_perm_b32" with 0xffff0500 mask.			; FIXME here should have been "v_perm_b32" with 0xffff0500 mask.
	define amdgpu_kernel void @known_ffff0500(i32 addrspace(1)* nocapture %arg, i32 %arg1) {			define amdgpu_kernel void @known_ffff0500(i32 addrspace(1)* nocapture %arg, i32 %arg1) {
	; GCN-LABEL: known_ffff0500:			; GCN-LABEL: known_ffff0500:
	; GCN: ; %bb.0: ; %bb			; GCN: ; %bb.0: ; %bb
	; GCN-NEXT: s_load_dwordx2 s[2:3], s[0:1], 0x24			; GCN-NEXT: s_load_dwordx2 s[2:3], s[0:1], 0x24
	; GCN-NEXT: s_load_dword s0, s[0:1], 0x2c			; GCN-NEXT: s_load_dword s0, s[0:1], 0x2c
	; GCN-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GCN-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GCN-NEXT: v_mov_b32_e32 v5, 0xffff8004			; GCN-NEXT: v_mov_b32_e32 v5, 0xc050c07
				; GCN-NEXT: v_mov_b32_e32 v6, 0xffff8004
	; GCN-NEXT: s_waitcnt lgkmcnt(0)			; GCN-NEXT: s_waitcnt lgkmcnt(0)
	; GCN-NEXT: v_mov_b32_e32 v1, s3			; GCN-NEXT: v_mov_b32_e32 v1, s3
	; GCN-NEXT: v_add_u32_e32 v0, vcc, s2, v0			; GCN-NEXT: v_add_u32_e32 v0, vcc, s2, v0
	; GCN-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; GCN-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; GCN-NEXT: flat_load_dword v4, v[0:1]			; GCN-NEXT: flat_load_dword v4, v[0:1]
	; GCN-NEXT: s_bitset1_b32 s0, 15			; GCN-NEXT: s_bitset1_b32 s0, 15
	; GCN-NEXT: s_and_b32 s0, s0, 0xff00			; GCN-NEXT: s_and_b32 s0, s0, 0xff00
	; GCN-NEXT: s_or_b32 s0, s0, 0xffff0000			; GCN-NEXT: s_or_b32 s0, s0, 0xffff0000
	; GCN-NEXT: v_mov_b32_e32 v2, s2			; GCN-NEXT: v_mov_b32_e32 v2, s2
	; GCN-NEXT: v_mov_b32_e32 v3, s3			; GCN-NEXT: v_mov_b32_e32 v3, s3
	; GCN-NEXT: s_waitcnt vmcnt(0)			; GCN-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: v_or_b32_e32 v4, 4, v4			; GCN-NEXT: v_perm_b32 v4, v4, 0, v5
	; GCN-NEXT: v_and_b32_e32 v4, 0xff00ff, v4
	; GCN-NEXT: v_or_b32_e32 v4, s0, v4			; GCN-NEXT: v_or_b32_e32 v4, s0, v4
	; GCN-NEXT: flat_store_dword v[0:1], v4			; GCN-NEXT: flat_store_dword v[0:1], v4
	; GCN-NEXT: flat_store_dword v[2:3], v5			; GCN-NEXT: flat_store_dword v[2:3], v6
	; GCN-NEXT: s_endpgm			; GCN-NEXT: s_endpgm
	bb:			bb:
	%id = tail call i32 @llvm.amdgcn.workitem.id.x()			%id = tail call i32 @llvm.amdgcn.workitem.id.x()
	%gep = getelementptr i32, i32 addrspace(1)* %arg, i32 %id			%gep = getelementptr i32, i32 addrspace(1)* %arg, i32 %id
	%load = load i32, i32 addrspace(1)* %gep, align 4			%load = load i32, i32 addrspace(1)* %gep, align 4
	%mask1 = or i32 %arg1, 32768 ; 0x8000			%mask1 = or i32 %arg1, 32768 ; 0x8000
	%mask2 = or i32 %load, 4			%mask2 = or i32 %load, 4
	%and = and i32 %mask2, 16711935 ; 0x00ff00ff			%and = and i32 %mask2, 16711935 ; 0x00ff00ff
	[...]
	}			}

	define amdgpu_kernel void @known_ffff8004(i32 addrspace(1)* nocapture %arg, i32 %arg1) {			define amdgpu_kernel void @known_ffff8004(i32 addrspace(1)* nocapture %arg, i32 %arg1) {
	; GCN-LABEL: known_ffff8004:			; GCN-LABEL: known_ffff8004:
	; GCN: ; %bb.0: ; %bb			; GCN: ; %bb.0: ; %bb
	; GCN-NEXT: s_load_dwordx2 s[2:3], s[0:1], 0x24			; GCN-NEXT: s_load_dwordx2 s[2:3], s[0:1], 0x24
	; GCN-NEXT: s_load_dword s0, s[0:1], 0x2c			; GCN-NEXT: s_load_dword s0, s[0:1], 0x2c
	; GCN-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GCN-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GCN-NEXT: v_mov_b32_e32 v5, 0xffff0500			; GCN-NEXT: v_mov_b32_e32 v5, 0x4050607
	; GCN-NEXT: v_mov_b32_e32 v6, 0xffff8004			; GCN-NEXT: v_mov_b32_e32 v6, 0xffff0500
	; GCN-NEXT: s_waitcnt lgkmcnt(0)			; GCN-NEXT: s_waitcnt lgkmcnt(0)
	; GCN-NEXT: v_mov_b32_e32 v1, s3			; GCN-NEXT: v_mov_b32_e32 v1, s3
	; GCN-NEXT: v_add_u32_e32 v0, vcc, s2, v0			; GCN-NEXT: v_add_u32_e32 v0, vcc, s2, v0
	; GCN-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; GCN-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; GCN-NEXT: flat_load_dword v4, v[0:1]			; GCN-NEXT: flat_load_dword v4, v[0:1]
	; GCN-NEXT: s_or_b32 s0, s0, 4			; GCN-NEXT: s_or_b32 s0, s0, 4
	; GCN-NEXT: v_mov_b32_e32 v2, s2			; GCN-NEXT: v_mov_b32_e32 v2, s2
				; GCN-NEXT: v_mov_b32_e32 v7, 0xffff8004
	; GCN-NEXT: v_mov_b32_e32 v3, s3			; GCN-NEXT: v_mov_b32_e32 v3, s3
	; GCN-NEXT: s_waitcnt vmcnt(0)			; GCN-NEXT: s_waitcnt vmcnt(0)
	; GCN-NEXT: v_or_b32_e32 v4, 0x8000, v4			; GCN-NEXT: v_perm_b32 v4, v4, 0, v5
	; GCN-NEXT: v_perm_b32 v4, v4, s0, v5			; GCN-NEXT: v_perm_b32 v4, v4, s0, v6
	; GCN-NEXT: flat_store_dword v[0:1], v4			; GCN-NEXT: flat_store_dword v[0:1], v4
	; GCN-NEXT: flat_store_dword v[2:3], v6			; GCN-NEXT: flat_store_dword v[2:3], v7
	; GCN-NEXT: s_endpgm			; GCN-NEXT: s_endpgm
	bb:			bb:
	%id = tail call i32 @llvm.amdgcn.workitem.id.x()			%id = tail call i32 @llvm.amdgcn.workitem.id.x()
	%gep = getelementptr i32, i32 addrspace(1)* %arg, i32 %id			%gep = getelementptr i32, i32 addrspace(1)* %arg, i32 %id
	%load = load i32, i32 addrspace(1)* %gep, align 4			%load = load i32, i32 addrspace(1)* %gep, align 4
	%mask1 = or i32 %arg1, 4			%mask1 = or i32 %arg1, 4
	%mask2 = or i32 %load, 32768 ; 0x8000			%mask2 = or i32 %load, 32768 ; 0x8000
	%and = and i32 %mask1, 16711935 ; 0x00ff00ff			%and = and i32 %mask1, 16711935 ; 0x00ff00ff
	[...]

llvm/test/CodeGen/AMDGPU/permute_i8.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -mtriple=amdgcn-- -mcpu=gfx1010 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX10
				; RUN: llc -mtriple=amdgcn-- -mcpu=gfx908 -verify-machineinstrs < %s | FileCheck %s -check-prefixes=GFX9
				arsenm: Drop -opaque-pointers (also direction doesn't make sense with the test contents)

				define hidden void @shuffle6766(<4 x i8>* %in0, <4 x i8>* %in1, <4 x i8>* %out0) {
				; GFX10-LABEL: shuffle6766:
				; GFX10: ; %bb.0:
				arsenm: Need to use typed pointers, also should prefer global loads to flat
				arsenm: Opaque pointer tests are not additional, the tests need to be just converted. There are 0 remaining typed pointer AMDGPU tests
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: flat_load_dword v0, v[2:3]
				; GFX10-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX10-NEXT: v_perm_b32 v0, v0, 0, 0x6070606
				; GFX10-NEXT: flat_store_dword v[4:5], v0
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: shuffle6766:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: flat_load_dword v0, v[2:3]
				; GFX9-NEXT: v_mov_b32_e32 v1, 0x6070606
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_perm_b32 v0, v0, 0, v1
				; GFX9-NEXT: flat_store_dword v[4:5], v0
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				%vec0 = load <4 x i8>, <4 x i8>* %in0, align 4
				%vec1 = load <4 x i8>, <4 x i8>* %in1, align 4
				%shuffle0_0 = shufflevector <4 x i8> %vec0, <4 x i8> %vec1, <4 x i32> <i32 6, i32 7, i32 6, i32 6>
				store <4 x i8> %shuffle0_0, <4 x i8>* %out0, align 4
				ret void
				}

				define hidden void @shuffle3746(<4 x i8>* %in0, <4 x i8>* %in1, <4 x i8>* %out0) {
				; GFX10-LABEL: shuffle3746:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: flat_load_dword v6, v[0:1]
				; GFX10-NEXT: flat_load_dword v7, v[2:3]
				; GFX10-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX10-NEXT: v_perm_b32 v0, v6, v7, 0x7030000
				; GFX10-NEXT: flat_store_dword v[4:5], v0
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: shuffle3746:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: flat_load_dword v6, v[0:1]
				; GFX9-NEXT: flat_load_dword v7, v[2:3]
				; GFX9-NEXT: s_mov_b32 s4, 0x7030000
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_perm_b32 v0, v6, v7, s4
				; GFX9-NEXT: flat_store_dword v[4:5], v0
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				%vec0 = load <4 x i8>, <4 x i8>* %in0, align 4
				%vec1 = load <4 x i8>, <4 x i8>* %in1, align 4
				%shuffle0_0 = shufflevector <4 x i8> %vec0, <4 x i8> %vec1, <4 x i32> <i32 3, i32 7, i32 4, i32 4>
				store <4 x i8> %shuffle0_0, <4 x i8>* %out0, align 4
				ret void
				}

				define hidden void @shuffle4445(<4 x i8>* %in0, <4 x i8>* %in1, <4 x i8>* %out0) {
				; GFX10-LABEL: shuffle4445:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: flat_load_dword v0, v[2:3]
				; GFX10-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX10-NEXT: v_perm_b32 v0, v0, 0, 0x4040405
				; GFX10-NEXT: flat_store_dword v[4:5], v0
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: shuffle4445:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: flat_load_dword v0, v[2:3]
				; GFX9-NEXT: v_mov_b32_e32 v1, 0x4040405
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_perm_b32 v0, v0, 0, v1
				; GFX9-NEXT: flat_store_dword v[4:5], v0
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				%vec0 = load <4 x i8>, <4 x i8>* %in0, align 4
				%vec1 = load <4 x i8>, <4 x i8>* %in1, align 4
				arsenm: Test needs to use opaque pointers
				%shuffle0_0 = shufflevector <4 x i8> %vec0, <4 x i8> %vec1, <4 x i32> <i32 4, i32 4, i32 4, i32 5>
				store <4 x i8> %shuffle0_0, <4 x i8>* %out0, align 4
				ret void
				}

				define hidden void @shuffle0101(<4 x i8>* %in0, <4 x i8>* %in1, <4 x i8>* %out0) {
				; GFX10-LABEL: shuffle0101:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: flat_load_dword v0, v[0:1]
				; GFX10-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX10-NEXT: v_perm_b32 v0, v0, 0, 0x4050405
				; GFX10-NEXT: flat_store_dword v[4:5], v0
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: shuffle0101:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: flat_load_dword v0, v[0:1]
				; GFX9-NEXT: v_mov_b32_e32 v1, 0x4050405
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_perm_b32 v0, v0, 0, v1
				; GFX9-NEXT: flat_store_dword v[4:5], v0
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				%vec0 = load <4 x i8>, <4 x i8>* %in0, align 4
				%vec1 = load <4 x i8>, <4 x i8>* %in1, align 4
				%shuffle0_0 = shufflevector <4 x i8> %vec0, <4 x i8> %vec1, <4 x i32> <i32 0, i32 1, i32 0, i32 1>
				store <4 x i8> %shuffle0_0, <4 x i8>* %out0, align 4
				ret void
				}

				define hidden void @shuffle7533(<4 x i8>* %in0, <4 x i8>* %in1, <4 x i8>* %out0) {
				; GFX10-LABEL: shuffle7533:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: flat_load_dword v6, v[0:1]
				; GFX10-NEXT: flat_load_dword v7, v[2:3]
				; GFX10-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX10-NEXT: v_perm_b32 v0, v7, v6, 0x7050303
				; GFX10-NEXT: flat_store_dword v[4:5], v0
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: shuffle7533:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: flat_load_dword v6, v[0:1]
				; GFX9-NEXT: flat_load_dword v7, v[2:3]
				; GFX9-NEXT: s_mov_b32 s4, 0x7050303
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_perm_b32 v0, v7, v6, s4
				; GFX9-NEXT: flat_store_dword v[4:5], v0
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				%vec0 = load <4 x i8>, <4 x i8>* %in0, align 4
				%vec1 = load <4 x i8>, <4 x i8>* %in1, align 4
				%shuffle0_0 = shufflevector <4 x i8> %vec0, <4 x i8> %vec1, <4 x i32> <i32 7, i32 5, i32 3, i32 3>
				store <4 x i8> %shuffle0_0, <4 x i8>* %out0, align 4
				ret void
				}

				define hidden void @shuffle7767(<4 x i8>* %in0, <4 x i8>* %in1, <4 x i8>* %out0) {
				; GFX10-LABEL: shuffle7767:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: flat_load_dword v0, v[2:3]
				; GFX10-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX10-NEXT: v_perm_b32 v0, v0, 0, 0x7070607
				; GFX10-NEXT: flat_store_dword v[4:5], v0
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: shuffle7767:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: flat_load_dword v0, v[2:3]
				; GFX9-NEXT: v_mov_b32_e32 v1, 0x7070607
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_perm_b32 v0, v0, 0, v1
				; GFX9-NEXT: flat_store_dword v[4:5], v0
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				%vec0 = load <4 x i8>, <4 x i8>* %in0, align 4
				%vec1 = load <4 x i8>, <4 x i8>* %in1, align 4
				%shuffle0_0 = shufflevector <4 x i8> %vec0, <4 x i8> %vec1, <4 x i32> <i32 7, i32 7, i32 6, i32 7>
				store <4 x i8> %shuffle0_0, <4 x i8>* %out0, align 4
				ret void
				}

				define hidden void @shuffle0554(<4 x i8>* %in0, <4 x i8>* %in1, <4 x i8>* %out0) {
				; GFX10-LABEL: shuffle0554:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: flat_load_dword v6, v[0:1]
				; GFX10-NEXT: flat_load_dword v7, v[2:3]
				; GFX10-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX10-NEXT: v_perm_b32 v0, v6, v7, 0x4010100
				; GFX10-NEXT: flat_store_dword v[4:5], v0
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: shuffle0554:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: flat_load_dword v6, v[0:1]
				; GFX9-NEXT: flat_load_dword v7, v[2:3]
				; GFX9-NEXT: s_mov_b32 s4, 0x4010100
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_perm_b32 v0, v6, v7, s4
				; GFX9-NEXT: flat_store_dword v[4:5], v0
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				%vec0 = load <4 x i8>, <4 x i8>* %in0, align 4
				%vec1 = load <4 x i8>, <4 x i8>* %in1, align 4
				%shuffle0_0 = shufflevector <4 x i8> %vec0, <4 x i8> %vec1, <4 x i32> <i32 0, i32 5, i32 5, i32 4>
				store <4 x i8> %shuffle0_0, <4 x i8>* %out0, align 4
				ret void
				}

				define hidden void @shuffle2127(<4 x i8>* %in0, <4 x i8>* %in1, <4 x i8>* %out0) {
				; GFX10-LABEL: shuffle2127:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: flat_load_dword v6, v[0:1]
				; GFX10-NEXT: flat_load_dword v7, v[2:3]
				; GFX10-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX10-NEXT: v_perm_b32 v0, v6, v7, 0x6050603
				; GFX10-NEXT: flat_store_dword v[4:5], v0
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: shuffle2127:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: flat_load_dword v6, v[0:1]
				; GFX9-NEXT: flat_load_dword v7, v[2:3]
				; GFX9-NEXT: s_mov_b32 s4, 0x6050603
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_perm_b32 v0, v6, v7, s4
				; GFX9-NEXT: flat_store_dword v[4:5], v0
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				%vec0 = load <4 x i8>, <4 x i8>* %in0, align 4
				%vec1 = load <4 x i8>, <4 x i8>* %in1, align 4
				%shuffle0_0 = shufflevector <4 x i8> %vec0, <4 x i8> %vec1, <4 x i32> <i32 2, i32 1, i32 2, i32 7>
				store <4 x i8> %shuffle0_0, <4 x i8>* %out0, align 4
				ret void
				}

				define hidden void @shuffle5047(<4 x i8>* %in0, <4 x i8>* %in1, <4 x i8>* %out0) {
				; GFX10-LABEL: shuffle5047:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: flat_load_dword v6, v[0:1]
				; GFX10-NEXT: flat_load_dword v7, v[2:3]
				; GFX10-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX10-NEXT: v_perm_b32 v0, v7, v6, 0x5000407
				; GFX10-NEXT: flat_store_dword v[4:5], v0
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: shuffle5047:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: flat_load_dword v6, v[0:1]
				; GFX9-NEXT: flat_load_dword v7, v[2:3]
				; GFX9-NEXT: s_mov_b32 s4, 0x5000407
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_perm_b32 v0, v7, v6, s4
				; GFX9-NEXT: flat_store_dword v[4:5], v0
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				%vec0 = load <4 x i8>, <4 x i8>* %in0, align 4
				%vec1 = load <4 x i8>, <4 x i8>* %in1, align 4
				%shuffle0_0 = shufflevector <4 x i8> %vec0, <4 x i8> %vec1, <4 x i32> <i32 5, i32 0, i32 4, i32 7>
				store <4 x i8> %shuffle0_0, <4 x i8>* %out0, align 4
				ret void
				}

				define hidden void @shuffle3546(<4 x i8>* %in0, <4 x i8>* %in1, <4 x i8>* %out0) {
				; GFX10-LABEL: shuffle3546:
				; GFX10: ; %bb.0:
				; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: flat_load_dword v6, v[0:1]
				; GFX10-NEXT: flat_load_dword v7, v[2:3]
				; GFX10-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX10-NEXT: v_perm_b32 v0, v6, v7, 0x7010002
				; GFX10-NEXT: flat_store_dword v[4:5], v0
				; GFX10-NEXT: s_waitcnt lgkmcnt(0)
				; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
				; GFX10-NEXT: s_setpc_b64 s[30:31]
				;
				; GFX9-LABEL: shuffle3546:
				; GFX9: ; %bb.0:
				; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				; GFX9-NEXT: flat_load_dword v6, v[0:1]
				; GFX9-NEXT: flat_load_dword v7, v[2:3]
				; GFX9-NEXT: s_mov_b32 s4, 0x7010002
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: v_perm_b32 v0, v6, v7, s4
				; GFX9-NEXT: flat_store_dword v[4:5], v0
				; GFX9-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
				; GFX9-NEXT: s_setpc_b64 s[30:31]
				%vec0 = load <4 x i8>, <4 x i8>* %in0, align 4
				%vec1 = load <4 x i8>, <4 x i8>* %in1, align 4
				%shuffle0_0 = shufflevector <4 x i8> %vec0, <4 x i8> %vec1, <4 x i32> <i32 3, i32 5, i32 4, i32 6>
				store <4 x i8> %shuffle0_0, <4 x i8>* %out0, align 4
				ret void
				}
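
For readers unfamiliar with the IR side of the permute_i8.ll tests above, the following is a minimal sketch (in Python, not part of the patch) of how a shufflevector index mask picks bytes from its two <4 x i8> inputs: indices 0-3 refer to the first vector and 4-7 to the second. The function name and the sample byte values are illustrative assumptions; the sketch models only the IR semantics and says nothing about how the resulting v_perm_b32 selector is encoded.

```python
from typing import List, Sequence


def shufflevector(vec0: Sequence[int], vec1: Sequence[int],
                  mask: Sequence[int]) -> List[int]:
    """Model LLVM IR shufflevector for two equal-length input vectors.

    Index i < len(vec0) selects vec0[i]; larger indices select from vec1.
    """
    if len(vec0) != len(vec1):
        raise ValueError("input vectors must have the same length")
    combined = list(vec0) + list(vec1)
    return [combined[i] for i in mask]


# Hypothetical byte values, chosen so the source of each element is visible.
vec0 = [0xA0, 0xA1, 0xA2, 0xA3]
vec1 = [0xB0, 0xB1, 0xB2, 0xB3]

# Mask from @shuffle6766: every index is >= 4, so only vec1 is read.
print([hex(b) for b in shufflevector(vec0, vec1, [6, 7, 6, 6])])

# Mask from @shuffle0554: mixes one byte of vec0 with bytes of vec1.
print([hex(b) for b in shufflevector(vec0, vec1, [0, 5, 5, 4])])
```

A mask whose indices all fall in one half, as in @shuffle6766, reads only one input, which appears consistent with the checks for that function loading a single dword.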

Unit Tests: Failed

Diff 493458
