This is an archive of the discontinued LLVM Phabricator instance.

Differential D133584

[DAGCombiner] [AMDGPU] Allow vector loads in MatchLoadCombine
ClosedPublic

Authored by jrbyrnes on Sep 9 2022, 8:59 AM.

Details

Reviewers

kerbowa
rampitec
arsenm
bogner
RKSimon
spatel

Summary

Since SROA chooses promotion based on reaching loads / stores of allocas, we may run into scenarios in which we alloca a vector, but promote it to an integer. The result is the familiar LoadCombine pattern (i.e. ZEXT, SHL, OR). However, instead of coming directly from distinct loads, the elements to be combined come from ExtractVectorElements that stem from a shared load.

This patch identifies such a pattern and combines it into a load.
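
As a rough illustration of the pattern described above (not code from the patch), the following standalone C++ sketch builds a 32-bit value from four vector elements with the zext / shl / or sequence, and compares it with a single wide load of the same bytes. The `Vec4x8` alias and both function names are hypothetical, and the equivalence of the two functions assumes a little-endian host.

```cpp
#include <array>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for a loaded <4 x i8> vector.
using Vec4x8 = std::array<uint8_t, 4>;

// The pattern before combining: extract each element, zero-extend it,
// shift it into position, and OR the pieces together.
uint32_t combineByExtracts(const Vec4x8 &V) {
  return static_cast<uint32_t>(V[0]) |
         (static_cast<uint32_t>(V[1]) << 8) |
         (static_cast<uint32_t>(V[2]) << 16) |
         (static_cast<uint32_t>(V[3]) << 24);
}

// What the combine aims for: one wide load of the same bytes.
// This matches combineByExtracts only on a little-endian host.
uint32_t combineByWideLoad(const Vec4x8 &V) {
  uint32_t Result;
  std::memcpy(&Result, V.data(), sizeof(Result));
  return Result;
}
```

If the extract indices and shift amounts did not line up with memory order (for example, element 1 shifted by 16), the two functions would disagree, which is why the matcher has to track which byte each element provides.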

Change-Id: I0bc06588f11e88a0a975cde1fd71e9143e6c42dd

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

jrbyrnes created this revision.Sep 9 2022, 8:59 AM

Herald added a project: Restricted Project. · Sep 9 2022, 8:59 AM

Herald added subscribers: kosarev, ecnelises, kerbowa and 7 others.

jrbyrnes requested review of this revision.Sep 9 2022, 8:59 AM

Herald added a project: Restricted Project. · Sep 9 2022, 8:59 AM

Herald added subscribers: llvm-commits, wdng.

jrbyrnes added reviewers: kerbowa, rampitec, arsenm.Sep 9 2022, 9:02 AM

Harbormaster completed remote builds in B185863: Diff 459089.Sep 9 2022, 9:55 AM

arsenm added inline comments.Sep 9 2022, 10:15 AM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7941	Why implicitly cast to int to assert the value? Just use uint64_t?
7949	Early return on condition instead of wrapping ternary operator
llvm/test/CodeGen/AMDGPU/combine-vload-extract.ll
2	Don’t need -O3?

Address comments.

jrbyrnes marked 2 inline comments as done.Sep 9 2022, 3:58 PM

Harbormaster completed remote builds in B185963: Diff 459217.Sep 9 2022, 5:36 PM

Adding reviewers for increased perspective.

jmmartinez added a subscriber: jmmartinez.Sep 19 2022, 1:40 AM

jmmartinez added inline comments.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7981	Is it normal that the condition is checking if `VectorIndex` is `0`? Shouldn't it be `auto BPVectorIndex = VectorIndex.value_or(0U);`?
8295	This variable seems unused.

precommit combine-vload-extract.ll with current (trunk) codegen and rebase the patch to show the codegen diff

Both SLP and VectorCombine should try to make patterns like this better in IR, so there might be some target cost/legality checks that need adjusting.
There's also an in-progress patch for -aggressive-instcombine that could be relevant:
D127392

Would it be better to transform this before codegen?
https://alive2.llvm.org/ce/z/uyxHSW

In D133584#3799669, @spatel wrote:

Both SLP and VectorCombine should try to make patterns like this better in IR, so there might be some target cost/legality checks that need adjusting.
There's also an in-progress patch for -aggressive-instcombine that could be relevant:
D127392

Would it be better to transform this before codegen?
https://alive2.llvm.org/ce/z/uyxHSW

Hi, thanks for your comment! The reason I tagged you is because you seem to be involved in the most closely related issues to the one here (D67841, https://bugs.llvm.org/show_bug.cgi?id=42708). It seems the conclusion is to have vectorization passes (and optimization passes in general) leave LoadCombine patterns untouched, and resolve it in the backend, no? That was the logic I used for the design here.

On the other hand, it seems D127392 is using the opposite design approach. Is the current consensus to do load combining in optimizer?

At a glance, D127392 will not address the issue identified here because it does not handle vector loads.

In D133584#3800232, @jrbyrnes wrote:

In D133584#3799669, @spatel wrote:

Both SLP and VectorCombine should try to make patterns like this better in IR, so there might be some target cost/legality checks that need adjusting.
There's also an in-progress patch for -aggressive-instcombine that could be relevant:
D127392

Would it be better to transform this before codegen?
https://alive2.llvm.org/ce/z/uyxHSW

Hi, thanks for your comment! The reason I tagged you is because you seem to be involved in the most closely related issues to the one here (D67841, https://bugs.llvm.org/show_bug.cgi?id=42708). It seems the conclusion is to have vectorization passes (and optimization passes in general) leave LoadCombine patterns untouched, and resolve it in the backend, no? That was the logic I used for the design here.

On the other hand, it seems D127392 is using the opposite design approach. Is the current approach to do load combining in optimizer?

LLVM has gone back and forth on this. There was a general load combine pass for IR, but it was removed because it interfered with other transforms in IR. So we started hacking away at codegen instead, but there are programs where doing the transform in codegen is too late to get the optimal results. So we have some limited transforms in the vectorization passes, and now we're trying to reintroduce load combining as a canonicalization (but in very limited cases and gated by target-specific legality checks).

At a glance, D127392 will not address the issue identified here because it does not handle vector loads.

Right - getting that to work correctly on the most basic integer load patterns is the first step, but we could enhance the transform for more cases (hopefully without too much work).

LLVM has gone back and forth on this. There was a general load combine pass for IR, but it was removed because it interfered with other transforms in IR. So we started hacking away at codegen instead, but there are programs where doing the transform in codegen is too late to get the optimal results. So we have some limited transforms in the vectorization passes, and now we're trying to reintroduce load combining as a canonicalization (but in very limited cases and gated by target-specific legality checks).

With this in mind, perhaps the most consistent / best way to handle this pattern is to catch it in CodeGen (this patch), and, in a separate patch, handle this pattern in a vectorization / instcombine pass (gated by legality checks). It seems to me that catching it in CodeGen will only help things (e.g. in scenarios where it is not handled by transform passes).

jrbyrnes mentioned this in rG1bb293f6582b: [AMDGPU] [DAGCombiner] Precommit test for D133584.Sep 19 2022, 11:39 AM

In D133584#3800314, @jrbyrnes wrote:

LLVM has gone back and forth on this. There was a general load combine pass for IR, but it was removed because it interfered with other transforms in IR. So we started hacking away at codegen instead, but there are programs where doing the transform in codegen is too late to get the optimal results. So we have some limited transforms in the vectorization passes, and now we're trying to reintroduce load combining as a canonicalization (but in very limited cases and gated by target-specific legality checks).

With this in mind, perhaps the most consistent / best way to handle this pattern is to catch it in CodeGen (this patch), and, in a separate patch, handle this pattern in a vectorization / instcombine pass (gated by legality checks). It seems to me that catching it in CodeGen will only help things (e.g. in scenarios where it is not handled by transform passes).

Sure - I didn't look at the diffs closely, but I don't object to improving the SDAG implementation. Just wanted to let you know that there are potential other places to try this kind of transform.

Address review comments -- update usage of Optional API.

Harbormaster completed remote builds in B187567: Diff 461322.Sep 19 2022, 2:34 PM

Sure - I didn't look at the diffs closely, but I don't object to improving the SDAG implementation. Just wanted to let you know that there are potential other places to try this kind of transform.

Thanks, I appreciate your feedback & thoughts on this -- especially with regard to the approach -- as you seem to be very knowledgeable about the design decisions for this issue.

Anyway, it sounds like the approach for now is to land this upstream. Are there additional thoughts about the implementation details, or is this ready?

spatel mentioned this in rGef7d61d67cb9: [AArch64] add tests for vector load combining; NFC.Sep 22 2022, 8:43 AM

spatel added inline comments.Sep 22 2022, 9:35 AM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7855–7859	I added more tests with: ef7d61d67cb9 ...so please update the auto-generated checks there. It might help to add some test or code comments to explain the transforms in those examples. Just looking at this code, I can't tell what the relationship is between the 3 index parameters.
7869	and -> an?
8194	Can this just be updated to use getScalarSizeInBits()?

Address review comments.

Update tests in ef7d61d67cb9

Harbormaster completed remote builds in B188267: Diff 462296.Sep 22 2022, 2:38 PM

jrbyrnes added inline comments.Sep 22 2022, 2:39 PM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7855–7859	Thanks for taking a look! I updated the tests you added, please let me know if you need additional info on the results.

spatel added inline comments.Sep 23 2022, 7:39 AM

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7864	Check that `Depth != 0` instead of adding the Root parameter?
8274–8275	Still need to resolve this - if this can't be: unsigned LoadBitWidth = P.Load->getMemoryVT().getScalarSizeInBits(); ...then please add a test to demonstrate how that fails or a code comment to explain why the simpler code is not valid.
8323–8324	Do we have a test where VectorOffset is non-zero? If not, please add one (add baseline tests as needed, no pre-commit review is necessary).

Address review comments (remove unnecessary "Root" parameter).

Harbormaster completed remote builds in B188421: Diff 462504.Sep 23 2022, 8:34 AM

spatel mentioned this in D127392: [AggressiveInstCombine] Combine consecutive loads which are being merged to form a wider load..Sep 27 2022, 7:41 AM

I'm still not confident in my understanding of the various index values even with the added code comments.
I'll try to step through some of these tests in the debugger to get a better idea, but it would be good if another reviewer can have a look too for a second opinion.

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
8323–8324	This is marked as done, but I can't tell from just looking at the tests which one exercises that path. Add an example/comment here or next to a test, so it's clearer what this looks like?

RKSimon added inline comments.Sep 27 2022, 8:24 AM

llvm/include/llvm/CodeGen/SelectionDAGAddressAnalysis.h
52	Would it be better to replace this with addToOffset() that adds to the existing offset (or set it if == None)?

replace setOffset with addToOffset
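
To make the suggested semantics concrete, here is a minimal sketch of an "add to the existing offset, or set it if unset" helper. `OffsetOnly` is a hypothetical, trimmed-down stand-in for `BaseIndexOffset`, and it uses `std::optional` (C++17) where the patch uses LLVM's `Optional`; only the `addToOffset`, `hasValidOffset` and `getOffset` bodies follow the diff.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical stand-in for BaseIndexOffset that keeps only the optional
// offset; the real class also tracks a base and an index.
class OffsetOnly {
  std::optional<int64_t> Offset;

public:
  OffsetOnly() = default;
  explicit OffsetOnly(int64_t Off) : Offset(Off) {}

  // Add to the existing offset, treating an unset offset as zero.
  void addToOffset(int64_t VectorOff) {
    Offset = Offset.value_or(0) + VectorOff;
  }

  bool hasValidOffset() const { return Offset.has_value(); }
  int64_t getOffset() const { return *Offset; }
};
```

Compared with a plain setter, this form cannot silently discard an offset that was already computed for the base pointer.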

In D133584#3818235, @spatel wrote:

I'm still not confident in my understanding of the various index values even with the added code comments.
I'll try to step through some of these tests in the debugger to get a better idea, but it would be good if another reviewer can have a look too for a second opinion.

Hi -- thanks for your comments and reviews. I realize I never submitted my last round of comments, so I've done so here. Apologies for the delay -- this has probably added to the incomplete understanding.

I have included a description of the various index parameters via example in an inline comment.

llvm/include/llvm/CodeGen/SelectionDAGAddressAnalysis.h
52	That does make more sense, thanks!
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7864	Thanks for pointing this out! In fact, this patch did not add the Root parameter, but it does look unnecessary, so I think it makes sense to remove it along with the other changes in this patch.
8323–8324	The non-zero VectorOffset path is covered by any test that uses an ExtractElement with a non-zero index. As an example, `extractelement <4 x i8> %ld, i32 1` will have a VectorOffset of 1. CalculateByteProvider will only allow such a VectorOffset if we are trying to provide for byte 1. We enforce this by making sure that StartingIndex == VectorOffset. The idea is to match bytes {0, 1, 2, 3} with VectorLoad -> ExtractElement {0, 1, 2, 3}. If the ExtractElement index does not match the StartingIndex, we conservatively assume that we are shuffling the elements in the vector and cannot combine into a load. So this covers the `VectorIndex` (i.e. VectorOffset) and `StartingIndex` parameters in CalculateByteProvider.

The remaining parameter is the original `Index` parameter. This parameter ensures the shift and load width we find are able to provide for the relevant byte. For example, if we are trying to provide for byte 2, then we must find either a Load 8+bit -> SHL 16, or Load 16+bit -> SHL 8, or Load 24+bit. Basically, the combination of shift and byte width must cover the byte we are trying to provide for. If so, then the check `(Index >= NarrowByteWidth)` will be false, and we will return the ByteProvider.
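
The relationship between the extract index and the requested byte can be sketched as a standalone predicate. This is only an illustration: `elementCoversByte` is a hypothetical helper, not a function in the patch, but its two early returns follow the two checks in the `EXTRACT_VECTOR_ELT` case of the diff, including the generalization to elements wider than one byte.

```cpp
#include <cstdint>

// Returns true if vector element `VectorIndex`, whose elements are
// `ElemByteWidth` bytes wide, covers byte `StartingIndex` of the
// combined value.
bool elementCoversByte(uint64_t VectorIndex, uint64_t ElemByteWidth,
                       uint64_t StartingIndex) {
  // The element starts after the requested byte.
  if (VectorIndex * ElemByteWidth > StartingIndex)
    return false;
  // The element ends before the requested byte.
  if ((VectorIndex + 1) * ElemByteWidth <= StartingIndex)
    return false;
  return true;
}
```

For a vector of i8 this reduces to `VectorIndex == StartingIndex`. For a vector of i16, element 1 covers bytes 2 and 3, so `elementCoversByte(1, 2, 2)` and `elementCoversByte(1, 2, 3)` hold while `elementCoversByte(1, 2, 1)` does not.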

Harbormaster completed remote builds in B188979: Diff 463269.Sep 27 2022, 10:03 AM

RKSimon added inline comments.Sep 27 2022, 10:36 AM

llvm/test/CodeGen/AMDGPU/combine-vload-extract.ll
75	Please can you pre-commit these to trunk and then rebase to show the codegen change from this patch?

Rebase

jrbyrnes marked an inline comment as done.Sep 27 2022, 11:38 AM

Harbormaster completed remote builds in B188994: Diff 463292.Sep 27 2022, 12:28 PM

jrbyrnes mentioned this in D134463: [AMDGPU] Use V_PERM to match buildvectors when inputs are not canonicalized (i.e. can't use V_PACK).Sep 29 2022, 11:11 AM

Extend ByteProvider / VectorOffset handling to support vectorScalarTypes > 1 Byte.

Additional comments, tests

jrbyrnes mentioned this in rGf6a2e6afed21: [AMDGPU] Precommit test case for D133584.Sep 30 2022, 12:43 PM

Harbormaster completed remote builds in B189749: Diff 464357.Sep 30 2022, 1:04 PM

Rebase on top of precommitted tests.

Harbormaster completed remote builds in B189761: Diff 464377.Sep 30 2022, 2:27 PM

I still can't say that I see all of the potential corner cases, but the extra comments help to explain what's going on, and it seems to work as expected, so LGTM.
Might still be good to get a 2nd approval from another reviewer since several people have commented.

This revision is now accepted and ready to land.Oct 3 2022, 9:16 AM

In D133584#3830968, @spatel wrote:

I still can't say that I see all of the potential corner cases, but the extra comments help to explain what's going on, and it seems to work as expected, so LGTM.
Might still be good to get a 2nd approval from another reviewer since several people have commented.

Hey @spatel , thanks! I appreciate your help and thoughts on this patch. I'll keep the review up for a bit longer while I do some testing to see if anyone else is willing to approve.

LGTM with nit

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp
7865	deMorgan this

Resolve nit, fix a few comment typos. NFC
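
The final diff on this page still shows the condition in its original form, so the exact rewrite that was intended is not visible here. One plausible reading of the nit, sketched below with hypothetical function names and plain booleans in place of the `SDValue` queries, is to apply De Morgan's law to the parenthesized term so that it reads as "not a vector load".

```cpp
// Form shown in the diff: reject a non-root, multi-use node unless it is
// a vector load.
bool rejectsMultiUse(unsigned Depth, bool HasOneUse, bool IsLoad,
                     bool IsVector) {
  return Depth && !HasOneUse && (!IsLoad || !IsVector);
}

// Equivalent form after applying De Morgan's law to the last term.
bool rejectsMultiUseDeMorgan(unsigned Depth, bool HasOneUse, bool IsLoad,
                             bool IsVector) {
  return Depth && !HasOneUse && !(IsLoad && IsVector);
}
```

Both functions agree for every combination of inputs; the second simply groups the "is a vector load" test into one negated clause.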

arsenm accepted this revision.Oct 4 2022, 9:03 AM

Harbormaster completed remote builds in B190230: Diff 465035.Oct 4 2022, 10:10 AM

Landed via rGcebec4208982

jrbyrnes mentioned this in rGcebec4208982: [DAGCombiner] [AMDGPU] Allow vector loads in MatchLoadCombine.Oct 4 2022, 12:21 PM

Revision Contents

Path                                                            Size
llvm/include/llvm/CodeGen/SelectionDAGAddressAnalysis.h         3 lines
llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp                   152 lines
llvm/test/CodeGen/AArch64/load-combine.ll                       39 lines
llvm/test/CodeGen/AMDGPU/combine-vload-extract.ll               11 lines
llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll    23 lines
llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll   23 lines

Diff 465035

llvm/include/llvm/CodeGen/SelectionDAGAddressAnalysis.h

[... 43 lines not shown ...]
   BaseIndexOffset(SDValue Base, SDValue Index, int64_t Offset,
                   bool IsIndexSignExt)
       : Base(Base), Index(Index), Offset(Offset),
         IsIndexSignExt(IsIndexSignExt) {}

   SDValue getBase() { return Base; }
   SDValue getBase() const { return Base; }
   SDValue getIndex() { return Index; }
   SDValue getIndex() const { return Index; }
+  void addToOffset(int64_t VectorOff) {
+    Offset = Offset.value_or(0) + VectorOff;
+  }
   bool hasValidOffset() const { return Offset.has_value(); }
   int64_t getOffset() const { return *Offset; }

   // Returns true if `Other` and `*this` are both some offset from the same base
   // pointer. In that case, `Off` is set to the offset between `*this` and
   // `Other` (negative if `Other` is before `*this`).
   bool equalBaseIndex(const BaseIndexOffset &Other, const SelectionDAG &DAG,
                       int64_t &Off) const;
[... 37 lines not shown ...]

llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp

[... 7,779 lines not shown ...]
 /// Represents known origin of an individual byte in load combine pattern. The
 /// value of the byte is either constant zero or comes from memory.
 struct ByteProvider {
   // For constant zero providers Load is set to nullptr. For memory providers
   // Load represents the node which loads the byte from memory.
   // ByteOffset is the offset of the byte in the value produced by the load.
   LoadSDNode *Load = nullptr;
   unsigned ByteOffset = 0;
+  unsigned VectorOffset = 0;

   ByteProvider() = default;

-  static ByteProvider getMemory(LoadSDNode *Load, unsigned ByteOffset) {
-    return ByteProvider(Load, ByteOffset);
+  static ByteProvider getMemory(LoadSDNode *Load, unsigned ByteOffset,
+                                unsigned VectorOffset) {
+    return ByteProvider(Load, ByteOffset, VectorOffset);
   }

-  static ByteProvider getConstantZero() { return ByteProvider(nullptr, 0); }
+  static ByteProvider getConstantZero() { return ByteProvider(nullptr, 0, 0); }

   bool isConstantZero() const { return !Load; }
   bool isMemory() const { return Load; }

   bool operator==(const ByteProvider &Other) const {
-    return Other.Load == Load && Other.ByteOffset == ByteOffset;
+    return Other.Load == Load && Other.ByteOffset == ByteOffset &&
+           Other.VectorOffset == VectorOffset;
   }

 private:
-  ByteProvider(LoadSDNode *Load, unsigned ByteOffset)
-      : Load(Load), ByteOffset(ByteOffset) {}
+  ByteProvider(LoadSDNode *Load, unsigned ByteOffset, unsigned VectorOffset)
+      : Load(Load), ByteOffset(ByteOffset), VectorOffset(VectorOffset) {}
 };

 } // end anonymous namespace

 /// Recursively traverses the expression calculating the origin of the requested
 /// byte of the given value. Returns None if the provider can't be calculated.
 ///
-/// For all the values except the root of the expression verifies that the value
-/// has exactly one use and if it's not true return None. This way if the origin
-/// of the byte is returned it's guaranteed that the values which contribute to
-/// the byte are not used outside of this expression.
+/// For all the values except the root of the expression, we verify that the
+/// value has exactly one use and if not then return None. This way if the
+/// origin of the byte is returned it's guaranteed that the values which
+/// contribute to the byte are not used outside of this expression.
+
+/// However, there is a special case when dealing with vector loads -- we allow
+/// more than one use if the load is a vector type. Since the values that
+/// contribute to the byte ultimately come from the ExtractVectorElements of the
+/// Load, we don't care if the Load has uses other than ExtractVectorElements,
+/// because those operations are independent from the pattern to be combined.
+/// For vector loads, we simply care that the ByteProviders are adjacent
+/// positions of the same vector, and their index matches the byte that is being
+/// provided. This is captured by the \p VectorIndex algorithm. \p VectorIndex
+/// is the index used in an ExtractVectorElement, and \p StartingIndex is the
+/// byte position we are trying to provide for the LoadCombine. If these do
+/// not match, then we can not combine the vector loads. \p Index uses the
+/// byte position we are trying to provide for and is matched against the
+/// shl and load size. The \p Index algorithm ensures the requested byte is
+/// provided for by the pattern, and the pattern does not over provide bytes.
 ///
-/// Because the parts of the expression are not allowed to have more than one
-/// use this function iterates over trees, not DAGs. So it never visits the same
-/// node more than once.
+///
+/// The supported LoadCombine pattern for vector loads is as follows
+///                              or
+///                          /        \
+///                         or        shl
+///                       /     \      |
+///                     or      shl   zext
+///                   /    \     |     |
+///                 shl   zext  zext  EVE*
+///                  |     |     |     |
+///                zext   EVE*  EVE*  LOAD
+///                  |     |     |
+///                 EVE*  LOAD  LOAD
+///                  |
+///                 LOAD
+///
+/// *ExtractVectorElement
 static const Optional<ByteProvider>
 calculateByteProvider(SDValue Op, unsigned Index, unsigned Depth,
-                      bool Root = false) {
+                      Optional<uint64_t> VectorIndex,
+                      unsigned StartingIndex = 0) {

   // Typical i64 by i8 pattern requires recursion up to 8 calls depth
   if (Depth == 10)
     return None;

-  if (!Root && !Op.hasOneUse())
+  // Only allow multiple uses if the instruction is a vector load (in which
+  // case we will use the load for every ExtractVectorElement)
+  if (Depth && !Op.hasOneUse() &&
+      (Op.getOpcode() != ISD::LOAD || !Op.getValueType().isVector()))
+    return None;
+
+  // Fail to combine if we have encountered anything but a LOAD after handling
+  // an ExtractVectorElement.
+  if (Op.getOpcode() != ISD::LOAD && VectorIndex.has_value())
     return None;

-  assert(Op.getValueType().isScalarInteger() && "can't handle other types");
   unsigned BitWidth = Op.getValueSizeInBits();
   if (BitWidth % 8 != 0)
     return None;
   unsigned ByteWidth = BitWidth / 8;
   assert(Index < ByteWidth && "invalid index requested");
   (void) ByteWidth;

   switch (Op.getOpcode()) {
   case ISD::OR: {
-    auto LHS = calculateByteProvider(Op->getOperand(0), Index, Depth + 1);
+    auto LHS =
+        calculateByteProvider(Op->getOperand(0), Index, Depth + 1, VectorIndex);
     if (!LHS)
       return None;
-    auto RHS = calculateByteProvider(Op->getOperand(1), Index, Depth + 1);
+    auto RHS =
+        calculateByteProvider(Op->getOperand(1), Index, Depth + 1, VectorIndex);
     if (!RHS)
       return None;

     if (LHS->isConstantZero())
       return RHS;
     if (RHS->isConstantZero())
       return LHS;
     return None;
   }
   case ISD::SHL: {
     auto ShiftOp = dyn_cast<ConstantSDNode>(Op->getOperand(1));
     if (!ShiftOp)
       return None;

     uint64_t BitShift = ShiftOp->getZExtValue();

     if (BitShift % 8 != 0)
       return None;
     uint64_t ByteShift = BitShift / 8;

+    // If we are shifting by an amount greater than the index we are trying to
+    // provide, then do not provide anything. Otherwise, subtract the index by
+    // the amount we shifted by.
     return Index < ByteShift
                ? ByteProvider::getConstantZero()
                : calculateByteProvider(Op->getOperand(0), Index - ByteShift,
-                                       Depth + 1);
+                                       Depth + 1, VectorIndex, Index);
   }
   case ISD::ANY_EXTEND:
   case ISD::SIGN_EXTEND:
   case ISD::ZERO_EXTEND: {
     SDValue NarrowOp = Op->getOperand(0);
     unsigned NarrowBitWidth = NarrowOp.getScalarValueSizeInBits();
     if (NarrowBitWidth % 8 != 0)
       return None;
     uint64_t NarrowByteWidth = NarrowBitWidth / 8;

     if (Index >= NarrowByteWidth)
       return Op.getOpcode() == ISD::ZERO_EXTEND
                  ? Optional<ByteProvider>(ByteProvider::getConstantZero())
                  : None;
-    return calculateByteProvider(NarrowOp, Index, Depth + 1);
+    return calculateByteProvider(NarrowOp, Index, Depth + 1, VectorIndex,
+                                 StartingIndex);
   }
   case ISD::BSWAP:
     return calculateByteProvider(Op->getOperand(0), ByteWidth - Index - 1,
-                                 Depth + 1);
+                                 Depth + 1, VectorIndex, StartingIndex);
+  case ISD::EXTRACT_VECTOR_ELT: {
+    auto OffsetOp = dyn_cast<ConstantSDNode>(Op->getOperand(1));
+    if (!OffsetOp)
+      return None;
+
+    VectorIndex = OffsetOp->getZExtValue();
+
+    SDValue NarrowOp = Op->getOperand(0);
+    unsigned NarrowBitWidth = NarrowOp.getScalarValueSizeInBits();
+    if (NarrowBitWidth % 8 != 0)
+      return None;
+    uint64_t NarrowByteWidth = NarrowBitWidth / 8;
+
+    // Check to see if the position of the element in the vector corresponds
+    // with the byte we are trying to provide for. In the case of a vector of
+    // i8, this simply means the VectorIndex == StartingIndex. For non i8 cases,
+    // the element will provide a range of bytes. For example, if we have a
+    // vector of i16s, each element provides two bytes (V[1] provides byte 2 and
+    // 3).
+    if (VectorIndex.value() * NarrowByteWidth > StartingIndex)
+      return None;
+    if ((VectorIndex.value() + 1) * NarrowByteWidth <= StartingIndex)
+      return None;
+
+    return calculateByteProvider(Op->getOperand(0), Index, Depth + 1,
+                                 VectorIndex, StartingIndex);
+  }
   case ISD::LOAD: {
     auto L = cast<LoadSDNode>(Op.getNode());
     if (!L->isSimple() || L->isIndexed())
       return None;

     unsigned NarrowBitWidth = L->getMemoryVT().getSizeInBits();
     if (NarrowBitWidth % 8 != 0)
       return None;
     uint64_t NarrowByteWidth = NarrowBitWidth / 8;

+    // If the width of the load does not reach byte we are trying to provide for
+    // and it is not a ZEXTLOAD, then the load does not provide for the byte in
+    // question
     if (Index >= NarrowByteWidth)
       return L->getExtensionType() == ISD::ZEXTLOAD
                  ? Optional<ByteProvider>(ByteProvider::getConstantZero())
                  : None;
-    return ByteProvider::getMemory(L, Index);
+
+    unsigned BPVectorIndex = VectorIndex.value_or(0U);
+    return ByteProvider::getMemory(L, Index, BPVectorIndex);
   }
   }

   return None;
 }

 static unsigned littleEndianByteAt(unsigned BW, unsigned i) {
   return i;
▲ Show 20 Lines • Show All 275 Lines • ▼ Show 20 Lines	SDValue DAGCombiner::MatchLoadCombine(SDNode *N) {
EVT VT = N->getValueType(0);		EVT VT = N->getValueType(0);
if (VT != MVT::i16 && VT != MVT::i32 && VT != MVT::i64)		if (VT != MVT::i16 && VT != MVT::i32 && VT != MVT::i64)
return SDValue();		return SDValue();
unsigned ByteWidth = VT.getSizeInBits() / 8;		unsigned ByteWidth = VT.getSizeInBits() / 8;

bool IsBigEndianTarget = DAG.getDataLayout().isBigEndian();		bool IsBigEndianTarget = DAG.getDataLayout().isBigEndian();
auto MemoryByteOffset = [&] (ByteProvider P) {		auto MemoryByteOffset = [&] (ByteProvider P) {
assert(P.isMemory() && "Must be a memory byte provider");		assert(P.isMemory() && "Must be a memory byte provider");
unsigned LoadBitWidth = P.Load->getMemoryVT().getSizeInBits();		unsigned LoadBitWidth = P.Load->getMemoryVT().getScalarSizeInBits();
spatel (Not Done): Can this just be updated to use `getScalarSizeInBits()`?

spatel (Done): Still need to resolve this. If this can't be `unsigned LoadBitWidth = P.Load->getMemoryVT().getScalarSizeInBits();`, then please add a test to demonstrate how that fails, or a code comment to explain why the simpler code is not valid.
assert(LoadBitWidth % 8 == 0 &&		assert(LoadBitWidth % 8 == 0 &&
"can only analyze providers for individual bytes not bit");		"can only analyze providers for individual bytes not bit");
unsigned LoadByteWidth = LoadBitWidth / 8;		unsigned LoadByteWidth = LoadBitWidth / 8;
return IsBigEndianTarget		return IsBigEndianTarget
? bigEndianByteAt(LoadByteWidth, P.ByteOffset)		? bigEndianByteAt(LoadByteWidth, P.ByteOffset)
: littleEndianByteAt(LoadByteWidth, P.ByteOffset);		: littleEndianByteAt(LoadByteWidth, P.ByteOffset);
};		};
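A rough sketch of why the width chosen in `MemoryByteOffset` matters once the load is a vector, written as standalone C++ rather than against the SelectionDAG types. `memoryByteOffset` is a hypothetical stand-in for the lambda, and the body of `bigEndianByteAt` (`BW - i - 1`) is assumed from upstream DAGCombiner.cpp, since it is not shown in this diff. It also assumes that, for a `<4 x i8>` load, each extracted `i8` reaches the lambda with `ByteOffset == 0`.

```cpp
#include <cassert>

// Same shape as the helper shown in this diff.
static unsigned littleEndianByteAt(unsigned /*BW*/, unsigned i) { return i; }

// Assumed body (not shown in this diff): mirror the index within the width.
static unsigned bigEndianByteAt(unsigned BW, unsigned i) { return BW - i - 1; }

// Hypothetical stand-in for the MemoryByteOffset lambda: maps a byte index
// within a loaded value of LoadBitWidth bits to its offset in memory.
unsigned memoryByteOffset(unsigned LoadBitWidth, unsigned ByteOffset,
                          bool IsBigEndianTarget) {
  assert(LoadBitWidth % 8 == 0 &&
         "can only analyze providers for individual bytes not bit");
  unsigned LoadByteWidth = LoadBitWidth / 8;
  assert(ByteOffset < LoadByteWidth && "byte index outside the loaded value");
  return IsBigEndianTarget ? bigEndianByteAt(LoadByteWidth, ByteOffset)
                           : littleEndianByteAt(LoadByteWidth, ByteOffset);
}
```

Under those assumptions, on a big-endian target the scalar width (8 bits) keeps the in-element offset at 0, while the full vector width (32 bits) would give 3, which would then be added on top of the per-element vector offset. On a little-endian target both widths give 0.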

Optional<BaseIndexOffset> Base;		Optional<BaseIndexOffset> Base;
SDValue Chain;		SDValue Chain;

SmallPtrSet<LoadSDNode *, 8> Loads;		SmallPtrSet<LoadSDNode *, 8> Loads;
Optional<ByteProvider> FirstByteProvider;		Optional<ByteProvider> FirstByteProvider;
int64_t FirstOffset = INT64_MAX;		int64_t FirstOffset = INT64_MAX;

// Check if all the bytes of the OR we are looking at are loaded from the same		// Check if all the bytes of the OR we are looking at are loaded from the same
// base address. Collect bytes offsets from Base address in ByteOffsets.		// base address. Collect bytes offsets from Base address in ByteOffsets.
SmallVector<int64_t, 8> ByteOffsets(ByteWidth);		SmallVector<int64_t, 8> ByteOffsets(ByteWidth);
unsigned ZeroExtendedBytes = 0;		unsigned ZeroExtendedBytes = 0;
for (int i = ByteWidth - 1; i >= 0; --i) {		for (int i = ByteWidth - 1; i >= 0; --i) {
jmmartinez (Done): This variable seems unused.
auto P = calculateByteProvider(SDValue(N, 0), i, 0, /*Root=*/true);		auto P = calculateByteProvider(SDValue(N, 0), i, 0, /*VectorIndex*/ None,
		/*StartingIndex*/ i);
if (!P)		if (!P)
return SDValue();		return SDValue();

if (P->isConstantZero()) {		if (P->isConstantZero()) {
// It's OK for the N most significant bytes to be 0, we can just		// It's OK for the N most significant bytes to be 0, we can just
// zero-extend the load.		// zero-extend the load.
if (++ZeroExtendedBytes != (ByteWidth - static_cast<unsigned>(i)))		if (++ZeroExtendedBytes != (ByteWidth - static_cast<unsigned>(i)))
return SDValue();		return SDValue();
continue;		continue;
}		}
assert(P->isMemory() && "provenance should either be memory or zero");		assert(P->isMemory() && "provenance should either be memory or zero");

LoadSDNode *L = P->Load;		LoadSDNode *L = P->Load;
assert(L->hasNUsesOfValue(1, 0) && L->isSimple() &&
!L->isIndexed() &&
"Must be enforced by calculateByteProvider");
assert(L->getOffset().isUndef() && "Unindexed load must have undef offset");

// All loads must share the same chain		// All loads must share the same chain
SDValue LChain = L->getChain();		SDValue LChain = L->getChain();
if (!Chain)		if (!Chain)
Chain = LChain;		Chain = LChain;
else if (Chain != LChain)		else if (Chain != LChain)
return SDValue();		return SDValue();

// Loads must share the same base address		// Loads must share the same base address
BaseIndexOffset Ptr = BaseIndexOffset::match(L, DAG);		BaseIndexOffset Ptr = BaseIndexOffset::match(L, DAG);
int64_t ByteOffsetFromBase = 0;		int64_t ByteOffsetFromBase = 0;

		// For vector loads, the expected load combine pattern will have an
		// ExtractElement for each index in the vector. While each of these
spatel (Done): Do we have a test where VectorOffset is non-zero? If not, please add one (add baseline tests as needed; no pre-commit review is necessary).
spatel (Done): This is marked as done, but I can't tell from just looking at the tests which one exercises that path. Add an example/comment here or next to a test, so it's clearer what this looks like?
jrbyrnes (Author, Done): The non-zero VectorOffset path is covered by any test that uses an ExtractElement with a non-zero index. As an example, `extractelement <4 x i8> %ld, i32 1` will have a VectorOffset of 1. `calculateByteProvider` will only allow such a VectorOffset if we are trying to provide for byte 1. We enforce this by making sure that StartingIndex == VectorOffset. The idea is to match bytes {0, 1, 2, 3} with VectorLoad -> ExtractElement {0, 1, 2, 3}. If we find an ExtractElement index that does not match the StartingIndex, we conservatively assume that we are shuffling the elements in the vector and cannot combine into a load.

So this covers the `VectorIndex` (i.e. VectorOffset) and `StartingIndex` parameters in `calculateByteProvider`. The remaining parameter is the original `Index` parameter. This parameter ensures that the shift and load width we find are able to provide for the relevant byte. For example, if we are trying to provide for byte 2, then we must find either a Load 8+bit -> SHL 16, or a Load 16+bit -> SHL 8, or a Load 24+bit. Basically, the combination of shift and byte width must cover the byte we are trying to provide for. If so, then the check `(Index >= NarrowByteWidth)` will be false, and we will return the ByteProvider.
		// ExtractElements will be accessing the same base address as determined
		// by the load instruction, the actual bytes they interact with will differ
		// due to different ExtractElement indices. To accurately determine the
		// byte position of an ExtractElement, we offset the base load ptr with
		// the index multiplied by the byte size of each element in the vector.
		if (L->getMemoryVT().isVector()) {
		unsigned LoadWidthInBit = L->getMemoryVT().getScalarSizeInBits();
		if (LoadWidthInBit % 8 != 0)
		return SDValue();
		unsigned ByteOffsetFromVector = P->VectorOffset * LoadWidthInBit / 8;
		Ptr.addToOffset(ByteOffsetFromVector);
		}
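The offset adjustment above can be modelled in isolation. The following is a minimal sketch, not the patch's code: `byteOffsetFromVector` is a hypothetical free function that mirrors the `P->VectorOffset * LoadWidthInBit / 8` computation and the bail-out for element widths that are not a whole number of bytes.

```cpp
#include <cassert>

// Hypothetical helper mirroring the vector-load branch in MatchLoadCombine.
// Returns false when the element width is not a whole number of bytes
// (the patch returns SDValue() in that case). Otherwise it writes the byte
// offset of element VectorOffset, measured from the start of the vector load.
bool byteOffsetFromVector(unsigned VectorOffset, unsigned ElementBits,
                          unsigned &ByteOffset) {
  assert(ElementBits != 0 && "element width must be non-zero");
  if (ElementBits % 8 != 0)
    return false;
  ByteOffset = VectorOffset * ElementBits / 8;
  return true;
}
```

For `extractelement <4 x i8> %ld, i32 1` this gives a byte offset of 1; for element 2 of a vector with 16-bit elements it would give 4.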

if (!Base)		if (!Base)
Base = Ptr;		Base = Ptr;

else if (!Base->equalBaseIndex(Ptr, DAG, ByteOffsetFromBase))		else if (!Base->equalBaseIndex(Ptr, DAG, ByteOffsetFromBase))
return SDValue();		return SDValue();

// Calculate the offset of the current byte from the base address		// Calculate the offset of the current byte from the base address
ByteOffsetFromBase += MemoryByteOffset(*P);		ByteOffsetFromBase += MemoryByteOffset(*P);
ByteOffsets[i] = ByteOffsetFromBase;		ByteOffsets[i] = ByteOffsetFromBase;

// Remember the first byte load		// Remember the first byte load
if (ByteOffsetFromBase < FirstOffset) {		if (ByteOffsetFromBase < FirstOffset) {
FirstByteProvider = P;		FirstByteProvider = P;
FirstOffset = ByteOffsetFromBase;		FirstOffset = ByteOffsetFromBase;
}		}

Loads.insert(L);		Loads.insert(L);
}		}

assert(!Loads.empty() && "All the bytes of the value must be loaded from "		assert(!Loads.empty() && "All the bytes of the value must be loaded from "
"memory, so there must be at least one load which produces the value");		"memory, so there must be at least one load which produces the value");
assert(Base && "Base address of the accessed memory location must be set");		assert(Base && "Base address of the accessed memory location must be set");
assert(FirstOffset != INT64_MAX && "First byte offset must be set");		assert(FirstOffset != INT64_MAX && "First byte offset must be set");

bool NeedsZext = ZeroExtendedBytes > 0;		bool NeedsZext = ZeroExtendedBytes > 0;

EVT MemVT =		EVT MemVT =
[... 17,020 lines not shown ...]

llvm/test/CodeGen/AArch64/load-combine.ll

[... 556 lines not shown ...]
%tmp4 = getelementptr inbounds i8, i8* %tmp, i32 0		%tmp4 = getelementptr inbounds i8, i8* %tmp, i32 0
%tmp5 = load i8, i8* %tmp4, align 2		%tmp5 = load i8, i8* %tmp4, align 2
%tmp6 = zext i8 %tmp5 to i32		%tmp6 = zext i8 %tmp5 to i32
%tmp7 = shl nuw nsw i32 %tmp6, 24		%tmp7 = shl nuw nsw i32 %tmp6, 24
%tmp8 = or i32 %tmp7, %tmp30		%tmp8 = or i32 %tmp7, %tmp30
ret i32 %tmp8		ret i32 %tmp8
}		}

		; x1 = x0
define void @short_vector_to_i32(<4 x i8>* %in, i32* %out, i32* %p) {		define void @short_vector_to_i32(<4 x i8>* %in, i32* %out, i32* %p) {
; CHECK-LABEL: short_vector_to_i32:		; CHECK-LABEL: short_vector_to_i32:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: ldr s0, [x0]		; CHECK-NEXT: ldr w8, [x0]
; CHECK-NEXT: ushll v0.8h, v0.8b, #0
; CHECK-NEXT: umov w8, v0.h[0]
; CHECK-NEXT: umov w9, v0.h[1]
; CHECK-NEXT: umov w10, v0.h[2]
; CHECK-NEXT: umov w11, v0.h[3]
; CHECK-NEXT: bfi w8, w9, #8, #8
; CHECK-NEXT: bfi w8, w10, #16, #8
; CHECK-NEXT: bfi w8, w11, #24, #8
; CHECK-NEXT: str w8, [x1]		; CHECK-NEXT: str w8, [x1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%ld = load <4 x i8>, <4 x i8>* %in, align 4		%ld = load <4 x i8>, <4 x i8>* %in, align 4

%e1 = extractelement <4 x i8> %ld, i32 0		%e1 = extractelement <4 x i8> %ld, i32 0
%e2 = extractelement <4 x i8> %ld, i32 1		%e2 = extractelement <4 x i8> %ld, i32 1
%e3 = extractelement <4 x i8> %ld, i32 2		%e3 = extractelement <4 x i8> %ld, i32 2
%e4 = extractelement <4 x i8> %ld, i32 3		%e4 = extractelement <4 x i8> %ld, i32 3
[... 48 lines not shown ...]
store i32 %i3, i32* %out		store i32 %i3, i32* %out
ret void		ret void
}		}

define void @short_vector_to_i32_unused_high_i8(<4 x i8>* %in, i32* %out, i32* %p) {		define void @short_vector_to_i32_unused_high_i8(<4 x i8>* %in, i32* %out, i32* %p) {
; CHECK-LABEL: short_vector_to_i32_unused_high_i8:		; CHECK-LABEL: short_vector_to_i32_unused_high_i8:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: ldr s0, [x0]		; CHECK-NEXT: ldr s0, [x0]
		; CHECK-NEXT: ldrh w9, [x0]
; CHECK-NEXT: ushll v0.8h, v0.8b, #0		; CHECK-NEXT: ushll v0.8h, v0.8b, #0
; CHECK-NEXT: umov w8, v0.h[0]		; CHECK-NEXT: umov w8, v0.h[2]
; CHECK-NEXT: umov w9, v0.h[1]		; CHECK-NEXT: bfi w9, w8, #16, #8
; CHECK-NEXT: umov w10, v0.h[2]		; CHECK-NEXT: str w9, [x1]
; CHECK-NEXT: bfi w8, w9, #8, #8
; CHECK-NEXT: bfi w8, w10, #16, #8
; CHECK-NEXT: str w8, [x1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%ld = load <4 x i8>, <4 x i8>* %in, align 4		%ld = load <4 x i8>, <4 x i8>* %in, align 4

%e1 = extractelement <4 x i8> %ld, i32 0		%e1 = extractelement <4 x i8> %ld, i32 0
%e2 = extractelement <4 x i8> %ld, i32 1		%e2 = extractelement <4 x i8> %ld, i32 1
%e3 = extractelement <4 x i8> %ld, i32 2		%e3 = extractelement <4 x i8> %ld, i32 2

%z0 = zext i8 %e1 to i32		%z0 = zext i8 %e1 to i32
[... 33 lines not shown ...]
%s3 = shl nuw i32 %z3, 24		%s3 = shl nuw i32 %z3, 24

%i3 = or i32 %s2, %s3		%i3 = or i32 %s2, %s3

store i32 %i3, i32* %out		store i32 %i3, i32* %out
ret void		ret void
}		}

		; x1 = x0[0:1]
define void @short_vector_to_i32_unused_high_i16(<4 x i8>* %in, i32* %out, i32* %p) {		define void @short_vector_to_i32_unused_high_i16(<4 x i8>* %in, i32* %out, i32* %p) {
; CHECK-LABEL: short_vector_to_i32_unused_high_i16:		; CHECK-LABEL: short_vector_to_i32_unused_high_i16:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: ldr s0, [x0]		; CHECK-NEXT: ldrh w8, [x0]
; CHECK-NEXT: ushll v0.8h, v0.8b, #0
; CHECK-NEXT: umov w8, v0.h[0]
; CHECK-NEXT: umov w9, v0.h[1]
; CHECK-NEXT: bfi w8, w9, #8, #8
; CHECK-NEXT: str w8, [x1]		; CHECK-NEXT: str w8, [x1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%ld = load <4 x i8>, <4 x i8>* %in, align 4		%ld = load <4 x i8>, <4 x i8>* %in, align 4

%e1 = extractelement <4 x i8> %ld, i32 0		%e1 = extractelement <4 x i8> %ld, i32 0
%e2 = extractelement <4 x i8> %ld, i32 1		%e2 = extractelement <4 x i8> %ld, i32 1

%z0 = zext i8 %e1 to i32		%z0 = zext i8 %e1 to i32
%z1 = zext i8 %e2 to i32		%z1 = zext i8 %e2 to i32

%s1 = shl nuw nsw i32 %z1, 8		%s1 = shl nuw nsw i32 %z1, 8

%i1 = or i32 %s1, %z0		%i1 = or i32 %s1, %z0

store i32 %i1, i32* %out		store i32 %i1, i32* %out
ret void		ret void
}		}
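To make the expected codegen change concrete: the IR above ORs element 0 with element 1 shifted left by 8, which on a little-endian target reads the same two bytes as a zero-extending 16-bit load (the `ldrh` in the updated CHECK lines). Below is a small C++ model of that equivalence. The helper names and the example byte pattern are hypothetical and only illustrate the idea.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Models the IR pattern: zext element 0, zext element 1, shl by 8, then or.
std::uint32_t combineLowTwoElements(const std::uint8_t *v) {
  assert(v && "expected a pointer to at least two bytes");
  std::uint32_t z0 = v[0];
  std::uint32_t z1 = v[1];
  std::uint32_t s1 = z1 << 8;
  return s1 | z0;
}

// Models a zero-extending 16-bit load as seen by a little-endian target.
std::uint32_t zextLoad16LittleEndian(const std::uint8_t *p) {
  assert(p && "expected a pointer to at least two bytes");
  std::uint16_t raw;
  std::memcpy(&raw, p, sizeof raw);
  // Detect the host byte order so the sketch also behaves on a
  // big-endian host: swap the two bytes there.
  const std::uint16_t probe = 1;
  std::uint8_t firstByte;
  std::memcpy(&firstByte, &probe, 1);
  if (firstByte == 0)
    raw = static_cast<std::uint16_t>((raw >> 8) | (raw << 8));
  return raw;
}
```

Both functions return the same value for any input, which is the property the combine relies on when it replaces the extract/shift/or chain with a single narrower load.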

		; x1 = x0
define void @short_vector_to_i64(<4 x i8>* %in, i64* %out, i64* %p) {		define void @short_vector_to_i64(<4 x i8>* %in, i64* %out, i64* %p) {
; CHECK-LABEL: short_vector_to_i64:		; CHECK-LABEL: short_vector_to_i64:
; CHECK: // %bb.0:		; CHECK: // %bb.0:
; CHECK-NEXT: ldr s0, [x0]		; CHECK-NEXT: ldr w8, [x0]
; CHECK-NEXT: ushll v0.8h, v0.8b, #0
; CHECK-NEXT: umov w8, v0.h[0]
; CHECK-NEXT: umov w9, v0.h[1]
; CHECK-NEXT: umov w10, v0.h[2]
; CHECK-NEXT: umov w11, v0.h[3]
; CHECK-NEXT: bfi x8, x9, #8, #8
; CHECK-NEXT: bfi x8, x10, #16, #8
; CHECK-NEXT: bfi x8, x11, #24, #8
; CHECK-NEXT: str x8, [x1]		; CHECK-NEXT: str x8, [x1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
%ld = load <4 x i8>, <4 x i8>* %in, align 4		%ld = load <4 x i8>, <4 x i8>* %in, align 4

%e1 = extractelement <4 x i8> %ld, i32 0		%e1 = extractelement <4 x i8> %ld, i32 0
%e2 = extractelement <4 x i8> %ld, i32 1		%e2 = extractelement <4 x i8> %ld, i32 1
%e3 = extractelement <4 x i8> %ld, i32 2		%e3 = extractelement <4 x i8> %ld, i32 2
%e4 = extractelement <4 x i8> %ld, i32 3		%e4 = extractelement <4 x i8> %ld, i32 3
[... 17 lines not shown ...]

llvm/test/CodeGen/AMDGPU/combine-vload-extract.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -march=amdgcn -mcpu=gfx90a -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s		; RUN: llc -march=amdgcn -mcpu=gfx90a -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s
arsenm (Done): Don’t need -O3?

define amdgpu_kernel void @vectorLoadCombine(<4 x i8>* %in, i32* %out) {		define amdgpu_kernel void @vectorLoadCombine(<4 x i8>* %in, i32* %out) {
; GCN-LABEL: vectorLoadCombine:		; GCN-LABEL: vectorLoadCombine:
; GCN: ; %bb.0: ; %entry		; GCN: ; %bb.0: ; %entry
; GCN-NEXT: s_load_dwordx4 s[0:3], s[0:1], 0x24		; GCN-NEXT: s_load_dwordx4 s[0:3], s[0:1], 0x24
; GCN-NEXT: s_waitcnt lgkmcnt(0)		; GCN-NEXT: s_waitcnt lgkmcnt(0)
; GCN-NEXT: v_mov_b32_e32 v0, s0		; GCN-NEXT: v_mov_b32_e32 v0, s0
; GCN-NEXT: v_mov_b32_e32 v1, s1		; GCN-NEXT: v_mov_b32_e32 v1, s1
; GCN-NEXT: flat_load_dword v2, v[0:1]		; GCN-NEXT: flat_load_dword v2, v[0:1]
; GCN-NEXT: s_mov_b32 s0, 0x6050400
; GCN-NEXT: v_mov_b32_e32 v0, s2		; GCN-NEXT: v_mov_b32_e32 v0, s2
; GCN-NEXT: v_mov_b32_e32 v1, s3		; GCN-NEXT: v_mov_b32_e32 v1, s3
; GCN-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GCN-NEXT: v_bfe_u32 v3, v2, 8, 8
; GCN-NEXT: v_and_b32_e32 v4, 0xff0000, v2
; GCN-NEXT: v_perm_b32 v3, v3, v2, s0
; GCN-NEXT: v_and_b32_e32 v2, 0xff000000, v2
; GCN-NEXT: v_or3_b32 v2, v3, v4, v2
; GCN-NEXT: flat_store_dword v[0:1], v2		; GCN-NEXT: flat_store_dword v[0:1], v2
; GCN-NEXT: s_endpgm		; GCN-NEXT: s_endpgm
entry:		entry:
%0 = load <4 x i8>, <4 x i8>* %in, align 4		%0 = load <4 x i8>, <4 x i8>* %in, align 4
%1 = extractelement <4 x i8> %0, i32 0		%1 = extractelement <4 x i8> %0, i32 0
%2 = extractelement <4 x i8> %0, i32 1		%2 = extractelement <4 x i8> %0, i32 1
%3 = extractelement <4 x i8> %0, i32 2		%3 = extractelement <4 x i8> %0, i32 2
%4 = extractelement <4 x i8> %0, i32 3		%4 = extractelement <4 x i8> %0, i32 3
[... 44 lines not shown ...]
%zext2 = zext i8 %2 to i32		%zext2 = zext i8 %2 to i32
%shift2 = shl nuw nsw i32 %zext2, 16		%shift2 = shl nuw nsw i32 %zext2, 16
%insert2 = or i32 %insert1, %shift2		%insert2 = or i32 %insert1, %shift2
%zext3 = zext i8 %4 to i32		%zext3 = zext i8 %4 to i32
%shift3 = shl nuw i32 %zext3, 24		%shift3 = shl nuw i32 %zext3, 24
%insert3 = or i32 %insert2, %shift3		%insert3 = or i32 %insert2, %shift3
store i32 %insert3, i32* %out		store i32 %insert3, i32* %out
ret void		ret void
}		}
RKSimon (Done): Please can you pre-commit these to trunk and then rebase to show the codegen change from this patch?
define i32 @load_2xi16_combine(i16 addrspace(1)* %p) #0 {		define i32 @load_2xi16_combine(i16 addrspace(1)* %p) #0 {
; GCN-LABEL: load_2xi16_combine:		; GCN-LABEL: load_2xi16_combine:
; GCN: ; %bb.0:		; GCN: ; %bb.0:
; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN-NEXT: global_load_dword v0, v[0:1], off		; GCN-NEXT: global_load_dword v0, v[0:1], off
; GCN-NEXT: s_mov_b32 s4, 0xffff
; GCN-NEXT: s_waitcnt vmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_and_b32_e32 v1, 0xffff0000, v0
; GCN-NEXT: v_and_or_b32 v0, v0, s4, v1
; GCN-NEXT: s_setpc_b64 s[30:31]		; GCN-NEXT: s_setpc_b64 s[30:31]
%gep.p = getelementptr i16, i16 addrspace(1)* %p, i32 1		%gep.p = getelementptr i16, i16 addrspace(1)* %p, i32 1
%p.0 = load i16, i16 addrspace(1)* %p, align 4		%p.0 = load i16, i16 addrspace(1)* %p, align 4
%p.1 = load i16, i16 addrspace(1)* %gep.p, align 4		%p.1 = load i16, i16 addrspace(1)* %gep.p, align 4
%zext.0 = zext i16 %p.0 to i32		%zext.0 = zext i16 %p.0 to i32
%zext.1 = zext i16 %p.1 to i32		%zext.1 = zext i16 %p.1 to i32
%shl.1 = shl i32 %zext.1, 16		%shl.1 = shl i32 %zext.1, 16
%or = or i32 %zext.0, %shl.1		%or = or i32 %zext.0, %shl.1
[... 58 lines not shown ...]
}		}

define i64 @load_4xi16_combine(i16 addrspace(1)* %p) #0 {		define i64 @load_4xi16_combine(i16 addrspace(1)* %p) #0 {
; GCN-LABEL: load_4xi16_combine:		; GCN-LABEL: load_4xi16_combine:
; GCN: ; %bb.0:		; GCN: ; %bb.0:
; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GCN-NEXT: global_load_dwordx2 v[0:1], v[0:1], off		; GCN-NEXT: global_load_dwordx2 v[0:1], v[0:1], off
; GCN-NEXT: s_waitcnt vmcnt(0)		; GCN-NEXT: s_waitcnt vmcnt(0)
; GCN-NEXT: v_and_b32_e32 v2, 0xffff0000, v1
; GCN-NEXT: v_or_b32_sdwa v1, v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_0 src1_sel:DWORD
; GCN-NEXT: s_setpc_b64 s[30:31]		; GCN-NEXT: s_setpc_b64 s[30:31]
%gep.p = getelementptr i16, i16 addrspace(1)* %p, i32 1		%gep.p = getelementptr i16, i16 addrspace(1)* %p, i32 1
%gep.2p = getelementptr i16, i16 addrspace(1)* %p, i32 2		%gep.2p = getelementptr i16, i16 addrspace(1)* %p, i32 2
%gep.3p = getelementptr i16, i16 addrspace(1)* %p, i32 3		%gep.3p = getelementptr i16, i16 addrspace(1)* %p, i32 3
%p.0 = load i16, i16 addrspace(1)* %p, align 4		%p.0 = load i16, i16 addrspace(1)* %p, align 4
%p.1 = load i16, i16 addrspace(1)* %gep.p, align 4		%p.1 = load i16, i16 addrspace(1)* %gep.p, align 4
%p.2 = load i16, i16 addrspace(1)* %gep.2p, align 4		%p.2 = load i16, i16 addrspace(1)* %gep.2p, align 4
%p.3 = load i16, i16 addrspace(1)* %gep.3p, align 4		%p.3 = load i16, i16 addrspace(1)* %gep.3p, align 4
[... 97 lines not shown ...]

llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.global.ll

[... 178 lines not shown ...]
; GFX7-UNALIGNED-NEXT: flat_load_dword v0, v[0:1]		; GFX7-UNALIGNED-NEXT: flat_load_dword v0, v[0:1]
; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0)		; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0)
; GFX7-UNALIGNED-NEXT: s_setpc_b64 s[30:31]		; GFX7-UNALIGNED-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX9-LABEL: global_load_2xi16_align1:		; GFX9-LABEL: global_load_2xi16_align1:
; GFX9: ; %bb.0:		; GFX9: ; %bb.0:
; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX9-NEXT: global_load_dword v0, v[0:1], off		; GFX9-NEXT: global_load_dword v0, v[0:1], off
; GFX9-NEXT: s_mov_b32 s4, 0xffff
; GFX9-NEXT: s_waitcnt vmcnt(0)		; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: v_and_b32_e32 v1, 0xffff0000, v0
; GFX9-NEXT: v_and_or_b32 v0, v0, s4, v1
; GFX9-NEXT: s_setpc_b64 s[30:31]		; GFX9-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX10-LABEL: global_load_2xi16_align1:		; GFX10-LABEL: global_load_2xi16_align1:
; GFX10: ; %bb.0:		; GFX10: ; %bb.0:
; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0		; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
; GFX10-NEXT: global_load_dword v0, v[0:1], off		; GFX10-NEXT: global_load_dword v0, v[0:1], off
; GFX10-NEXT: s_waitcnt vmcnt(0)		; GFX10-NEXT: s_waitcnt vmcnt(0)
; GFX10-NEXT: v_and_b32_e32 v1, 0xffff0000, v0
; GFX10-NEXT: v_and_or_b32 v0, 0xffff, v0, v1
; GFX10-NEXT: s_setpc_b64 s[30:31]		; GFX10-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX11-LABEL: global_load_2xi16_align1:		; GFX11-LABEL: global_load_2xi16_align1:
; GFX11: ; %bb.0:		; GFX11: ; %bb.0:
; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-NEXT: s_waitcnt_vscnt null, 0x0		; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-NEXT: global_load_b32 v0, v[0:1], off		; GFX11-NEXT: global_load_b32 v0, v[0:1], off
; GFX11-NEXT: s_waitcnt vmcnt(0)		; GFX11-NEXT: s_waitcnt vmcnt(0)
; GFX11-NEXT: v_and_b32_e32 v1, 0xffff0000, v0
; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-NEXT: v_and_or_b32 v0, 0xffff, v0, v1
; GFX11-NEXT: s_setpc_b64 s[30:31]		; GFX11-NEXT: s_setpc_b64 s[30:31]
%gep.p = getelementptr i16, i16 addrspace(1)* %p, i64 1		%gep.p = getelementptr i16, i16 addrspace(1)* %p, i64 1
%p.0 = load i16, i16 addrspace(1)* %p, align 1		%p.0 = load i16, i16 addrspace(1)* %p, align 1
%p.1 = load i16, i16 addrspace(1)* %gep.p, align 1		%p.1 = load i16, i16 addrspace(1)* %gep.p, align 1
%zext.0 = zext i16 %p.0 to i32		%zext.0 = zext i16 %p.0 to i32
%zext.1 = zext i16 %p.1 to i32		%zext.1 = zext i16 %p.1 to i32
%shl.1 = shl i32 %zext.1, 16		%shl.1 = shl i32 %zext.1, 16
%or = or i32 %zext.0, %shl.1		%or = or i32 %zext.0, %shl.1
[... 68 lines not shown ...]
%gep.r = getelementptr i16, i16 addrspace(1)* %r, i64 1		%gep.r = getelementptr i16, i16 addrspace(1)* %r, i64 1
store i16 1, i16 addrspace(1)* %r, align 1		store i16 1, i16 addrspace(1)* %r, align 1
store i16 2, i16 addrspace(1)* %gep.r, align 1		store i16 2, i16 addrspace(1)* %gep.r, align 1
ret void		ret void
}		}

; Should merge this to a dword load		; Should merge this to a dword load
define i32 @global_load_2xi16_align4(i16 addrspace(1)* %p) #0 {		define i32 @global_load_2xi16_align4(i16 addrspace(1)* %p) #0 {
; GFX7-LABEL: load_2xi16_align4:
; GFX7: ; %bb.0:
; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX7-NEXT: flat_load_dword v0, v[0:1]
; GFX7-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX7-NEXT: s_setpc_b64 s[30:31]
;
; GFX7-ALIGNED-LABEL: global_load_2xi16_align4:		; GFX7-ALIGNED-LABEL: global_load_2xi16_align4:
; GFX7-ALIGNED: ; %bb.0:		; GFX7-ALIGNED: ; %bb.0:
; GFX7-ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX7-ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX7-ALIGNED-NEXT: flat_load_dword v0, v[0:1]		; GFX7-ALIGNED-NEXT: flat_load_dword v0, v[0:1]
; GFX7-ALIGNED-NEXT: s_waitcnt vmcnt(0)		; GFX7-ALIGNED-NEXT: s_waitcnt vmcnt(0)
; GFX7-ALIGNED-NEXT: s_setpc_b64 s[30:31]		; GFX7-ALIGNED-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX7-UNALIGNED-LABEL: global_load_2xi16_align4:		; GFX7-UNALIGNED-LABEL: global_load_2xi16_align4:
; GFX7-UNALIGNED: ; %bb.0:		; GFX7-UNALIGNED: ; %bb.0:
; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX7-UNALIGNED-NEXT: flat_load_dword v0, v[0:1]		; GFX7-UNALIGNED-NEXT: flat_load_dword v0, v[0:1]
; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0)		; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0)
; GFX7-UNALIGNED-NEXT: s_setpc_b64 s[30:31]		; GFX7-UNALIGNED-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX9-LABEL: global_load_2xi16_align4:		; GFX9-LABEL: global_load_2xi16_align4:
; GFX9: ; %bb.0:		; GFX9: ; %bb.0:
; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX9-NEXT: global_load_dword v0, v[0:1], off		; GFX9-NEXT: global_load_dword v0, v[0:1], off
; GFX9-NEXT: s_mov_b32 s4, 0xffff
; GFX9-NEXT: s_waitcnt vmcnt(0)		; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: v_and_b32_e32 v1, 0xffff0000, v0
; GFX9-NEXT: v_and_or_b32 v0, v0, s4, v1
; GFX9-NEXT: s_setpc_b64 s[30:31]		; GFX9-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX10-LABEL: global_load_2xi16_align4:		; GFX10-LABEL: global_load_2xi16_align4:
; GFX10: ; %bb.0:		; GFX10: ; %bb.0:
; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0		; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
; GFX10-NEXT: global_load_dword v0, v[0:1], off		; GFX10-NEXT: global_load_dword v0, v[0:1], off
; GFX10-NEXT: s_waitcnt vmcnt(0)		; GFX10-NEXT: s_waitcnt vmcnt(0)
; GFX10-NEXT: v_and_b32_e32 v1, 0xffff0000, v0
; GFX10-NEXT: v_and_or_b32 v0, 0xffff, v0, v1
; GFX10-NEXT: s_setpc_b64 s[30:31]		; GFX10-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX11-LABEL: global_load_2xi16_align4:		; GFX11-LABEL: global_load_2xi16_align4:
; GFX11: ; %bb.0:		; GFX11: ; %bb.0:
; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-NEXT: s_waitcnt_vscnt null, 0x0		; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-NEXT: global_load_b32 v0, v[0:1], off		; GFX11-NEXT: global_load_b32 v0, v[0:1], off
; GFX11-NEXT: s_waitcnt vmcnt(0)		; GFX11-NEXT: s_waitcnt vmcnt(0)
; GFX11-NEXT: v_and_b32_e32 v1, 0xffff0000, v0
; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-NEXT: v_and_or_b32 v0, 0xffff, v0, v1
; GFX11-NEXT: s_setpc_b64 s[30:31]		; GFX11-NEXT: s_setpc_b64 s[30:31]
%gep.p = getelementptr i16, i16 addrspace(1)* %p, i64 1		%gep.p = getelementptr i16, i16 addrspace(1)* %p, i64 1
%p.0 = load i16, i16 addrspace(1)* %p, align 4		%p.0 = load i16, i16 addrspace(1)* %p, align 4
%p.1 = load i16, i16 addrspace(1)* %gep.p, align 2		%p.1 = load i16, i16 addrspace(1)* %gep.p, align 2
%zext.0 = zext i16 %p.0 to i32		%zext.0 = zext i16 %p.0 to i32
%zext.1 = zext i16 %p.1 to i32		%zext.1 = zext i16 %p.1 to i32
%shl.1 = shl i32 %zext.1, 16		%shl.1 = shl i32 %zext.1, 16
%or = or i32 %zext.0, %shl.1		%or = or i32 %zext.0, %shl.1
[... 99 lines not shown ...]

llvm/test/CodeGen/AMDGPU/fast-unaligned-load-store.private.ll

[... 381 lines not shown ...]
%gep.r = getelementptr i16, i16 addrspace(5)* %r, i64 1		%gep.r = getelementptr i16, i16 addrspace(5)* %r, i64 1
store i16 1, i16 addrspace(5)* %r, align 1		store i16 1, i16 addrspace(5)* %r, align 1
store i16 2, i16 addrspace(5)* %gep.r, align 1		store i16 2, i16 addrspace(5)* %gep.r, align 1
ret void		ret void
}		}

; Should merge this to a dword load		; Should merge this to a dword load
define i32 @private_load_2xi16_align4(i16 addrspace(5)* %p) #0 {		define i32 @private_load_2xi16_align4(i16 addrspace(5)* %p) #0 {
; GFX7-LABEL: load_2xi16_align4:
; GFX7: ; %bb.0:
; GFX7-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX7-NEXT: flat_load_dword v0, v[0:1]
; GFX7-NEXT: s_waitcnt vmcnt(0) lgkmcnt(0)
; GFX7-NEXT: s_setpc_b64 s[30:31]
;
; GFX7-ALIGNED-LABEL: private_load_2xi16_align4:		; GFX7-ALIGNED-LABEL: private_load_2xi16_align4:
; GFX7-ALIGNED: ; %bb.0:		; GFX7-ALIGNED: ; %bb.0:
; GFX7-ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX7-ALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX7-ALIGNED-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen		; GFX7-ALIGNED-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen
; GFX7-ALIGNED-NEXT: s_waitcnt vmcnt(0)		; GFX7-ALIGNED-NEXT: s_waitcnt vmcnt(0)
; GFX7-ALIGNED-NEXT: s_setpc_b64 s[30:31]		; GFX7-ALIGNED-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX7-UNALIGNED-LABEL: private_load_2xi16_align4:		; GFX7-UNALIGNED-LABEL: private_load_2xi16_align4:
; GFX7-UNALIGNED: ; %bb.0:		; GFX7-UNALIGNED: ; %bb.0:
; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX7-UNALIGNED-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen		; GFX7-UNALIGNED-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen
; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0)		; GFX7-UNALIGNED-NEXT: s_waitcnt vmcnt(0)
; GFX7-UNALIGNED-NEXT: s_setpc_b64 s[30:31]		; GFX7-UNALIGNED-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX9-LABEL: private_load_2xi16_align4:		; GFX9-LABEL: private_load_2xi16_align4:
; GFX9: ; %bb.0:		; GFX9: ; %bb.0:
; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX9-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX9-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen		; GFX9-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen
; GFX9-NEXT: s_mov_b32 s4, 0xffff
; GFX9-NEXT: s_waitcnt vmcnt(0)		; GFX9-NEXT: s_waitcnt vmcnt(0)
; GFX9-NEXT: v_and_b32_e32 v1, 0xffff0000, v0
; GFX9-NEXT: v_and_or_b32 v0, v0, s4, v1
; GFX9-NEXT: s_setpc_b64 s[30:31]		; GFX9-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX9-FLASTSCR-LABEL: private_load_2xi16_align4:		; GFX9-FLASTSCR-LABEL: private_load_2xi16_align4:
; GFX9-FLASTSCR: ; %bb.0:		; GFX9-FLASTSCR: ; %bb.0:
; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX9-FLASTSCR-NEXT: scratch_load_dword v0, v0, off		; GFX9-FLASTSCR-NEXT: scratch_load_dword v0, v0, off
; GFX9-FLASTSCR-NEXT: s_mov_b32 s0, 0xffff
; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0)		; GFX9-FLASTSCR-NEXT: s_waitcnt vmcnt(0)
; GFX9-FLASTSCR-NEXT: v_and_b32_e32 v1, 0xffff0000, v0
; GFX9-FLASTSCR-NEXT: v_and_or_b32 v0, v0, s0, v1
; GFX9-FLASTSCR-NEXT: s_setpc_b64 s[30:31]		; GFX9-FLASTSCR-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX10-LABEL: private_load_2xi16_align4:		; GFX10-LABEL: private_load_2xi16_align4:
; GFX10: ; %bb.0:		; GFX10: ; %bb.0:
; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX10-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX10-NEXT: s_waitcnt_vscnt null, 0x0		; GFX10-NEXT: s_waitcnt_vscnt null, 0x0
; GFX10-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen		; GFX10-NEXT: buffer_load_dword v0, v0, s[0:3], 0 offen
; GFX10-NEXT: s_waitcnt vmcnt(0)		; GFX10-NEXT: s_waitcnt vmcnt(0)
; GFX10-NEXT: v_and_b32_e32 v1, 0xffff0000, v0
; GFX10-NEXT: v_and_or_b32 v0, 0xffff, v0, v1
; GFX10-NEXT: s_setpc_b64 s[30:31]		; GFX10-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX10-FLASTSCR-LABEL: private_load_2xi16_align4:		; GFX10-FLASTSCR-LABEL: private_load_2xi16_align4:
; GFX10-FLASTSCR: ; %bb.0:		; GFX10-FLASTSCR: ; %bb.0:
; GFX10-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX10-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX10-FLASTSCR-NEXT: s_waitcnt_vscnt null, 0x0		; GFX10-FLASTSCR-NEXT: s_waitcnt_vscnt null, 0x0
; GFX10-FLASTSCR-NEXT: scratch_load_dword v0, v0, off		; GFX10-FLASTSCR-NEXT: scratch_load_dword v0, v0, off
; GFX10-FLASTSCR-NEXT: s_waitcnt vmcnt(0)		; GFX10-FLASTSCR-NEXT: s_waitcnt vmcnt(0)
; GFX10-FLASTSCR-NEXT: v_and_b32_e32 v1, 0xffff0000, v0
; GFX10-FLASTSCR-NEXT: v_and_or_b32 v0, 0xffff, v0, v1
; GFX10-FLASTSCR-NEXT: s_setpc_b64 s[30:31]		; GFX10-FLASTSCR-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX11-LABEL: private_load_2xi16_align4:		; GFX11-LABEL: private_load_2xi16_align4:
; GFX11: ; %bb.0:		; GFX11: ; %bb.0:
; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX11-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-NEXT: s_waitcnt_vscnt null, 0x0		; GFX11-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-NEXT: scratch_load_b32 v0, v0, off		; GFX11-NEXT: scratch_load_b32 v0, v0, off
; GFX11-NEXT: s_waitcnt vmcnt(0)		; GFX11-NEXT: s_waitcnt vmcnt(0)
; GFX11-NEXT: v_and_b32_e32 v1, 0xffff0000, v0
; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-NEXT: v_and_or_b32 v0, 0xffff, v0, v1
; GFX11-NEXT: s_setpc_b64 s[30:31]		; GFX11-NEXT: s_setpc_b64 s[30:31]
;		;
; GFX11-FLASTSCR-LABEL: private_load_2xi16_align4:		; GFX11-FLASTSCR-LABEL: private_load_2xi16_align4:
; GFX11-FLASTSCR: ; %bb.0:		; GFX11-FLASTSCR: ; %bb.0:
; GFX11-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)		; GFX11-FLASTSCR-NEXT: s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GFX11-FLASTSCR-NEXT: s_waitcnt_vscnt null, 0x0		; GFX11-FLASTSCR-NEXT: s_waitcnt_vscnt null, 0x0
; GFX11-FLASTSCR-NEXT: scratch_load_b32 v0, v0, off		; GFX11-FLASTSCR-NEXT: scratch_load_b32 v0, v0, off
; GFX11-FLASTSCR-NEXT: s_waitcnt vmcnt(0)		; GFX11-FLASTSCR-NEXT: s_waitcnt vmcnt(0)
; GFX11-FLASTSCR-NEXT: v_and_b32_e32 v1, 0xffff0000, v0
; GFX11-FLASTSCR-NEXT: s_delay_alu instid0(VALU_DEP_1)
; GFX11-FLASTSCR-NEXT: v_and_or_b32 v0, 0xffff, v0, v1
; GFX11-FLASTSCR-NEXT: s_setpc_b64 s[30:31]		; GFX11-FLASTSCR-NEXT: s_setpc_b64 s[30:31]
%gep.p = getelementptr i16, i16 addrspace(5)* %p, i64 1		%gep.p = getelementptr i16, i16 addrspace(5)* %p, i64 1
%p.0 = load i16, i16 addrspace(5)* %p, align 4		%p.0 = load i16, i16 addrspace(5)* %p, align 4
%p.1 = load i16, i16 addrspace(5)* %gep.p, align 2		%p.1 = load i16, i16 addrspace(5)* %gep.p, align 2
%zext.0 = zext i16 %p.0 to i32		%zext.0 = zext i16 %p.0 to i32
%zext.1 = zext i16 %p.1 to i32		%zext.1 = zext i16 %p.1 to i32
%shl.1 = shl i32 %zext.1, 16		%shl.1 = shl i32 %zext.1, 16
%or = or i32 %zext.0, %shl.1		%or = or i32 %zext.0, %shl.1

Diff 465035

