This is an archive of the discontinued LLVM Phabricator instance.

Differential D80364

[amdgpu] Teach load widening to handle non-DWORD aligned loads.
ClosedPublic

Authored by hliao on May 20 2020, 11:46 PM.

Details

Reviewers

rampitec
arsenm

Commits

rG46c3d5cb05d6: [amdgpu] Add the late codegen preparation pass.

Summary

If a load is naturally aligned but not DWORD aligned, load widening should handle the case where the address is the sum of a base pointer and a constant offset. Such a load can be widened by aligning the address down to a DWORD boundary and extracting the narrow value from the high part of the wider load.
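As a rough illustration of the summary, here is a minimal host-side sketch in C++. It is not the patch's code: `loadU16` and `loadU16Widened` are hypothetical names, and the sketch assumes a little-endian host and a 16-bit value at an offset with `off % 4 == 2`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Original narrow load: read a naturally aligned 16-bit value at base + off.
uint16_t loadU16(const unsigned char *base, std::size_t off) {
  uint16_t v;
  std::memcpy(&v, base + off, sizeof v);
  return v;
}

// Widened form: read the enclosing DWORD at base + off - 2 and keep its high
// half. Assumes a little-endian host and off % 4 == 2 (hypothetical helper).
uint16_t loadU16Widened(const unsigned char *base, std::size_t off) {
  assert(off % 4 == 2 && "only the non-DWORD-aligned half is handled here");
  uint32_t w;
  std::memcpy(&w, base + (off - 2), sizeof w);
  return static_cast<uint16_t>(w >> 16);
}
```

Under those assumptions both functions return the same value, which is why the narrow load can be replaced by a DWORD load plus a shift.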

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hliao created this revision.May 20 2020, 11:46 PM

Herald added a project: Restricted Project.May 20 2020, 11:46 PM

Herald added subscribers: llvm-commits, kerbowa, hiraditya and 8 others.

Harbormaster completed remote builds in B57496: Diff 265435.May 21 2020, 1:36 AM

arsenm added inline comments.May 21 2020, 7:11 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7469	Typo quaranteed
7472–7475	This will be true for all of the preloaded SGPRs. However I think looking for the argument copies here is the wrong approach. The original intrinsic calls should have been annotated with the align return value attribute. Currently this is a burden on the frontend/library call emitting the intrinsic. We could either annotate the intrinsic calls in one of the later passes (maybe AMDGPUCodeGenPrepare), or add a minimum alignment to the intrinsic definition which will always be applied, similar to how intrinsic declarations already get their other attributes
7509–7550	We already have essentially the same code in lowerKernargMemParameter (plus an IR version in AMDGPULowerKernelArguments). This just generalizes to any isBaseWithConstantOffset. Can you refactor these to avoid the duplication?
7535	Can you use one of the simpler getLoad overloads? We don't really ever need to mention unindexed
7537	I think you could have a better alignment than 4 here
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dispatch.ptr.ll
32	Should also have some tests with multiple loads. You should make sure that a use of the base address load, and a use of the offset address both CSE into a single load with 1 shift. Also could test with an explicit align attribute on the dispatch ptr call

hliao marked an inline comment as done.May 21 2020, 11:26 AM

hliao added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475	do we have that attribute available? could you elaborate in detail?

hliao marked an inline comment as done.May 21 2020, 11:45 AM

hliao added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475	OK, I found that.

hliao mentioned this in D80422: Enable `align <n>` to be used in intrinsic definitions..May 21 2020, 9:20 PM

hliao marked an inline comment as done.May 22 2020, 11:45 AM

hliao added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475	In another review, D80422, the relevant intrinsics are being annotated with an assumed alignment on the returned pointer. Unfortunately, in SelectionDAG, we don't have a facility to keep track of that hint. I will enhance the `AMDGPUCodeGenPrepare` pass to do something similar.

arsenm added inline comments.May 22 2020, 1:13 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475	We don't need to explicitly know the pointer alignment. If the intrinsic was properly annotated from the beginning (as would be the case after D80422), the downstream load users would have the correct alignment assigned. The optimizer propagates alignment information already

hliao marked an inline comment as done.May 22 2020, 2:01 PM

hliao added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475	That load won't be marked as DWORD aligned even after those intrinsics are annotated correctly, because it simply is not DWORD aligned. For example, the newly added test has an i16 load from the pointer (dispatch.ptr + 6), which is not DWORD aligned. The transformation performed here widens such a load if that pointer is calculated as a DWORD-aligned pointer (`base`) plus a constant offset (`off`). With that, (i16 load (add base, off)) is translated into (trunc (lshr (i32 load (add base, off - 2)), 16)). We need to know that base is DWORD aligned. However, in SDNode, we don't have any facility to retain the original alignment from LLVM IR.
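The precondition described in this comment can be sketched as a small decision helper. This is only an illustration in plain C++: `WidenPlan` and `planI16Widen` are hypothetical names, not part of the patch, and the helper only covers the i16 case discussed here.

```cpp
#include <cstdint>

// Result of the rewrite: where the DWORD load reads from and how far the
// loaded value must be shifted right to expose the original i16.
struct WidenPlan {
  int64_t newOffset;  // byte offset of the i32 load relative to base
  unsigned shiftBits; // right-shift amount applied to the i32 value
};

// Hypothetical helper: returns true if an i16 load at (base + off) can be
// rewritten as trunc(lshr(i32 load (base + off - 2), 16)).
bool planI16Widen(uint64_t baseAlign, int64_t off, WidenPlan &plan) {
  // The base pointer itself must be known to be DWORD aligned.
  if (baseAlign == 0 || baseAlign % 4 != 0)
    return false;
  // Only the "upper half of a DWORD" case needs this rebasing.
  int64_t rem = ((off % 4) + 4) % 4;
  if (rem != 2)
    return false;
  plan = WidenPlan{off - 2, 16};
  return true;
}
```

For the test's dispatch.ptr + 6 case with a 4-byte-aligned base, the helper yields an i32 load at offset 4 followed by a shift of 16, which is the shape the `test2` checks in this revision look for.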

arsenm added inline comments.May 22 2020, 2:37 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475	is computeKnownBitsForTargetNode called for CopyFromReg? If so, we could move the argument check in there

hliao marked an inline comment as done.May 22 2020, 6:08 PM

hliao added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475	unfortunately NO

Rewrite such transformations in LLVM IR as the late codegen preparation.

Herald added a subscriber: mgorny.May 24 2020, 2:49 PM

hliao marked an inline comment as done.May 24 2020, 2:49 PM

Harbormaster failed remote builds in B57759: Diff 265945!May 24 2020, 3:59 PM

In D80364#2052781, @hliao wrote:

Rewrite such transformations in LLVM IR as the late codegen preparation.

This change depends on D80422 to add the align attribute properly on the relevant intrinsics.

Rebase the latest trunk.

I'd still like to find a way to avoid a whole extra pass run for this. In the test here, the LoadStoreVectorizer should have vectorized these? Why didn't it?

llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
32–36 ↗	(On Diff #266655)	If a separate pass is going to do the load widening, you can remove it from AMDGPUCodeGenPrepare
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dispatch.ptr.ll
33	Missing align?

Harbormaster failed remote builds in B58107: Diff 266655!May 28 2020, 3:45 AM

In D80364#2058794, @arsenm wrote:

I'd still like to find a way to avoid a whole extra pass run for this. In the test here, the LoadStoreVectorizer should have vectorized these? Why didn't it?

That's due to the misaligned load after coalescing. This example is written intentionally to skip LSV and verify that common widened load could be CSEd within DAG. In practice, there won't be such input as the 1st i16 load should be properly annotated with align 4.

hliao marked 2 inline comments as done.May 28 2020, 8:47 AM

hliao added inline comments.

llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
32–36 ↗	(On Diff #266655)	I could clean that after this.
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dispatch.ptr.ll
33	that align attribute will be added implicitly if there's no explicit overriding.

In D80364#2060318, @hliao wrote:

In D80364#2058794, @arsenm wrote:

I'd still like to find a way to avoid a whole extra pass run for this. In the test here, the LoadStoreVectorizer should have vectorized these? Why didn't it?

That's due to the misaligned load after coalescing. This example is written intentionally to skip LSV and verify that common widened load could be CSEd within DAG. In practice, there won't be such input as the 1st i16 load should be properly annotated with align 4.

But the load isn't misaligned. The point of adding the alignment to the intrinsic declaration was so that the whole optimization pipeline would know about the alignment. Taking this example:

after opt -instcombine:

define amdgpu_kernel void @test3(i32 addrspace(1)* nocapture %out) local_unnamed_addr #0 {
  %dispatch_ptr = tail call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()
  %d1 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 4
  %h1 = bitcast i8 addrspace(4)* %d1 to i16 addrspace(4)*
  %v1 = load i16, i16 addrspace(4)* %h1, align 4
  %e1 = zext i16 %v1 to i32
  %d2 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 6
  %h2 = bitcast i8 addrspace(4)* %d2 to i16 addrspace(4)*
  %v2 = load i16, i16 addrspace(4)* %h2, align 2
  %e2 = zext i16 %v2 to i32
  %o = add nuw nsw i32 %e2, %e1
  store i32 %o, i32 addrspace(1)* %out, align 4
  ret void
}

Now the load's alignment was correctly inferred to be higher, so then the vectorizer handles this as expected:

opt -instcombine -load-store-vectorizer -instsimplify gives:

define amdgpu_kernel void @test3(i32 addrspace(1)* nocapture %out) local_unnamed_addr #0 {
  %dispatch_ptr = tail call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()
  %d1 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 4
  %h1 = bitcast i8 addrspace(4)* %d1 to i16 addrspace(4)*
  %1 = bitcast i16 addrspace(4)* %h1 to <2 x i16> addrspace(4)*
  %2 = load <2 x i16>, <2 x i16> addrspace(4)* %1, align 4
  %v11 = extractelement <2 x i16> %2, i32 0
  %v22 = extractelement <2 x i16> %2, i32 1
  %e1 = zext i16 %v11 to i32
  %e2 = zext i16 %v22 to i32
  %o = add nuw nsw i32 %e2, %e1
  store i32 %o, i32 addrspace(1)* %out, align 4
  ret void
}

So I think with the alignment patch, for any real world example, this would do the right thing. Your example fails here because the IR wasn't pre-optimized. We don't need to expect perfect optimization from the backend for any random IR and need to consider the entire pass pipeline

In D80364#2061176, @arsenm wrote:

In D80364#2060318, @hliao wrote:

In D80364#2058794, @arsenm wrote:

I'd still like to find a way to avoid a whole extra pass run for this. In the test here, the LoadStoreVectorizer should have vectorized these? Why didn't it?

That's due to the misaligned load after coalescing. This example is written intentionally to skip LSV and verify that common widened load could be CSEd within DAG. In practice, there won't be such input as the 1st i16 load should be properly annotated with align 4.

But the load isn't misaligned. The point of adding the alignment to the intrinsic declaration was so that the whole optimization pipeline would know about the alignment. Taking this example:

after opt -instcombine:

but we run llc only in this test and skip all middle-end optimizations. That's why I said this test is impractical in the normal scenario where middle-end is always run. This test is added to address your previous concern on CSE of widened loads. If not required, I could remove this test.

In D80364#2061226, @hliao wrote:

In D80364#2061176, @arsenm wrote:

In D80364#2060318, @hliao wrote:

In D80364#2058794, @arsenm wrote:

I'd still like to find a way to avoid a whole extra pass run for this. In the test here, the LoadStoreVectorizer should have vectorized these? Why didn't it?

That's due to the misaligned load after coalescing. This example is written intentionally to skip LSV and verify that common widened load could be CSEd within DAG. In practice, there won't be such input as the 1st i16 load should be properly annotated with align 4.

But the load isn't misaligned. The point of adding the alignment to the intrinsic declaration was so that the whole optimization pipeline would know about the alignment. Taking this example:

after opt -instcombine:

but we run llc only in this test and skip all middle-end optimizations. That's why I said this test is impractical in the normal scenario where middle-end is always run. This test is added to address your previous concern on CSE of widened loads. If not required, I could remove this test.

This was more of a concern when letting the DAG handle it. Since you added the implied align attribute, I don't think anything else should be needed

Remove an impractical test.

In D80364#2061379, @hliao wrote:

Remove an impractical test.

I mean beyond the test, I think the whole patch is unnecessary now. Do you have an example of real source that still needs this?

Harbormaster completed remote builds in B58107: Diff 266655.May 28 2020, 6:43 PM

Harbormaster completed remote builds in B58322: Diff 267029.May 28 2020, 7:47 PM

In D80364#2061467, @arsenm wrote:

In D80364#2061379, @hliao wrote:

Remove an impractical test.

I mean beyond the test, I think the whole patch is unnecessary now. Do you have an example of real source that still needs this?

test2, that's a real example. The widening done in this patch is not supported in any code in LLVM. The load, by itself, is not DWORD aligned but only naturally aligned no matter whether alignment marking is correct or not.

I did some experiments locally and think this can stay in AMDGPUCodeGenPrepare, and doesn't need the split pass. Since you restrict this widening to the case where you're rebasing the load anyway, I don't think this will cause the same problems with the vectorizer the previous IR load widening had (and may help it even?)

test3 should also come back, but should have the explicit align 4 added to the load. This could also use some loads of i8, and <2 x i8>. We could also extend this to handle wider, sub-dword aligned types but that's a separate patch.

In D80364#2063603, @arsenm wrote:

I did some experiments locally and think this can stay in AMDGPUCodeGenPrepare, and doesn't need the split pass. Since you restrict this widening to the case where you're rebasing the load anyway, I don't think this will cause the same problems with the vectorizer the previous IR load widening had (and may help it even?)

test3 should also come back, but should have the explicit align 4 added to the load. This could also use some loads of i8, and <2 x i8>. We could also extend this to handle wider, sub-dword aligned types but that's a separate patch.

Scalar load widening should run after LSV to avoid generating redundant loads. A sequence of consecutive i16 loads, for example, benefits from this ordering. Here are the details:

for 4 loads of i16

ld.i16 (ptr + 0)
ld.i16 (ptr + 2)
ld.i16 (ptr + 4)
ld.i16 (ptr + 6)

If we run scalar load widening before LSV, then after widening we have

ld.i16 (ptr + 0)
ld.i32 (ptr + 0)
ld.i16 (ptr + 4)
ld.i32 (ptr + 4)

After LSV, we have

ld.i16 (ptr + 0)
ld.i32x2 (ptr + 0)
ld.i16 (ptr + 4)

Those two i16 loads are redundant. If we run scalar load widening after LSV, we won't get that result.
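The ordering argument can be replayed with a toy model. The C++ sketch below is an assumption-laden simplification, not the real LoadStoreVectorizer or the pass in this patch: `Load`, `widen`, and `vectorize` are hypothetical, the "vectorizer" only merges adjacent loads of the same element width, and the "widening" only rebases scalar i16 loads at offsets with `off % 4 == 2`.

```cpp
#include <algorithm>
#include <vector>

// Toy load: byte offset from a DWORD-aligned pointer, element width in bytes,
// and number of elements (1 for a scalar load).
struct Load {
  int off;
  int elem;
  int count;
};

// Toy scalar widening: a scalar i16 load at off % 4 == 2 becomes an i32 load
// at off - 2. Vector loads are left alone.
std::vector<Load> widen(std::vector<Load> loads) {
  for (Load &l : loads)
    if (l.elem == 2 && l.count == 1 && l.off % 4 == 2)
      l = Load{l.off - 2, 4, 1};
  return loads;
}

// Toy vectorizer: merges adjacent loads that share the same element width.
std::vector<Load> vectorize(std::vector<Load> loads) {
  std::sort(loads.begin(), loads.end(), [](const Load &a, const Load &b) {
    return a.elem != b.elem ? a.elem < b.elem : a.off < b.off;
  });
  std::vector<Load> out;
  for (const Load &l : loads) {
    if (!out.empty() && out.back().elem == l.elem &&
        out.back().off + out.back().elem * out.back().count == l.off)
      out.back().count += l.count; // extend the current vector load
    else
      out.push_back(l);
  }
  return out;
}
```

In this model, `vectorize(widen(loads))` on the four i16 loads leaves two stray i16 loads next to one i32x2 load, whereas `widen(vectorize(loads))` leaves a single four-element load, matching the sequence described in the comment.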

Add test case and comment on why we need to run scalar load widening after LSV.

arsenm added inline comments.Jun 2 2020, 11:15 AM

llvm/test/CodeGen/AMDGPU/vectorize-loads.ll
26 ↗	(On Diff #267904)	s/widening/widened
32–33 ↗	(On Diff #267904)	Function name needs to be better. This is not merely a v4i16 vectorization, there's the constant widening to consider

Is this still necessary?

rampitec added inline comments.Oct 8 2020, 2:18 PM

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
796 ↗	(On Diff #267904)	It can simply go to GCNPassConfig::addPreISel().

Rebase and revise.

This patch is still required because an MMO's alignment is calculated from the offset relative to the base alignment. Since the base alignment comes from the pointer in the IR, it cannot be modified. We need extra logic to re-align the MMO if we widen the original load. For instance, given a 16-bit load from ptr with an alignment of 2, if ptr is equivalent to base - 2 and base's alignment is 4, we could widen that 16-bit load to a 32-bit load from ptr - 2 with an alignment of 4. But, as we cannot change the IR pointer in the MMO, we need extra logic so that the new MMO can assume that new alignment.
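To make the alignment arithmetic concrete, here is a small C++ sketch. It assumes, as this comment describes, that a memory operand's alignment is derived from a base alignment and a byte offset; `minAlign` and `ToyMemOperand` are hypothetical stand-ins, not LLVM's classes.

```cpp
#include <cstdint>

// Largest power of two that divides both values (lowest set bit of a | b).
uint64_t minAlign(uint64_t a, uint64_t b) {
  uint64_t bits = a | b;
  return bits & (~bits + 1);
}

// Hypothetical stand-in for a memory operand: the alignment of the IR pointer
// it is based on, plus a byte offset from that pointer.
struct ToyMemOperand {
  uint64_t baseAlign;
  int64_t offset;
  uint64_t align() const {
    return minAlign(baseAlign, static_cast<uint64_t>(offset));
  }
};
```

With the numbers from the comment, a widened load described relative to `ptr` (alignment 2, offset -2) is `ToyMemOperand{2, -2}` and reports an alignment of only 2. The same address described relative to `base` (alignment 4, offset -4) is `ToyMemOperand{4, -4}` and reports 4. That gap is the one the extra logic has to close.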

Harbormaster completed remote builds in B75920: Diff 299769.Oct 21 2020, 12:40 PM

LGTM in principle. We wanted to split CodeGenPrepare for a long time already. We also should drop widening from an early pass then.

llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
118 ↗	(On Diff #299769)	Please use "auto *" as tidy suggests.
146 ↗	(On Diff #299769)	Also "auto *".
166 ↗	(On Diff #299769)	Same here and in other places.

Fix coding style following clang-tidy.

LGTM, provided you are planning to remove widening from the early pass.

This revision is now accepted and ready to land.Oct 27 2020, 9:39 AM

Harbormaster completed remote builds in B76578: Diff 301018.Oct 27 2020, 9:54 AM

This revision was landed with ongoing or failed builds.Oct 27 2020, 11:08 AM

Closed by commit rG46c3d5cb05d6: [amdgpu] Add the late codegen preparation pass. (authored by hliao).

This revision was automatically updated to reflect the committed changes.

hliao added a commit: rG46c3d5cb05d6: [amdgpu] Add the late codegen preparation pass..

foad added a subscriber: foad.Nov 2 2020, 2:55 PM

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/SIISelLowering.cpp	113 lines
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dispatch.ptr.ll	16 lines

Diff 265435

llvm/lib/Target/AMDGPU/SIISelLowering.cpp

   case ISD::EXTLOAD:
     return DAG.getNode(ISD::ANY_EXTEND, SL, VT, Op);
   case ISD::NON_EXTLOAD:
     return Op;
   }

   llvm_unreachable("invalid ext type");
 }

+static bool isDWORDAligned(SelectionDAG &DAG, MachineFunction &MF, SDValue Op) {
+  if (Op.getOpcode() == ISD::CopyFromReg) {
+    MachineRegisterInfo &MRI = MF.getRegInfo();
+    SIMachineFunctionInfo *MFI = MF.getInfo<SIMachineFunctionInfo>();
+
+    Register VReg = cast<RegisterSDNode>(Op.getOperand(1))->getReg();
+    MCRegister Reg = MRI.getLiveInPhysReg(VReg);
+    if (!Reg.isValid())
+      return false;
+
+    // Certain preloaded registers are quaranteed to be DWORD aligned.
+    const ArgDescriptor *AD;
+    const TargetRegisterClass *RC;
+    std::tie(AD, RC) =
+        MFI->getPreloadedValue(AMDGPUFunctionArgInfo::DISPATCH_PTR);
+    if (AD && AD->isRegister() && AD->getRegister() == Reg)
+      return true;
+
+    return false;
+  }
+  KnownBits Known = DAG.computeKnownBits(Op);
+  return Known.countMinTrailingZeros() >= 2;
+}
+
 SDValue SITargetLowering::widenLoad(LoadSDNode *Ld, DAGCombinerInfo &DCI) const {
   SelectionDAG &DAG = DCI.DAG;
-  if (Ld->getAlignment() < 4 || Ld->isDivergent())
+  if (Ld->isDivergent())
     return SDValue();

   // FIXME: Constant loads should all be marked invariant.
   unsigned AS = Ld->getAddressSpace();
   if (AS != AMDGPUAS::CONSTANT_ADDRESS &&
       AS != AMDGPUAS::CONSTANT_ADDRESS_32BIT &&
       (AS != AMDGPUAS::GLOBAL_ADDRESS || !Ld->isInvariant()))
     return SDValue();

   // Don't do this early, since it may interfere with adjacent load merging for
   // illegal types. We can avoid losing alignment information for exotic types
   // pre-legalize.
   EVT MemVT = Ld->getMemoryVT();
   if ((MemVT.isSimple() && !DCI.isAfterLegalizeDAG()) ||
       MemVT.getSizeInBits() >= 32)
     return SDValue();

   SDLoc SL(Ld);

   assert((!MemVT.isVector() || Ld->getExtensionType() == ISD::NON_EXTLOAD) &&
          "unexpected vector extload");

-  // TODO: Drop only high part of range.
+  SDValue NewLoad, Cvt;
+  if (Ld->getAlign() < 4) {
+    // Special handling on non-DWORD aligned loads.
+    // So far, only handle scalar loads only.
+    if (MemVT.isVector())
+      return SDValue();
+    // Skip non-naturally aligned loads.
+    if (Ld->getAlign() < MemVT.getStoreSize())
+      return SDValue();
+    // FIXME: Support other types.
+    if (MemVT != MVT::i16)
+      return SDValue();
+    // For naturally aligned but not DWORD aligned load, try to widen if
+    // there's a constant offset from an aligned base.
+    SDValue Ptr = Ld->getBasePtr();
+    if (!DAG.isBaseWithConstantOffset(Ptr))
+      return SDValue();
+    SDValue BasePtr = Ptr.getOperand(0);
+    if (!isDWORDAligned(DAG, DAG.getMachineFunction(), BasePtr))
+      return SDValue();
+
+    EVT VT = Ptr.getValueType();
+    int64_t Offset = cast<ConstantSDNode>(Ptr.getOperand(1))->getSExtValue();
+    SDValue NewPtr = DAG.getNode(ISD::ADD, SL, VT, BasePtr,
+                                 DAG.getConstant(Offset - 2, SL, VT));
+    // Now, the new load is DWORD aligned.
+    // TODO: Drop only low part of range.
+    NewLoad = DAG.getLoad(ISD::UNINDEXED, ISD::NON_EXTLOAD, MVT::i32, SL,
+                          Ld->getChain(), NewPtr, Ld->getOffset(),
+                          Ld->getPointerInfo(), MVT::i32, Align(4),
+                          Ld->getMemOperand()->getFlags(), Ld->getAAInfo(),
+                          nullptr); // Drop ranges
+    // Extract the high bits.
+    Cvt = DAG.getNode(
+        Ld->getExtensionType() == ISD::SEXTLOAD ? ISD::SRA : ISD::SRL, SL,
+        MVT::i32, NewLoad, DAG.getShiftAmountConstant(16, MVT::i32, SL));
+  } else {
     SDValue Ptr = Ld->getBasePtr();
-  SDValue NewLoad = DAG.getLoad(ISD::UNINDEXED, ISD::NON_EXTLOAD,
-                                MVT::i32, SL, Ld->getChain(), Ptr,
-                                Ld->getOffset(),
-                                Ld->getPointerInfo(), MVT::i32,
-                                Ld->getAlignment(),
-                                Ld->getMemOperand()->getFlags(),
-                                Ld->getAAInfo(),
+    // TODO: Drop only high part of range.
+    NewLoad = DAG.getLoad(ISD::UNINDEXED, ISD::NON_EXTLOAD, MVT::i32, SL,
+                          Ld->getChain(), Ptr, Ld->getOffset(),
+                          Ld->getPointerInfo(), MVT::i32, Ld->getAlignment(),
+                          Ld->getMemOperand()->getFlags(), Ld->getAAInfo(),
                           nullptr); // Drop ranges

     EVT TruncVT = EVT::getIntegerVT(*DAG.getContext(), MemVT.getSizeInBits());
     if (MemVT.isFloatingPoint()) {
       assert(Ld->getExtensionType() == ISD::NON_EXTLOAD &&
              "unexpected fp extload");
       TruncVT = MemVT.changeTypeToInteger();
     }

-  SDValue Cvt = NewLoad;
     if (Ld->getExtensionType() == ISD::SEXTLOAD) {
       Cvt = DAG.getNode(ISD::SIGN_EXTEND_INREG, SL, MVT::i32, NewLoad,
                         DAG.getValueType(TruncVT));
     } else if (Ld->getExtensionType() == ISD::ZEXTLOAD ||
                Ld->getExtensionType() == ISD::NON_EXTLOAD) {
       Cvt = DAG.getZeroExtendInReg(NewLoad, SL, TruncVT);
     } else {
       assert(Ld->getExtensionType() == ISD::EXTLOAD);
+      Cvt = NewLoad;
+    }
   }

   EVT VT = Ld->getValueType(0);
   EVT IntVT = EVT::getIntegerVT(*DAG.getContext(), VT.getSizeInBits());

   DCI.AddToWorklist(Cvt.getNode());

   // We may need to handle exotic cases, such as i16->i64 extloads, so insert

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dispatch.ptr.ll

 ; RUN: llc -mtriple=amdgcn--amdhsa -mcpu=kaveri -mattr=-code-object-v3 -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s
 ; RUN: not llc -mtriple=amdgcn-unknown-unknown -mcpu=kaveri -verify-machineinstrs < %s 2>&1 | FileCheck -check-prefix=ERROR %s

 ; ERROR: in function test{{.*}}: unsupported hsa intrinsic without hsa target

 ; GCN-LABEL: {{^}}test:
 ; GCN: enable_sgpr_dispatch_ptr = 1
 ; GCN: s_load_dword s{{[0-9]+}}, s[4:5], 0x0
 define amdgpu_kernel void @test(i32 addrspace(1)* %out) {
   %dispatch_ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr() #0
   %header_ptr = bitcast i8 addrspace(4)* %dispatch_ptr to i32 addrspace(4)*
   %value = load i32, i32 addrspace(4)* %header_ptr
   store i32 %value, i32 addrspace(1)* %out
   ret void
 }

+; GCN-LABEL: {{^}}test2
+; GCN: enable_sgpr_dispatch_ptr = 1
+; GCN: s_load_dword s[[REG:[0-9]+]], s[4:5], 0x1
+; GCN: s_lshr_b32 s{{[0-9]+}}, s[[REG]], 16
+; GCN-NOT: load_ushort
+; GCN: s_endpgm
+define amdgpu_kernel void @test2(i32 addrspace(1)* %out) {
+  %dispatch_ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr() #0
+  %d1 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 6
+  %h1 = bitcast i8 addrspace(4)* %d1 to i16 addrspace(4)*
+  %v1 = load i16, i16 addrspace(4)* %h1
+  %e1 = zext i16 %v1 to i32
+  store i32 %e1, i32 addrspace(1)* %out
+  ret void
+}
+
 declare noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr() #0

 attributes #0 = { readnone }
