This is an archive of the discontinued LLVM Phabricator instance.

Differential D80364

[amdgpu] Teach load widening to handle non-DWORD aligned loads.
ClosedPublic

Authored by hliao on May 20 2020, 11:46 PM.

Details

Reviewers

rampitec
arsenm

Commits

rG46c3d5cb05d6: [amdgpu] Add the late codegen preparation pass.

Summary

If a load is naturally aligned but not DWORD aligned, load widening should handle the case where the address is the sum of a base pointer and a constant offset. Such a load can be widened by aligning the address down to a DWORD boundary and extracting the narrow value from the high part of the wider load.
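The effect of the widening in the summary can be modelled on a plain byte buffer. The following is only a sketch: `loadLE` and `loadViaWidening` are hypothetical helper names, the buffer stands in for memory behind a DWORD-aligned base pointer, and little-endian byte order is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Reads NumBytes (1..4) bytes at byte offset Off as a little-endian
// unsigned integer. Hypothetical helper, not an LLVM API.
uint32_t loadLE(const uint8_t *Mem, size_t Off, unsigned NumBytes) {
  uint32_t V = 0;
  for (unsigned I = 0; I < NumBytes; ++I)
    V |= static_cast<uint32_t>(Mem[Off + I]) << (8 * I);
  return V;
}

// Emulates the widened form of a naturally aligned sub-DWORD load.
// Assumption: Mem itself plays the role of the DWORD-aligned base, so
// (Off & 3) is the distance from the enclosing DWORD.
uint32_t loadViaWidening(const uint8_t *Mem, size_t Off, unsigned NumBytes) {
  size_t Adjust = Off & 0x3;
  uint32_t Wide = loadLE(Mem, Off - Adjust, 4); // aligned 32-bit load
  uint32_t Shifted = Wide >> (Adjust * 8);      // move the value to bit 0
  uint32_t Mask =
      NumBytes >= 4 ? 0xFFFFFFFFu : ((1u << (NumBytes * 8)) - 1u);
  return Shifted & Mask;                        // truncate to the narrow type
}
```

For a 2-byte load at offset 6, the sketch reads the DWORD at offset 4 and shifts it right by 16 bits, which matches a direct 2-byte read at offset 6.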

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hliao created this revision. May 20 2020, 11:46 PM

Herald added a project: Restricted Project. May 20 2020, 11:46 PM

Herald added subscribers: llvm-commits, kerbowa, hiraditya and 8 others.

Harbormaster completed remote builds in B57496: Diff 265435. May 21 2020, 1:36 AM

arsenm added inline comments. May 21 2020, 7:11 AM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7469 ↗	(On Diff #265435)	Typo quaranteed
7472–7475 ↗	(On Diff #265435)	This will be true for all of the preloaded SGPRs. However I think looking for the argument copies here is the wrong approach. The original intrinsic calls should have been annotated with the align return value attribute. Currently this is a burden on the frontend/library call emitting the intrinsic. We could either annotate the intrinsic calls in one of the later passes (maybe AMDGPUCodeGenPrepare), or add a minimum alignment to the intrinsic definition which will always be applied, similar to how intrinsic declarations already get their other attributes
7509–7550 ↗	(On Diff #265435)	We already have essentially the same code in lowerKernargMemParameter (plus an IR version in AMDGPULowerKernelArguments). This just generalizes to any isBaseWithConstantOffset. Can you refactor these to avoid the duplication?
7535 ↗	(On Diff #265435)	Can you use one of the simpler getLoad overloads? We don't really ever need to mention unindexed
7537 ↗	(On Diff #265435)	I think you could have a better alignment than 4 here
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dispatch.ptr.ll
32	Should also have some tests with multiple loads. You should make sure that a use of the base address load, and a use of the offset address both CSE into a single load with 1 shift. Also could test with an explicit align attribute on the dispatch ptr call

hliao marked an inline comment as done. May 21 2020, 11:26 AM

hliao added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475 ↗	(On Diff #265435)	do we have that attribute available? could you elaborate in detail?

hliao marked an inline comment as done. May 21 2020, 11:45 AM

hliao added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475 ↗	(On Diff #265435)	OK, I found that.

hliao mentioned this in D80422: Enable `align <n>` to be used in intrinsic definitions. May 21 2020, 9:20 PM

hliao marked an inline comment as done. May 22 2020, 11:45 AM

hliao added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475 ↗	(On Diff #265435)	In another review, D80422, the relevant intrinsics are being annotated with an assumed alignment on the returned pointer. Unfortunately, in SelectionDAG, we don't have a facility to keep track of that hint. I will enhance the `AMDGPUCodeGenPrepare` pass to do something similar.

arsenm added inline comments. May 22 2020, 1:13 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475 ↗	(On Diff #265435)	We don't need to explicitly know the pointer alignment. If the intrinsic was properly annotated from the beginning (as would be the case after D80422), the downstream load users would have the correct alignment assigned. The optimizer propagates alignment information already

hliao marked an inline comment as done. May 22 2020, 2:01 PM

hliao added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475 ↗	(On Diff #265435)	That load won't be marked as DWORD aligned even after those intrinsics are annotated correctly, as it is simply not DWORD aligned. For example, the newly added test has an i16 load from the pointer (dispatch.ptr + 6), which is not DWORD aligned. The transformation performed here widens such a load if its pointer is calculated as a DWORD-aligned pointer (`base`) plus a constant offset (`off`). With that, (i16 load (add base, off)) is translated into (trunc (lshr (i32 load (add base, off - 2)), 16)). We need to know that base is DWORD aligned. However, in SDNode, we don't have any facility to retain the original alignment from LLVM IR.

arsenm added inline comments. May 22 2020, 2:37 PM

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475 ↗	(On Diff #265435)	is computeKnownBitsForTargetNode called for CopyFromReg? If so, we could move the argument check in there

hliao marked an inline comment as done. May 22 2020, 6:08 PM

hliao added inline comments.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
7472–7475 ↗	(On Diff #265435)	unfortunately NO

Rewrite such transformations in LLVM IR as the late codegen preparation.

Herald added a subscriber: mgorny. May 24 2020, 2:49 PM

hliao marked an inline comment as done. May 24 2020, 2:49 PM

Harbormaster failed remote builds in B57759: Diff 265945! May 24 2020, 3:59 PM

In D80364#2052781, @hliao wrote:

Rewrite such transformations in LLVM IR as the late codegen preparation.

this change depends on D80422 to add align attribute properly on relevant intrinsics.

Rebase onto the latest trunk.

I'd still like to find a way to avoid a whole extra pass run for this. In the test here, the LoadStoreVectorizer should have vectorized these? Why didn't it?

llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
33–37	If a separate pass is going to do the load widening, you can remove it from AMDGPUCodeGenPrepare
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dispatch.ptr.ll
33	Missing align?

Harbormaster failed remote builds in B58107: Diff 266655! May 28 2020, 3:45 AM

In D80364#2058794, @arsenm wrote:

I'd still like to find a way to avoid a whole extra pass run for this. In the test here, the LoadStoreVectorizer should have vectorized these? Why didn't it?

That's due to the misaligned load after coalescing. This example is written intentionally to skip LSV and verify that common widened load could be CSEd within DAG. In practice, there won't be such input as the 1st i16 load should be properly annotated with align 4.

hliao marked 2 inline comments as done. May 28 2020, 8:47 AM

hliao added inline comments.

llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
33–37	I could clean that after this.
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dispatch.ptr.ll
33	that align attribute will be added implicitly if there's no explicit overriding.

In D80364#2060318, @hliao wrote:

In D80364#2058794, @arsenm wrote:

I'd still like to find a way to avoid a whole extra pass run for this. In the test here, the LoadStoreVectorizer should have vectorized these? Why didn't it?

That's due to the misaligned load after coalescing. This example is written intentionally to skip LSV and verify that common widened load could be CSEd within DAG. In practice, there won't be such input as the 1st i16 load should be properly annotated with align 4.

But the load isn't misaligned. The point of adding the alignment to the intrinsic declaration was so that the whole optimization pipeline would know about the alignment. Taking this example:

after opt -instcombine:

define amdgpu_kernel void @test3(i32 addrspace(1)* nocapture %out) local_unnamed_addr #0 {
  %dispatch_ptr = tail call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()
  %d1 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 4
  %h1 = bitcast i8 addrspace(4)* %d1 to i16 addrspace(4)*
  %v1 = load i16, i16 addrspace(4)* %h1, align 4
  %e1 = zext i16 %v1 to i32
  %d2 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 6
  %h2 = bitcast i8 addrspace(4)* %d2 to i16 addrspace(4)*
  %v2 = load i16, i16 addrspace(4)* %h2, align 2
  %e2 = zext i16 %v2 to i32
  %o = add nuw nsw i32 %e2, %e1
  store i32 %o, i32 addrspace(1)* %out, align 4
  ret void
}

Now the load's alignment was correctly inferred to be higher, so then the vectorizer handles this as expected:

opt -instcombine -load-store-vectorizer -instsimplify gives:

define amdgpu_kernel void @test3(i32 addrspace(1)* nocapture %out) local_unnamed_addr #0 {
  %dispatch_ptr = tail call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()
  %d1 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 4
  %h1 = bitcast i8 addrspace(4)* %d1 to i16 addrspace(4)*
  %1 = bitcast i16 addrspace(4)* %h1 to <2 x i16> addrspace(4)*
  %2 = load <2 x i16>, <2 x i16> addrspace(4)* %1, align 4
  %v11 = extractelement <2 x i16> %2, i32 0
  %v22 = extractelement <2 x i16> %2, i32 1
  %e1 = zext i16 %v11 to i32
  %e2 = zext i16 %v22 to i32
  %o = add nuw nsw i32 %e2, %e1
  store i32 %o, i32 addrspace(1)* %out, align 4
  ret void
}

So I think with the alignment patch, for any real world example, this would do the right thing. Your example fails here because the IR wasn't pre-optimized. We don't need to expect perfect optimization from the backend for any random IR and need to consider the entire pass pipeline

In D80364#2061176, @arsenm wrote:

In D80364#2060318, @hliao wrote:

In D80364#2058794, @arsenm wrote:

I'd still like to find a way to avoid a whole extra pass run for this. In the test here, the LoadStoreVectorizer should have vectorized these? Why didn't it?

That's due to the misaligned load after coalescing. This example is written intentionally to skip LSV and verify that common widened load could be CSEd within DAG. In practice, there won't be such input as the 1st i16 load should be properly annotated with align 4.

But the load isn't misaligned. The point of adding the alignment to the intrinsic declaration was so that the whole optimization pipeline would know about the alignment. Taking this example:

after opt -instcombine:

but we run llc only in this test and skip all middle-end optimizations. That's why I said this test is impractical in the normal scenario where middle-end is always run. This test is added to address your previous concern on CSE of widened loads. If not required, I could remove this test.

In D80364#2061226, @hliao wrote:

In D80364#2061176, @arsenm wrote:

In D80364#2060318, @hliao wrote:

In D80364#2058794, @arsenm wrote:

I'd still like to find a way to avoid a whole extra pass run for this. In the test here, the LoadStoreVectorizer should have vectorized these? Why didn't it?

That's due to the misaligned load after coalescing. This example is written intentionally to skip LSV and verify that common widened load could be CSEd within DAG. In practice, there won't be such input as the 1st i16 load should be properly annotated with align 4.

But the load isn't misaligned. The point of adding the alignment to the intrinsic declaration was so that the whole optimization pipeline would know about the alignment. Taking this example:

after opt -instcombine:

but we run llc only in this test and skip all middle-end optimizations. That's why I said this test is impractical in the normal scenario where middle-end is always run. This test is added to address your previous concern on CSE of widened loads. If not required, I could remove this test.

This was more of a concern when letting the DAG handle it. Since you added the implied align attribute, I don't think anything else should be needed

Remove an impractical test.

In D80364#2061379, @hliao wrote:

Remove an impractical test.

I mean beyond the test, I think the whole patch is unnecessary now. Do you have an example of real source that still needs this?

Harbormaster completed remote builds in B58107: Diff 266655. May 28 2020, 6:43 PM

Harbormaster completed remote builds in B58322: Diff 267029. May 28 2020, 7:47 PM

In D80364#2061467, @arsenm wrote:

In D80364#2061379, @hliao wrote:

Remove an impractical test.

I mean beyond the test, I think the whole patch is unnecessary now. Do you have an example of real source that still needs this?

test2, that's a real example. The widening done in this patch is not supported in any code in LLVM. The load, by itself, is not DWORD aligned but only naturally aligned no matter whether alignment marking is correct or not.

I did some experiments locally and think this can stay in AMDGPUCodeGenPrepare, and doesn't need the split pass. Since you restrict this widening to the case where you're rebasing the load anyway, I don't think this will cause the same problems with the vectorizer the previous IR load widening had (and may help it even?)

test3 should also come back, but should have the explicit align 4 added to the load. This could also use some loads of i8, and <2 x i8>. We could also extend this to handle wider, sub-dword aligned types but that's a separate patch.

In D80364#2063603, @arsenm wrote:

I did some experiments locally and think this can stay in AMDGPUCodeGenPrepare, and doesn't need the split pass. Since you restrict this widening to the case where you're rebasing the load anyway, I don't think this will cause the same problems with the vectorizer the previous IR load widening had (and may help it even?)

test3 should also come back, but should have the explicit align 4 added to the load. This could also use some loads of i8, and <2 x i8>. We could also extend this to handle wider, sub-dword aligned types but that's a separate patch.

Scalar load widening should run after LSV to avoid generating redundant loads. Cases like a sequence of consecutive i16 loads benefit from this ordering. Here are the details.

For 4 loads of i16:

ld.i16 (ptr + 0)
ld.i16 (ptr + 2)
ld.i16 (ptr + 4)
ld.i16 (ptr + 6)

If we run scalar load widening before LSV, then after widening we have

ld.i16 (ptr + 0)
ld.i32 (ptr + 0)
ld.i16 (ptr + 4)
ld.i32 (ptr + 4)

After LSV, we have

ld.i16 (ptr + 0)
ld.i32x2 (ptr + 0)
ld.i16 (ptr + 4)

Those two i16 loads are redundant. If we run scalar load widening after LSV, we won't have that result.

Add test case and comment on why we need to run scalar load widening after LSV.

arsenm added inline comments. Jun 2 2020, 11:15 AM

llvm/test/CodeGen/AMDGPU/vectorize-loads.ll
26	s/widening/widened
32–33	Function name needs to be better. This is not merely a v4i16 vectorization, there's the constant widening to consider

Is this still necessary?

rampitec added inline comments. Oct 8 2020, 2:18 PM

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
801	It can simply go to GCNPassConfig::addPreISel().

Rebase and revise.

This patch is still required, as an MMO's alignment is calculated from the base alignment and the offset. As the base alignment is the alignment of the pointer in the IR, it cannot be modified. We need extra logic to re-align the MMO if we widen the original load. For instance, if a 16-bit load from ptr has an alignment of 2, ptr is equivalent to base - 2, and base's alignment is 4, we could widen that 16-bit load to a 32-bit load from ptr - 2 with an alignment of 4. But, as we cannot change the IR value in the MMO, we need extra logic so that the new MMO could assume that new alignment.
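The alignment rule referred to above can be sketched in isolation. This is a minimal model, assuming the alignment of an access is the largest power of two dividing both the base alignment and the byte offset; `alignmentAtOffset` is a hypothetical name used only for illustration, not an LLVM function.

```cpp
#include <cassert>
#include <cstdint>

// Largest power of two that divides both BaseAlign (itself a power of two)
// and Offset. Negative offsets are handled through their two's-complement
// bit pattern. Hypothetical helper for illustration only.
uint64_t alignmentAtOffset(uint64_t BaseAlign, int64_t Offset) {
  uint64_t Bits = BaseAlign | static_cast<uint64_t>(Offset);
  return Bits & (~Bits + 1); // isolate the lowest set bit
}
```

With a base alignment of 4, an access at `base - 2` is only 2-byte aligned, while one at `base - 4` is 4-byte aligned. Starting instead from `ptr` with its IR alignment of 2, the widened access at `ptr - 2` still computes to an alignment of 2, which is the limitation the comment describes.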

Harbormaster completed remote builds in B75920: Diff 299769. Oct 21 2020, 12:40 PM

LGTM in principle. We wanted to split CodeGenPrepare for a long time already. We also should drop widening from an early pass then.

llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp
119	Please use "auto *" as tidy suggests.
147	Also "auto *".
167	Same here and in other places.

Fix coding style following clang-tidy.

LGTM, provided you are planning to remove widening from the early pass.

This revision is now accepted and ready to land. Oct 27 2020, 9:39 AM

Harbormaster completed remote builds in B76578: Diff 301018. Oct 27 2020, 9:54 AM

This revision was landed with ongoing or failed builds. Oct 27 2020, 11:08 AM

Closed by commit rG46c3d5cb05d6: [amdgpu] Add the late codegen preparation pass. (authored by hliao).

This revision was automatically updated to reflect the committed changes.

hliao added a commit: rG46c3d5cb05d6: [amdgpu] Add the late codegen preparation pass.

foad added a subscriber: foad. Nov 2 2020, 2:55 PM

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPU.h	4 lines
llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp	198 lines
llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp	2 lines
llvm/lib/Target/AMDGPU/CMakeLists.txt	1 line
llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dispatch.ptr.ll	16 lines
llvm/test/CodeGen/AMDGPU/vectorize-loads.ll	31 lines

Diff 301058

llvm/lib/Target/AMDGPU/AMDGPU.h

[...]
 FunctionPass *createSIInsertWaitcntsPass();
 FunctionPass *createSIPreAllocateWWMRegsPass();
 FunctionPass *createSIFormMemoryClausesPass();

 FunctionPass *createSIPostRABundlerPass();
 FunctionPass *createAMDGPUSimplifyLibCallsPass(const TargetMachine *);
 FunctionPass *createAMDGPUUseNativeCallsPass();
 FunctionPass *createAMDGPUCodeGenPreparePass();
+FunctionPass *createAMDGPULateCodeGenPreparePass();
 FunctionPass *createAMDGPUMachineCFGStructurizerPass();
 FunctionPass *createAMDGPUPropagateAttributesEarlyPass(const TargetMachine *);
 ModulePass *createAMDGPUPropagateAttributesLatePass(const TargetMachine *);
 FunctionPass *createAMDGPURewriteOutArgumentsPass();
 FunctionPass *createSIModeRegisterPass();

 void initializeAMDGPUDAGToDAGISelPass(PassRegistry&);

[...]
 extern char &SIOptimizeExecMaskingPreRAID;

 void initializeAMDGPUAnnotateUniformValuesPass(PassRegistry&);
 extern char &AMDGPUAnnotateUniformValuesPassID;

 void initializeAMDGPUCodeGenPreparePass(PassRegistry&);
 extern char &AMDGPUCodeGenPrepareID;

+void initializeAMDGPULateCodeGenPreparePass(PassRegistry &);
+extern char &AMDGPULateCodeGenPrepareID;
+
 void initializeSIAnnotateControlFlowPass(PassRegistry&);
 extern char &SIAnnotateControlFlowPassID;

 void initializeSIMemoryLegalizerPass(PassRegistry&);
 extern char &SIMemoryLegalizerID;

 void initializeSIModeRegisterPass(PassRegistry&);
 extern char &SIModeRegisterID;
[...]

llvm/lib/Target/AMDGPU/AMDGPULateCodeGenPrepare.cpp

This file was added.

				//===-- AMDGPUCodeGenPrepare.cpp ------------------------------------------===//
				//
				// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
				// See https://llvm.org/LICENSE.txt for license information.
				// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
				//
				//===----------------------------------------------------------------------===//
				//
				/// \file
				/// This pass does misc. AMDGPU optimizations on IR just before instruction
				/// selection.
				//
				//===----------------------------------------------------------------------===//

				#include "AMDGPU.h"
				#include "llvm/Analysis/AssumptionCache.h"
				#include "llvm/Analysis/LegacyDivergenceAnalysis.h"
				#include "llvm/Analysis/ValueTracking.h"
				#include "llvm/CodeGen/Passes.h"
				#include "llvm/IR/IRBuilder.h"
				#include "llvm/IR/InstVisitor.h"
				#include "llvm/InitializePasses.h"
				#include "llvm/Support/CommandLine.h"
				#include "llvm/Support/KnownBits.h"
				#include "llvm/Transforms/Utils/Local.h"
				#include <cassert>
				#include <iterator>

				#define DEBUG_TYPE "amdgpu-late-codegenprepare"

				using namespace llvm;

				// Scalar load widening needs running after load-store-vectorizer as that pass
				// doesn't handle overlapping cases. In addition, this pass enhances the
				// widening to handle cases where scalar sub-dword loads are naturally aligned
				// only but not dword aligned.
				static cl::opt<bool>
				WidenLoads("amdgpu-late-codegenprepare-widen-constant-loads",
				cl::desc("Widen sub-dword constant address space loads in "
				"AMDGPULateCodeGenPrepare"),
				cl::ReallyHidden, cl::init(true));

				namespace {

				class AMDGPULateCodeGenPrepare
				: public FunctionPass,
				public InstVisitor<AMDGPULateCodeGenPrepare, bool> {
				Module *Mod = nullptr;
				const DataLayout *DL = nullptr;

				AssumptionCache *AC = nullptr;
				LegacyDivergenceAnalysis *DA = nullptr;

				public:
				static char ID;

				AMDGPULateCodeGenPrepare() : FunctionPass(ID) {}

				StringRef getPassName() const override {
				return "AMDGPU IR late optimizations";
				}

				void getAnalysisUsage(AnalysisUsage &AU) const override {
				AU.addRequired<AssumptionCacheTracker>();
				AU.addRequired<LegacyDivergenceAnalysis>();
				AU.setPreservesAll();
				}

				bool doInitialization(Module &M) override;
				bool runOnFunction(Function &F) override;

				bool visitInstruction(Instruction &) { return false; }

				// Check if the specified value is at least DWORD aligned.
				bool isDWORDAligned(const Value *V) const {
				KnownBits Known = computeKnownBits(V, *DL, 0, AC);
				return Known.countMinTrailingZeros() >= 2;
				}

				bool canWidenScalarExtLoad(LoadInst &LI) const;
				bool visitLoadInst(LoadInst &LI);
				};

				} // end anonymous namespace

				bool AMDGPULateCodeGenPrepare::doInitialization(Module &M) {
				Mod = &M;
				DL = &Mod->getDataLayout();
				return false;
				}

				bool AMDGPULateCodeGenPrepare::runOnFunction(Function &F) {
				if (skipFunction(F))
				return false;

				AC = &getAnalysis<AssumptionCacheTracker>().getAssumptionCache(F);
				DA = &getAnalysis<LegacyDivergenceAnalysis>();

				bool Changed = false;
				for (auto &BB : F)
				for (auto BI = BB.begin(), BE = BB.end(); BI != BE; /*EMPTY*/) {
				Instruction *I = &*BI++;
				Changed |= visit(*I);
				}

				return Changed;
				}

				bool AMDGPULateCodeGenPrepare::canWidenScalarExtLoad(LoadInst &LI) const {
				unsigned AS = LI.getPointerAddressSpace();
				// Skip non-constant address space.
				if (AS != AMDGPUAS::CONSTANT_ADDRESS &&
				AS != AMDGPUAS::CONSTANT_ADDRESS_32BIT)
				return false;
				// Skip non-simple loads.
				if (!LI.isSimple())
				return false;
				auto *Ty = LI.getType();
				// Skip aggregate types.
				if (Ty->isAggregateType())
				return false;
				unsigned TySize = DL->getTypeStoreSize(Ty);
				// Only handle sub-DWORD loads.
				if (TySize >= 4)
				return false;
				// That load must be at least naturally aligned.
				if (LI.getAlign() < DL->getABITypeAlign(Ty))
				return false;
				// It should be uniform, i.e. a scalar load.
				return DA->isUniform(&LI);
				}

				bool AMDGPULateCodeGenPrepare::visitLoadInst(LoadInst &LI) {
				if (!WidenLoads)
				return false;

				// Skip if that load is already aligned on DWORD at least as it's handled in
				// SDAG.
				if (LI.getAlign() >= 4)
				return false;

				if (!canWidenScalarExtLoad(LI))
				return false;

				int64_t Offset = 0;
				auto *Base =
				GetPointerBaseWithConstantOffset(LI.getPointerOperand(), Offset, *DL);
				// If that base is not DWORD aligned, it's not safe to perform the following
				// transforms.
				if (!isDWORDAligned(Base))
				return false;

				int64_t Adjust = Offset & 0x3;
				if (Adjust == 0) {
				// With a zero adjust, the original alignment could be promoted with a
				// better one.
				LI.setAlignment(Align(4));
				return true;
				}

				IRBuilder<> IRB(&LI);
				IRB.SetCurrentDebugLocation(LI.getDebugLoc());

				unsigned AS = LI.getPointerAddressSpace();
				unsigned LdBits = DL->getTypeStoreSize(LI.getType()) * 8;
				auto IntNTy = Type::getIntNTy(LI.getContext(), LdBits);

				PointerType *Int32PtrTy = Type::getInt32PtrTy(LI.getContext(), AS);
				PointerType *Int8PtrTy = Type::getInt8PtrTy(LI.getContext(), AS);
				auto *NewPtr = IRB.CreateBitCast(
				IRB.CreateConstGEP1_64(IRB.CreateBitCast(Base, Int8PtrTy),
				Offset - Adjust),
				Int32PtrTy);
				LoadInst *NewLd = IRB.CreateAlignedLoad(NewPtr, Align(4));
				NewLd->copyMetadata(LI);
				NewLd->setMetadata(LLVMContext::MD_range, nullptr);

				unsigned ShAmt = Adjust * 8;
				auto *NewVal = IRB.CreateBitCast(
				IRB.CreateTrunc(IRB.CreateLShr(NewLd, ShAmt), IntNTy), LI.getType());
				LI.replaceAllUsesWith(NewVal);
				RecursivelyDeleteTriviallyDeadInstructions(&LI);

				return true;
				}

				INITIALIZE_PASS_BEGIN(AMDGPULateCodeGenPrepare, DEBUG_TYPE,
				"AMDGPU IR late optimizations", false, false)
				INITIALIZE_PASS_DEPENDENCY(AssumptionCacheTracker)
				INITIALIZE_PASS_DEPENDENCY(LegacyDivergenceAnalysis)
				INITIALIZE_PASS_END(AMDGPULateCodeGenPrepare, DEBUG_TYPE,
				"AMDGPU IR late optimizations", false, false)

				char AMDGPULateCodeGenPrepare::ID = 0;

				FunctionPass *llvm::createAMDGPULateCodeGenPreparePass() {
				return new AMDGPULateCodeGenPrepare();
				}
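The decision flow of `visitLoadInst` in the file above can be condensed into a pure function for experimentation. This is a sketch under stated assumptions: `Action`, `WidenPlan` and `planWidening` are hypothetical names, and the address-space, `isSimple()`, aggregate-type and uniformity checks of the pass are deliberately left out.

```cpp
#include <cassert>
#include <cstdint>

// Possible outcomes for a single load in this simplified model.
enum class Action { Skip, PromoteAlign, Widen };

struct WidenPlan {
  Action Kind;
  int64_t NewOffset;  // byte offset of the 32-bit load from the base (Widen)
  unsigned ShiftBits; // right shift applied before truncation (Widen)
};

// Hypothetical model of the checks performed before widening.
WidenPlan planWidening(uint64_t LoadAlign, uint64_t TySize,
                       uint64_t NaturalAlign, bool BaseIsDWORDAligned,
                       int64_t Offset) {
  if (LoadAlign >= 4) // already DWORD aligned, left to SelectionDAG
    return {Action::Skip, 0, 0};
  if (TySize >= 4) // only sub-DWORD loads are handled
    return {Action::Skip, 0, 0};
  if (LoadAlign < NaturalAlign) // must be at least naturally aligned
    return {Action::Skip, 0, 0};
  if (!BaseIsDWORDAligned) // the rebased address would not be aligned
    return {Action::Skip, 0, 0};
  int64_t Adjust = Offset & 0x3;
  if (Adjust == 0) // the load only needs a better alignment
    return {Action::PromoteAlign, Offset, 0};
  return {Action::Widen, Offset - Adjust, static_cast<unsigned>(Adjust * 8)};
}
```

For the i16 load at offset 6 discussed in the review, the model selects a 32-bit load at offset 4 followed by a 16-bit shift.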

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

Show First 20 Lines • Show All 230 Lines • ▼ Show 20 Lines	extern "C" LLVM_EXTERNAL_VISIBILITY void LLVMInitializeAMDGPUTarget() {
initializeAMDGPULowerKernelAttributesPass(*PR);		initializeAMDGPULowerKernelAttributesPass(*PR);
initializeAMDGPULowerIntrinsicsPass(*PR);		initializeAMDGPULowerIntrinsicsPass(*PR);
initializeAMDGPUOpenCLEnqueuedBlockLoweringPass(*PR);		initializeAMDGPUOpenCLEnqueuedBlockLoweringPass(*PR);
initializeAMDGPUPostLegalizerCombinerPass(*PR);		initializeAMDGPUPostLegalizerCombinerPass(*PR);
initializeAMDGPUPreLegalizerCombinerPass(*PR);		initializeAMDGPUPreLegalizerCombinerPass(*PR);
initializeAMDGPUPromoteAllocaPass(*PR);		initializeAMDGPUPromoteAllocaPass(*PR);
initializeAMDGPUPromoteAllocaToVectorPass(*PR);		initializeAMDGPUPromoteAllocaToVectorPass(*PR);
initializeAMDGPUCodeGenPreparePass(*PR);		initializeAMDGPUCodeGenPreparePass(*PR);
		initializeAMDGPULateCodeGenPreparePass(*PR);
initializeAMDGPUPropagateAttributesEarlyPass(*PR);		initializeAMDGPUPropagateAttributesEarlyPass(*PR);
initializeAMDGPUPropagateAttributesLatePass(*PR);		initializeAMDGPUPropagateAttributesLatePass(*PR);
initializeAMDGPURewriteOutArgumentsPass(*PR);		initializeAMDGPURewriteOutArgumentsPass(*PR);
initializeAMDGPUUnifyMetadataPass(*PR);		initializeAMDGPUUnifyMetadataPass(*PR);
initializeSIAnnotateControlFlowPass(*PR);		initializeSIAnnotateControlFlowPass(*PR);
initializeSIInsertHardClausesPass(*PR);		initializeSIInsertHardClausesPass(*PR);
initializeSIInsertWaitcntsPass(*PR);		initializeSIInsertWaitcntsPass(*PR);
initializeSIModeRegisterPass(*PR);		initializeSIModeRegisterPass(*PR);
▲ Show 20 Lines • Show All 545 Lines • ▼ Show 20 Lines	void AMDGPUPassConfig::addCodeGenPrepare() {
// here seems better that these blocks would get cleaned up by		// here seems better that these blocks would get cleaned up by
// UnreachableBlockElim inserted next in the pass flow.		// UnreachableBlockElim inserted next in the pass flow.
addPass(createLowerSwitchPass());		addPass(createLowerSwitchPass());
}		}

bool AMDGPUPassConfig::addPreISel() {		bool AMDGPUPassConfig::addPreISel() {
addPass(createFlattenCFGPass());		addPass(createFlattenCFGPass());
return false;		return false;
}		}
		rampitecUnsubmitted Done Reply Inline Actions It can simply to to GCNPassConfig::addPreISel(). rampitec: It can simply to to GCNPassConfig::addPreISel().

bool AMDGPUPassConfig::addInstSelector() {		bool AMDGPUPassConfig::addInstSelector() {
// Defer the verifier until FinalizeISel.		// Defer the verifier until FinalizeISel.
addPass(createAMDGPUISelDag(&getAMDGPUTargetMachine(), getOptLevel()), false);		addPass(createAMDGPUISelDag(&getAMDGPUTargetMachine(), getOptLevel()), false);
return false;		return false;
}		}

bool AMDGPUPassConfig::addGCPasses() {		bool AMDGPUPassConfig::addGCPasses() {
▲ Show 20 Lines • Show All 51 Lines • ▼ Show 20 Lines	ScheduleDAGInstrs *GCNPassConfig::createMachineScheduler(
if (ST.enableSIScheduler())		if (ST.enableSIScheduler())
return createSIMachineScheduler(C);		return createSIMachineScheduler(C);
return createGCNMaxOccupancyMachineScheduler(C);		return createGCNMaxOccupancyMachineScheduler(C);
}		}

bool GCNPassConfig::addPreISel() {		bool GCNPassConfig::addPreISel() {
AMDGPUPassConfig::addPreISel();		AMDGPUPassConfig::addPreISel();

		addPass(createAMDGPULateCodeGenPreparePass());
if (EnableAtomicOptimizations) {		if (EnableAtomicOptimizations) {
addPass(createAMDGPUAtomicOptimizerPass());		addPass(createAMDGPUAtomicOptimizerPass());
}		}

// FIXME: We need to run a pass to propagate the attributes when calls are		// FIXME: We need to run a pass to propagate the attributes when calls are
// supported.		// supported.

// Merge divergent exit nodes. StructurizeCFG won't recognize the multi-exit		// Merge divergent exit nodes. StructurizeCFG won't recognize the multi-exit
[338 lines not shown]

llvm/lib/Target/AMDGPU/CMakeLists.txt

[52 lines not shown]	add_llvm_target(AMDGPUCodeGen
AMDGPUFrameLowering.cpp		AMDGPUFrameLowering.cpp
AMDGPUHSAMetadataStreamer.cpp		AMDGPUHSAMetadataStreamer.cpp
AMDGPUInstCombineIntrinsic.cpp		AMDGPUInstCombineIntrinsic.cpp
AMDGPUInstrInfo.cpp		AMDGPUInstrInfo.cpp
AMDGPUInstructionSelector.cpp		AMDGPUInstructionSelector.cpp
AMDGPUISelDAGToDAG.cpp		AMDGPUISelDAGToDAG.cpp
AMDGPUISelLowering.cpp		AMDGPUISelLowering.cpp
AMDGPUGlobalISelUtils.cpp		AMDGPUGlobalISelUtils.cpp
		AMDGPULateCodeGenPrepare.cpp
AMDGPULegalizerInfo.cpp		AMDGPULegalizerInfo.cpp
AMDGPULibCalls.cpp		AMDGPULibCalls.cpp
AMDGPULibFunc.cpp		AMDGPULibFunc.cpp
AMDGPULowerIntrinsics.cpp		AMDGPULowerIntrinsics.cpp
AMDGPULowerKernelArguments.cpp		AMDGPULowerKernelArguments.cpp
AMDGPULowerKernelAttributes.cpp		AMDGPULowerKernelAttributes.cpp
AMDGPUMachineCFGStructurizer.cpp		AMDGPUMachineCFGStructurizer.cpp
AMDGPUMachineFunction.cpp		AMDGPUMachineFunction.cpp
[81 lines not shown]

llvm/test/CodeGen/AMDGPU/llvm.amdgcn.dispatch.ptr.ll

	; RUN: llc -mtriple=amdgcn--amdhsa --amdhsa-code-object-version=2 -mcpu=kaveri -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s			; RUN: llc -mtriple=amdgcn--amdhsa --amdhsa-code-object-version=2 -mcpu=kaveri -verify-machineinstrs < %s | FileCheck -check-prefix=GCN %s
	; RUN: not llc -mtriple=amdgcn-unknown-unknown -mcpu=kaveri -verify-machineinstrs < %s 2>&1 | FileCheck -check-prefix=ERROR %s			; RUN: not llc -mtriple=amdgcn-unknown-unknown -mcpu=kaveri -verify-machineinstrs < %s 2>&1 | FileCheck -check-prefix=ERROR %s

	; ERROR: in function test{{.*}}: unsupported hsa intrinsic without hsa target			; ERROR: in function test{{.*}}: unsupported hsa intrinsic without hsa target

	; GCN-LABEL: {{^}}test:			; GCN-LABEL: {{^}}test:
	; GCN: enable_sgpr_dispatch_ptr = 1			; GCN: enable_sgpr_dispatch_ptr = 1
	; GCN: s_load_dword s{{[0-9]+}}, s[4:5], 0x0			; GCN: s_load_dword s{{[0-9]+}}, s[4:5], 0x0
	define amdgpu_kernel void @test(i32 addrspace(1)* %out) {			define amdgpu_kernel void @test(i32 addrspace(1)* %out) {
	%dispatch_ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr() #0			%dispatch_ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr() #0
	%header_ptr = bitcast i8 addrspace(4)* %dispatch_ptr to i32 addrspace(4)*			%header_ptr = bitcast i8 addrspace(4)* %dispatch_ptr to i32 addrspace(4)*
	%value = load i32, i32 addrspace(4)* %header_ptr			%value = load i32, i32 addrspace(4)* %header_ptr
	store i32 %value, i32 addrspace(1)* %out			store i32 %value, i32 addrspace(1)* %out
	ret void			ret void
	}			}

				; GCN-LABEL: {{^}}test2
				; GCN: enable_sgpr_dispatch_ptr = 1
				; GCN: s_load_dword s[[REG:[0-9]+]], s[4:5], 0x1
				; GCN: s_lshr_b32 s{{[0-9]+}}, s[[REG]], 16
				; GCN-NOT: load_ushort
				; GCN: s_endpgm
				define amdgpu_kernel void @test2(i32 addrspace(1)* %out) {
				%dispatch_ptr = call noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr() #0
				%d1 = getelementptr inbounds i8, i8 addrspace(4)* %dispatch_ptr, i64 6
				%h1 = bitcast i8 addrspace(4)* %d1 to i16 addrspace(4)*
				%v1 = load i16, i16 addrspace(4)* %h1
				%e1 = zext i16 %v1 to i32
				store i32 %e1, i32 addrspace(1)* %out
				ret void
				}

				arsenm (Done): Should also have some tests with multiple loads. You should make sure that a use of the base address load and a use of the offset address both CSE into a single load with one shift. Could also test with an explicit align attribute on the dispatch ptr call.
	declare noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr() #0			declare noalias i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr() #0
				arsenm (Not Done): Missing align?
				hliao (Author, Done): That align attribute will be added implicitly if there is no explicit override.

	attributes #0 = { readnone }			attributes #0 = { readnone }
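
The `test2` case above loads an `i16` at byte offset 6 from the dispatch pointer and expects a single `s_load_dword` followed by a right shift of 16. The arithmetic behind that expectation can be sketched in plain C++ as follows. This is only an illustration under stated assumptions, not the code from the patch: `loadI16ViaWidenedDword` is a hypothetical helper name, the base is assumed to be DWORD aligned, and the host is assumed to be little-endian.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of the patch): reads a naturally aligned
// 16-bit value at byte offset `Offset` from a DWORD-aligned `Base` by
// loading the enclosing DWORD and shifting. Assumes a little-endian host.
inline uint16_t loadI16ViaWidenedDword(const unsigned char *Base,
                                       uint64_t Offset) {
  assert(Offset % 2 == 0 && "the narrow load must be naturally aligned");
  // Round the byte offset down to the enclosing DWORD boundary.
  uint64_t AlignedOffset = Offset & ~uint64_t(3);
  // The value sits in the low half (shift 0) or the high half (shift 16).
  unsigned ShiftBits = static_cast<unsigned>(Offset & 3) * 8;
  uint32_t Wide;
  std::memcpy(&Wide, Base + AlignedOffset, sizeof(Wide)); // widened load
  return static_cast<uint16_t>(Wide >> ShiftBits);        // shift + truncate
}
```

For byte offset 6 this gives an aligned offset of 4 and a shift of 16, which matches the `s_lshr_b32 ..., 16` CHECK line. It is also consistent with the `0x1` in the `s_load_dword` CHECK line if that immediate is counted in DWORDs on the tested subtarget, which is an assumption here.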

llvm/test/CodeGen/AMDGPU/vectorize-loads.ll

[16 lines not shown]	entry:
%gep_y.cast = bitcast i8 addrspace(4)* %gep_y to i16 addrspace(4)*		%gep_y.cast = bitcast i8 addrspace(4)* %gep_y to i16 addrspace(4)*
%id_y = load i16, i16 addrspace(4)* %gep_y.cast, align 2, !invariant.load !0 ; load workgroup size y		%id_y = load i16, i16 addrspace(4)* %gep_y.cast, align 2, !invariant.load !0 ; load workgroup size y
%add = add nuw nsw i16 %id_y, %id_x		%add = add nuw nsw i16 %id_y, %id_x
%conv = zext i16 %add to i32		%conv = zext i16 %add to i32
store i32 %conv, i32 addrspace(1)* %out, align 4		store i32 %conv, i32 addrspace(1)* %out, align 4
ret void		ret void
}		}

		; A little more complicated case where more sub-dword loads could be coalesced
		; if they are not widening earlier.
		arsenm (Not Done): s/widening/widened
		; GCN-LABEL: {{^}}load_4i16:
		; GCN: s_load_dwordx2 s{{\[}}[[D0:[0-9]+]]:[[D1:[0-9]+]]{{\]}}, s[4:5], 0x4
		; GCN-NOT: s_load_dword {{s[0-9]+}}, s[4:5], 0x4
		; GCN-DAG: s_lshr_b32 s{{[0-9]+}}, s[[D0]], 16
		; GCN-DAG: s_lshr_b32 s{{[0-9]+}}, s[[D1]], 16
		; GCN: s_endpgm
		define protected amdgpu_kernel void @load_4i16(i32 addrspace(1)* %out) {
		arsenm (Not Done): The function name needs to be better. This is not merely a v4i16 vectorization; there's the constant widening to consider.
		entry:
		%disp = tail call align 4 dereferenceable(64) i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()
		%gep_x = getelementptr i8, i8 addrspace(4)* %disp, i64 4
		%gep_x.cast = bitcast i8 addrspace(4)* %gep_x to i16 addrspace(4)*
		%id_x = load i16, i16 addrspace(4)* %gep_x.cast, align 4, !invariant.load !0 ; load workgroup size x
		%gep_y = getelementptr i8, i8 addrspace(4)* %disp, i64 6
		%gep_y.cast = bitcast i8 addrspace(4)* %gep_y to i16 addrspace(4)*
		%id_y = load i16, i16 addrspace(4)* %gep_y.cast, align 2, !invariant.load !0 ; load workgroup size y
		%gep_z = getelementptr i8, i8 addrspace(4)* %disp, i64 8
		%gep_z.cast = bitcast i8 addrspace(4)* %gep_z to i16 addrspace(4)*
		%id_z = load i16, i16 addrspace(4)* %gep_z.cast, align 4, !invariant.load !0 ; load workgroup size z
		%gep_w = getelementptr i8, i8 addrspace(4)* %disp, i64 10
		%gep_w.cast = bitcast i8 addrspace(4)* %gep_w to i16 addrspace(4)*
		%id_w = load i16, i16 addrspace(4)* %gep_w.cast, align 2, !invariant.load !0 ; load the 16-bit field at byte offset 10
		%add = add nuw nsw i16 %id_y, %id_x
		%add2 = add nuw nsw i16 %id_z, %id_w
		%add3 = add nuw nsw i16 %add, %add2
		%conv = zext i16 %add3 to i32
		store i32 %conv, i32 addrspace(1)* %out, align 4
		ret void
		}

declare i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()		declare i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()

!0 = !{!0}		!0 = !{!0}
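
The `load_4i16` test above reads four `i16` values at byte offsets 4, 6, 8 and 10 and expects them to be served by one `s_load_dwordx2` plus two shifts. A rough way to see why: once each narrow load is rounded down to its enclosing DWORD, only two distinct, adjacent DWORDs remain. The sketch below illustrates that counting argument only; `widenedDwordOffsets` is a hypothetical name, and nothing here reflects how the pass or the load-store vectorizer is actually implemented.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical illustration (not from the patch): returns the distinct
// DWORD-aligned byte offsets needed to cover a list of naturally aligned
// 16-bit loads, given as byte offsets from a DWORD-aligned base.
inline std::set<uint64_t>
widenedDwordOffsets(const std::vector<uint64_t> &ByteOffsets) {
  std::set<uint64_t> Aligned;
  for (uint64_t Off : ByteOffsets) {
    assert(Off % 2 == 0 && "each narrow load must be naturally aligned");
    Aligned.insert(Off & ~uint64_t(3)); // round down to the DWORD boundary
  }
  return Aligned;
}
```

For the offsets in `load_4i16` this yields the two DWORDs at byte offsets 4 and 8. The values at offsets 6 and 10 sit in the high halves of those DWORDs, which corresponds to the two `s_lshr_b32 ..., 16` CHECK lines.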

Diff 301058
