This is an archive of the discontinued LLVM Phabricator instance.

[AMDGCN] Run LoadStoreVectorizer before CodeGenPrepare
AbandonedPublic

Authored by rampitec on Apr 9 2020, 2:05 PM.

Download Raw Diff

Details

Reviewers

arsenm
yaxunl

Summary

AMDGPUCodeGenPrepare widens some loads which then prevent
vectorization of an otherwise vectorizable load pair.

Run vectorizer pass before AMDGPUCodeGenPrepare to catch
the opportunity. The second run is still in place as passes
in between also create a lot of vectorization opportunities.

Diff Detail

Event Timeline

rampitec created this revision.Apr 9 2020, 2:05 PM

Herald added subscribers: kerbowa, hiraditya, tpr and 3 others. · View Herald TranscriptApr 9 2020, 2:05 PM

arsenm added inline comments.Apr 9 2020, 2:13 PM

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
742	This pass is already problematic for compile time and hits quadratic behavior, so I'm not thrilled about running it twice. Why can't we just run it once?

arsenm added inline comments.Apr 9 2020, 2:24 PM

llvm/test/CodeGen/AMDGPU/vectorize-loads.ll
18	If the problem is just the widened constant loads, we could maybe handle this directly in the vectorizer.

rampitec marked 2 inline comments as done.Apr 9 2020, 2:27 PM

rampitec added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
742	There are 111 regressions in check-llvm-codegen-amdgpu if I just move it. All that I have checked are real.
llvm/test/CodeGen/AMDGPU/vectorize-loads.ll
18	Looks like vectorizer would need to be much smarter than it is now to do so.

arsenm added inline comments.Apr 9 2020, 2:27 PM

llvm/test/CodeGen/AMDGPU/vectorize-loads.ll
18	We could also do the widening later

rampitec marked an inline comment as done.Apr 9 2020, 2:48 PM

rampitec added inline comments.

llvm/test/CodeGen/AMDGPU/vectorize-loads.ll
18	Split AMDGPUCodeGenPrepare to early and late? It should be possible. If I use -amdgpu-codegenprepare-widen-constant-loads=0 this test is vectorized, so the only known to me problem is widening.

arsenm added inline comments.Apr 9 2020, 2:51 PM

llvm/test/CodeGen/AMDGPU/vectorize-loads.ll
18	We don't really need to do the load widening in the IR. We should be able to do in the DAG/gMIR level

In fact we do it on a DAG level, look at t14 transformed into t18:

Initial selection DAG: %bb.0 'test:'
SelectionDAG has 18 nodes:
  t0: ch = EntryToken
  t3: i64 = Constant<0>
    t2: i64,ch = CopyFromReg t0, Register:i64 %3
  t5: i64,ch = load<(dereferenceable invariant load 8, align 16, addrspace 4)> t0, t2, undef:i64
      t6: i64,ch = merge_values t5, t5:1
            t11: i64 = llvm.amdgcn.dispatch.ptr TargetConstant:i64<1116>
          t13: i64 = add nuw t11, Constant:i64<4>
        t14: i16,ch = load<(load 2 from %ir.4, align 4, !tbaa !10, addrspace 4)> t0, t13, undef:i64
      t15: i32 = zero_extend t14
        t8: i64 = llvm.amdgcn.kernarg.segment.ptr TargetConstant:i64<1636>
      t9: i64,ch = load<(dereferenceable invariant load 8 from %ir..kernarg.offset.cast, align 16, addrspace 4)> t0, t8, undef:i64
    t16: ch = store<(store 4 into %ir..load, !tbaa !19, addrspace 1)> t6:1, t15, t9, undef:i64
  t17: ch = ENDPGM t16


Optimized lowered selection DAG: %bb.0 'test:'
SelectionDAG has 12 nodes:
  t0: ch = EntryToken
          t11: i64 = llvm.amdgcn.dispatch.ptr TargetConstant:i64<1116>
        t13: i64 = add nuw t11, Constant:i64<4>
      t18: i32,ch = load<(load 2 from %ir.4, align 4, !tbaa !10, addrspace 4), zext from i16> t0, t13, undef:i64
        t8: i64 = llvm.amdgcn.kernarg.segment.ptr TargetConstant:i64<1636>
      t9: i64,ch = load<(dereferenceable invariant load 8 from %ir..kernarg.offset.cast, align 16, addrspace 4)> t0, t8, undef:i64
    t19: ch = store<(store 4 into %ir..load, !tbaa !19, addrspace 1)> t0, t18, t9, undef:i64
  t17: ch = ENDPGM t19

IR widening was done by your commit https://reviews.llvm.org/rG90083d3088ae, there you mention "the very weak load merging the DAG does".
However, when I turned off the option -amdgpu-codegenprepare-widen-constant-loads the only test which has failed was widen_extending_scalar_loads.ll for obvious reason.

Why don't I just turn off this option? It does not seem to do anything good anymore.

An alternative: D77835

rampitec abandoned this revision.Apr 10 2020, 8:02 AM

Revision Contents

Path

Size

llvm/

lib/

Target/

AMDGPU/

AMDGPUTargetMachine.cpp

3 lines

test/

CodeGen/

AMDGPU/

vectorize-loads.ll

27 lines

Diff 256402

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

Show First 20 Lines • Show All 733 Lines • ▼ Show 20 Lines	if (TM.getOptLevel() > CodeGenOpt::None) {
if (EnableAMDGPUAliasAnalysis) {		if (EnableAMDGPUAliasAnalysis) {
addPass(createAMDGPUAAWrapperPass());		addPass(createAMDGPUAAWrapperPass());
addPass(createExternalAAWrapperPass([](Pass &P, Function &,		addPass(createExternalAAWrapperPass([](Pass &P, Function &,
AAResults &AAR) {		AAResults &AAR) {
if (auto *WrapperPass = P.getAnalysisIfAvailable<AMDGPUAAWrapperPass>())		if (auto *WrapperPass = P.getAnalysisIfAvailable<AMDGPUAAWrapperPass>())
AAR.addAAResult(WrapperPass->getResult());		AAR.addAAResult(WrapperPass->getResult());
}));		}));
}		}

		arsenmUnsubmitted Not Done Reply Inline Actions This pass is already problematic for compile time and hits quadratic behavior, so I'm not thrilled about running it twice. Why can't we just run it once? arsenm: This pass is already problematic for compile time and hits quadratic behavior, so I'm not…
		rampitecAuthorUnsubmitted Done Reply Inline Actions There are 111 regressions in check-llvm-codegen-amdgpu if I just move it. All that I have checked are real. rampitec: There are 111 regressions in check-llvm-codegen-amdgpu if I just move it. All that I have…
		if (EnableLoadStoreVectorizer)
		addPass(createLoadStoreVectorizerPass());
}		}

if (TM.getTargetTriple().getArch() == Triple::amdgcn) {		if (TM.getTargetTriple().getArch() == Triple::amdgcn) {
// TODO: May want to move later or split into an early and late one.		// TODO: May want to move later or split into an early and late one.
addPass(createAMDGPUCodeGenPreparePass());		addPass(createAMDGPUCodeGenPreparePass());
}		}

TargetPassConfig::addIRPasses();		TargetPassConfig::addIRPasses();
▲ Show 20 Lines • Show All 453 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/vectorize-loads.ll

This file was added.

				; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -verify-machineinstrs < %s \| FileCheck -check-prefix=GCN %s

				; GCN-LABEL: {{^}}load_idx_idy:
				; GCN-NOT: global_load
				; GCN: s_load_dword [[ID_XY:s[0-9]+]], s[4:5], 0x4
				; GCN-NOT: global_load
				; GCN: s_lshr_b32 [[ID_Y:s[0-9]+]], [[ID_XY]], 16
				; GCN: s_add_i32 [[ID_SUM:s[0-9]+]], [[ID_Y]], [[ID_XY]]
				; GCN: s_and_b32 s{{[0-9]+}}, [[ID_SUM]], 0xffff
				define protected amdgpu_kernel void @load_idx_idy(i32 addrspace(1)* %out) {
				entry:
				%disp = tail call align 4 dereferenceable(64) i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()
				%gep_x = getelementptr i8, i8 addrspace(4)* %disp, i64 4
				%gep_x.cast = bitcast i8 addrspace(4)* %gep_x to i16 addrspace(4)*
				%id_x = load i16, i16 addrspace(4)* %gep_x.cast, align 4, !invariant.load !0 ; load workgroup size x
				%gep_y = getelementptr i8, i8 addrspace(4)* %disp, i64 6
				%gep_y.cast = bitcast i8 addrspace(4)* %gep_y to i16 addrspace(4)*
				%id_y = load i16, i16 addrspace(4)* %gep_y.cast, align 2, !invariant.load !0 ; load workgroup size y
				arsenmUnsubmitted Not Done Reply Inline Actions If the problem is just the widened constant loads, we could maybe handle this directly in the vectorizer. arsenm: If the problem is just the widened constant loads, we could maybe handle this directly in the…
				rampitecAuthorUnsubmitted Done Reply Inline Actions Looks like vectorizer would need to be much smarter than it is now to do so. rampitec: Looks like vectorizer would need to be much smarter than it is now to do so.
				arsenmUnsubmitted Not Done Reply Inline Actions We could also do the widening later arsenm: We could also do the widening later
				rampitecAuthorUnsubmitted Done Reply Inline Actions Split AMDGPUCodeGenPrepare to early and late? It should be possible. If I use -amdgpu-codegenprepare-widen-constant-loads=0 this test is vectorized, so the only known to me problem is widening. rampitec: Split AMDGPUCodeGenPrepare to early and late? It should be possible. If I use -amdgpu…
				arsenmUnsubmitted Not Done Reply Inline Actions We don't really need to do the load widening in the IR. We should be able to do in the DAG/gMIR level arsenm: We don't really need to do the load widening in the IR. We should be able to do in the DAG/gMIR…
				%add = add nuw nsw i16 %id_y, %id_x
				%conv = zext i16 %add to i32
				store i32 %conv, i32 addrspace(1)* %out, align 4
				ret void
				}

				declare i8 addrspace(4)* @llvm.amdgcn.dispatch.ptr()

				!0 = !{!0}