This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contents

llvm/
  lib/Target/AMDGPU/
    SISchedule.td
  test/CodeGen/AMDGPU/
    GlobalISel/
      cvt_f32_ubyte.ll
      divergent-control-flow.ll
      extractelement.i128.ll
      extractelement.i8.ll
      extractelement.ll
      fmed3.ll
      fp64-atomics-gfx90a.ll
      insertelement.i16.ll
      insertelement.i8.ll
      lds-global-non-entry-func.ll
      lds-global-value.ll
      llvm.amdgcn.atomic.dec.ll
      llvm.amdgcn.atomic.inc.ll
      llvm.amdgcn.div.fmas.ll
      llvm.amdgcn.div.scale.ll
      llvm.amdgcn.global.atomic.fadd.ll
      llvm.amdgcn.mfma.gfx90a.ll
      llvm.amdgcn.mov.dpp.ll
      llvm.amdgcn.sbfe.ll
      llvm.amdgcn.set.inactive.ll
      llvm.amdgcn.trig.preop.ll
      llvm.amdgcn.ubfe.ll
      llvm.amdgcn.update.dpp.ll
      load-constant.96.ll
      load-local.96.ll
      load-unaligned.ll
      localizer.ll
      non-entry-alloca.ll
      sdivrem.ll
      store-local.128.ll
      store-local.96.ll
      udivrem.ll
    abi-attribute-hints-undefined-behavior.ll
    add.v2i16.ll
    amdgpu-codegenprepare-idiv.ll
    and.ll
    anyext.ll
    atomic_optimizations_global_pointer.ll
    bitreverse.ll
    branch-relaxation-inst-size-gfx10.ll
    branch-relaxation.ll
    bswap.ll
    build_vector.ll
    call-argument-types.ll
    callee-special-input-vgprs.ll
    chain-hi-to-lo.ll
    code-object-v3.ll
    commute-shifts.ll
    copy-illegal-type.ll
    copy_to_scc.ll
    ctlz.ll
    ctlz_zero_undef.ll
    ctpop16.ll
    ctpop64.ll
    cttz.ll
    cttz_zero_undef.ll
    cvt_f32_ubyte.ll
    dag-divergence-atomic.ll
    dbg-value-ends-sched-region.mir
    ds-alignment.ll
    ds_read2.ll
    ds_write2.ll
    extract_vector_elt-i16.ll
    extract_vector_elt-i8.ll
    fabs.ll
    fast-unaligned-load-store.global.ll
    fast-unaligned-load-store.private.ll
    fmax_legacy.f64.ll
    fmin_legacy.f64.ll
    fminnum.f64.ll
    fp-min-max-atomics.ll
    fp64-atomics-gfx90a.ll
    fp_to_sint.ll
    fp_to_uint.ll
    fptosi.f16.ll
    fptoui.f16.ll
    frame-index-elimination.ll
    fshl.ll
    fshr.ll
    fused-bitlogic.ll
    gfx-callable-argument-types.ll
    half.ll
    idot2.ll
    idot4s.ll
    idot4u.ll
    idot8s.ll
    idot8u.ll
    imm.ll
    imm16.ll
    immv216.ll
    insert-subvector-unused-scratch.ll
    insert_vector_dynelt.ll
    insert_vector_elt.ll
    insert_vector_elt.v2i16.ll
    kernel-args.ll
    kernel-argument-dag-lowering.ll
    lds-atomic-fmin-fmax.ll
    llvm.amdgcn.class.f16.ll
    llvm.amdgcn.cvt.pkrtz.ll
    llvm.amdgcn.fmad.ftz.ll
    llvm.amdgcn.image.dim.ll
    llvm.amdgcn.image.sample.dim.ll
    llvm.amdgcn.raw.tbuffer.load.ll
    llvm.amdgcn.set.inactive.ll
    llvm.amdgcn.struct.tbuffer.load.ll
    llvm.amdgcn.tbuffer.load.ll
    llvm.amdgcn.ubfe.ll
    llvm.cos.f16.ll
    llvm.fma.f16.ll
    llvm.fmuladd.f16.ll
    llvm.maxnum.f16.ll
    llvm.minnum.f16.ll
    llvm.round.f64.ll
    llvm.sin.f16.ll
    load-constant-i16.ll
    load-global-i16.ll
    load-hi16.ll
    load-lo16.ll
    load-local.128.ll
    load-local.96.ll
    local-memory.amdgcn.ll
    lshl64-to-32.ll
    lshr.v2i16.ll
    max.i16.ll
    memory-legalizer-flat-agent.ll
    memory-legalizer-flat-nontemporal.ll
    memory-legalizer-flat-singlethread.ll
    memory-legalizer-flat-system.ll
    memory-legalizer-flat-volatile.ll
    memory-legalizer-flat-wavefront.ll
    memory-legalizer-flat-workgroup.ll
    memory-legalizer-global-agent.ll
    memory-legalizer-global-nontemporal.ll
    memory-legalizer-global-singlethread.ll
    memory-legalizer-global-system.ll
    memory-legalizer-global-volatile.ll
    memory-legalizer-global-wavefront.ll
    memory-legalizer-global-workgroup.ll
    memory-legalizer-local-agent.ll
    memory-legalizer-local-nontemporal.ll
    memory-legalizer-local-singlethread.ll
    memory-legalizer-local-system.ll
    memory-legalizer-local-volatile.ll
    memory-legalizer-local-wavefront.ll
    memory-legalizer-local-workgroup.ll
    memory-legalizer-private-nontemporal.ll
    memory-legalizer-private-volatile.ll
    memory_clause.ll
    min.ll
    missing-store.ll
    move-addr64-rsrc-dead-subreg-writes.ll
    mul_int24.ll
    mul_uint24-amdgcn.ll
    operand-spacing.ll
    packed-op-sel.ll
    pr51516.mir
    promote-constOffset-to-imm.ll
    saddo.ll
    salu-to-valu.ll
    scalar_to_vector.ll
    sched-assert-onlydbg-value-empty-region.mir
    schedule-ilp.mir
    sdiv.ll
    sdiv64.ll
    sdwa-peephole.ll
    select.f16.ll
    select64.ll
    sext-divergence-driven-isel.ll
    sgpr-control-flow.ll
    shift-and-i128-ubfe.ll
    shift-i128.ll
    shl.ll
    shl.v2i16.ll
    shl_add_ptr_global.ll
    shrink-add-sub-constant.ll
    si-annotate-cf.ll
    si-triv-disjoint-mem-access.ll
    sign_extend.ll
    sint_to_fp.i64.ll
    sra.ll
    srem64.ll
    srl.ll
    store-local.128.ll
    store-local.96.ll
    store-weird-sizes.ll
    sub.ll
    sub.v2i16.ll
    udiv.ll
    udiv64.ll
    udivrem.ll
    uint_to_fp.i64.ll
    uniform-cfg.ll
    urem64.ll
    use-sgpr-multiple-times.ll
    v_madak_f16.ll
    vector-extract-insert.ll
    vector_shuffle.packed.ll
    widen-smrd-loads.ll
    wwm-reserved-spill.ll

Differential D114777

[AMDGPU] Set most sched model resource's BufferSize to one
Closed · Public

Authored by kerbowa on Nov 30 2021, 12:26 AM.

Details

Reviewers

rampitec
arsenm
foad
vangthao
vpykhtin

Commits

rGda067ed569e0: [AMDGPU] Set most sched model resource's BufferSize to one

Summary

Using a BufferSize of one for memory ProcResources will result in better
ILP since it more accurately models the dependencies between memory ops
and their consumers on an in-order processor. After this change, the
scheduler will treat the data edges from loads as blocking so that
stalls are guaranteed when waiting for data to be retrieved from memory.
Since we don't actually track waitcnt here, this should do a better job
at modeling their behavior.

Practically, this means that the scheduler will trigger the 'STALL'
heuristic more often.
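
As a rough illustration of the stall heuristic mentioned above, here is a toy Python model. It is not LLVM code. It assumes, loosely following the generic machine scheduler, that an instruction using a resource with BufferSize of one is treated as unbuffered, so the cycles remaining until its operands are ready count as stall cycles when candidates are compared. The names `SUnit`, `latency_stall_cycles` and `pick`, and all cycle numbers, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SUnit:
    """Hypothetical stand-in for a scheduling unit (one instruction)."""
    name: str
    ready_cycle: int   # cycle at which all of its operands are available
    buffer_size: int   # BufferSize of the ProcResource the instruction uses


def latency_stall_cycles(su: SUnit, curr_cycle: int) -> int:
    """Stall cycles attributed to issuing `su` at `curr_cycle`.

    Assumption: only BufferSize == 1 ("unbuffered") resources expose the
    remaining operand latency as a stall; any other value reports zero.
    """
    if su.buffer_size != 1:
        return 0
    return max(0, su.ready_cycle - curr_cycle)


def pick(candidates, curr_cycle):
    """Prefer the candidate with the fewest stall cycles.

    Ties keep list order, standing in for the scheduler's later heuristics.
    """
    return min(candidates, key=lambda su: latency_stall_cycles(su, curr_cycle))


# A dependent instruction whose operands arrive at cycle 10, and an
# independent one that is ready now (cycle 4).
dependent = SUnit("dependent_op", ready_cycle=10, buffer_size=1)
independent = SUnit("independent_op", ready_cycle=4, buffer_size=1)

# With BufferSize == 1 the dependent op reports 6 stall cycles, so the
# independent op is chosen first.
chosen_unbuffered = pick([dependent, independent], curr_cycle=4)

# With a larger BufferSize (15 here) no stall is reported, the two
# candidates tie, and list order decides.
dependent_buffered = SUnit("dependent_op", ready_cycle=10, buffer_size=15)
chosen_buffered = pick([dependent_buffered, independent], curr_cycle=4)
```

In this sketch, a BufferSize of one is what makes the pending latency visible to the comparison. With a larger buffer the two candidates look equally good at this step, and the choice falls to other criteria.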

This type of change needs to be evaluated experimentally. Preliminary
results are positive.

See test: schedule-ilp.mir

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

kerbowa created this revision. · Nov 30 2021, 12:26 AM

Herald added subscribers: wenlei, arphaman, hiraditya and 8 others. · Nov 30 2021, 12:26 AM

kerbowa requested review of this revision. · Nov 30 2021, 12:26 AM

Herald added a project: Restricted Project. · Nov 30 2021, 12:26 AM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B136620: Diff 390598. · Nov 30 2021, 1:06 AM

critson added a subscriber: critson. · Nov 30 2021, 1:14 AM

Would it be possible to write some llvm-mca tests that show exactly how this change affects the scheduling model? schedule-ilp.mir is better than nothing, but it is still a very indirect way of observing what's going on under the covers.

I am mostly in favour of this change just because I never understood how setting BufferSize to (say) 15 bore any relation to the fact that we can have 15 VMEM loads in flight at once. In fact I really don't understand what effect all these different ProcResources have. I think they will stop the scheduler from trying to issue two different (say) VMEM instructions in the same cycle, because they both need to use the same resource; but we have already set IssueWidth = 1, so the scheduler should already know that it can't issue any two instructions in the same cycle.

Incidentally in D87621 we discussed whether most resources should use BufferSize = 0 or 1. I am still unsure.

LGTM. As previously discussed, that seems to describe our HW better. If you can add a test as suggested by Jay, that will not hurt.

This revision is now accepted and ready to land. · Nov 30 2021, 9:11 AM

On GFX10.1 Vulkan with our usual test cases this seems to be performance agnostic.

Closed by commit rGda067ed569e0: [AMDGPU] Set most sched model resource's BufferSize to one (authored by kerbowa). · Dec 1 2021, 10:35 PM

This revision was automatically updated to reflect the committed changes.

kerbowa added a commit: rGda067ed569e0: [AMDGPU] Set most sched model resource's BufferSize to one.

In D114777#3160939, @foad wrote:

Would it be possible to write some llvm-mca tests that show exactly how this change affects the scheduling model? schedule-ilp.mir is better than nothing, but it is still a very indirect way of observing what's going on under the covers.

I will add these mca tests along with another patch to improve scheduling for latency. This fixes a few pressing perf issues so I'm committing it now.

foad mentioned this in D112839: [AMDGPU][NFC] Remove autogenerated comment for test. · Dec 3 2021, 4:58 AM