This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Add perf hints to functions
ClosedPublic

Authored by rampitec on May 16 2018, 6:12 PM.

Details

Summary

This is adoption of HSAIL perfhint pass. Two types of hints are produced:

  1. Function is memory bound.
  2. Kernel can use wave limiter.

Currently these hints are used in the scheduler. If a function is suspected
to be memory bound we allow occupancy to decrease to 4 waves in the course
of scheduling.

Diff Detail

Event Timeline

rampitec created this revision.May 16 2018, 6:12 PM
t-tye added inline comments.May 16 2018, 9:31 PM
lib/Target/AMDGPU/AMDGPUPerfHint.cpp
289 ↗(On Diff #147223)

Should this be done at the beginning of the visit to ensure will terminate for mutual recursive functions?

388 ↗(On Diff #147223)

indentation

392 ↗(On Diff #147223)

Does having the std::move() prevent named return value optimization (which can happen for the return above)? Returning a prvalue (eg by directly returning a constructor) would get guaranteed copy elision in C++17.

mareko added a subscriber: mareko.May 16 2018, 9:56 PM

How can UMDs disable this optimization?

Are there cases where this decreases performance?

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
13 ↗(On Diff #147223)

Did you mean "cache thrashing"?

How can UMDs disable this optimization?

Are there cases where this decreases performance?

This is analysis. Optimization itself must be done in the runtime. OpenCL RT used to control it with the env. Graphics RT never did it. At any rate if you know your ideal occupancy it is better to set amdgpu-waves-per-eu attribute.

The only optimization implemented here based on the analysis is in the scheduler. On practice there is no way for a memory intensive program to benefit from an occupancy higher than 4, usually it is lower. However, the impact of the optimization is to let scheduler work where previously it just reverted the schedule if occupancy has decreased. Therefor the natural way to return the old behavior after this change is to disable scheduler (-enable-mished=0) which will result in the same code as before if this condition is triggered.

arsenm added inline comments.May 17 2018, 3:17 AM
lib/Target/AMDGPU/AMDGPUPerfHint.cpp
336–337 ↗(On Diff #147223)

You're not really supposed to use attributes to pass information through to others, although we do this in a few places to hack around isel limitations. Can you make this an analysis pass instead which returns yes / no at the point you actually need this?

382–383 ↗(On Diff #147223)

Should handle other memory operations too, like atomics and intrinsics. There is already a wrapper somewhere which should find the pointer operand for all of these operations

rampitec updated this revision to Diff 147421.May 17 2018, 5:51 PM
rampitec marked 6 inline comments as done.

Addressed review comments.
Pass is converted to analysis.

lib/Target/AMDGPU/AMDGPUPerfHint.cpp
289 ↗(On Diff #147223)

Thank you!

392 ↗(On Diff #147223)

After the port it is trivially copyable, so move is not required any more.

arsenm added inline comments.May 18 2018, 1:25 AM
lib/Target/AMDGPU/AMDGPUPerfHint.cpp
1 ↗(On Diff #147223)

Update comment

11–13 ↗(On Diff #147223)

Comment needs update.

Maybe add a todo that this should be a machine analysis?

166 ↗(On Diff #147223)

What does this mean exactly by indirect access?

This seems to me like it's reimplementing something like GetUnderlyingObject

259 ↗(On Diff #147223)

Probably should check for CallSite to cover the possible future case of InvokeInsts

260–261 ↗(On Diff #147223)

!Callee

271 ↗(On Diff #147223)

Extra space

277–278 ↗(On Diff #147223)

isLegalAddressingMode (although at this point this should probably be a machine pass, but I understand that's more work to rewrite)

289 ↗(On Diff #147223)

Can you add a test for this case

397 ↗(On Diff #147223)

There's also CONSTANT_ADDRESS_32BIT

lib/Target/AMDGPU/SIDefines.h
538–539 ↗(On Diff #147223)

Should be able to also remove this

lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
178–184

It seems like there's no reason to actually put this code in SIMachineFunctionInfo. Can you just do this directly in the AsmPrinter where you emit this?

yaxunl added inline comments.May 18 2018, 4:08 AM
lib/Target/AMDGPU/AMDGPUPerfHint.cpp
166 ↗(On Diff #147223)

indirect access here means something like a[b[i]], i.e., the index of the array is loaded from memory. It usually results in random access in stead of stream access. Probably it can have a better name.

rampitec added inline comments.May 18 2018, 9:39 AM
lib/Target/AMDGPU/AMDGPUPerfHint.cpp
11–13 ↗(On Diff #147223)

I do not think it has to be machine IR pass. It will be really difficult to perform this analysis on machine IR.

289 ↗(On Diff #147223)

Only as opt test. BE will fail on recursion.

lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
178–184

By the time it is needed function's IR is already destroyed. Note, it is not only needed from printer, it is also checked in the scheduler.

rampitec added inline comments.May 18 2018, 9:40 AM
lib/Target/AMDGPU/AMDGPUPerfHint.cpp
260–261 ↗(On Diff #147223)

This file does not exist in the patch. You seem to comment on the old version somehow.

rampitec updated this revision to Diff 147546.May 18 2018, 10:31 AM
rampitec marked 5 inline comments as done.
rampitec added inline comments.
lib/Target/AMDGPU/AMDGPUPerfHint.cpp
289 ↗(On Diff #147223)

Actually since it is now an on-demand analysis I cannot do it even with opt. We will need to generally fix recursion handling in the BE, it is not a problem specific to this patch.

lib/Target/AMDGPU/SIDefines.h
538–539 ↗(On Diff #147223)

It was removed when pass was converted to analysis. Please check the current patch.

arsenm added inline comments.May 18 2018, 3:13 PM
lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
258–262

isLegalAddressingMode.

I'm not sure I understand why this pass is doing most of what it's doing. Why does the addressing mode match matter for determining if the function is probably memory bound?

With a machine pass you would have a much more exact idea of the number of memory operations really being executed

rampitec added inline comments.May 18 2018, 3:35 PM
lib/Target/AMDGPU/AMDGPUPerfHint.cpp
277–278 ↗(On Diff #147223)

There is no TLI or subtarget here yet.

lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
258–262

It matters because we are trying to estimate memory to ALU instruction ratio. A foldable GEP does not result in an instruction.

In fact this is a rough estimation, completely correct answer is not needed.

On the machine IR in turn it will be very difficult to track pointers.

rampitec updated this revision to Diff 147628.May 18 2018, 5:59 PM
rampitec marked 2 inline comments as done.

How does this pass affect shaders that use a lot of memory instructions but no pointers?

How does this pass affect shaders that use a lot of memory instructions but no pointers?

Can you give an example? What is a memory instruction without a pointer? As you may see, pass processes something which can cast to load, store, atomic or memory intrinsic. Everything else considered an ordinary instruction. For example if have an image in mind it is conservatively not considered memory instruction.

How does this pass affect shaders that use a lot of memory instructions but no pointers?

Can you give an example? What is a memory instruction without a pointer? As you may see, pass processes something which can cast to load, store, atomic or memory intrinsic. Everything else considered an ordinary instruction. For example if have an image in mind it is conservatively not considered memory instruction.

A memory instruction that doesn't use a pointer is an instruction that uses a resource descriptor (buffer or image). The majority of non-compute workloads use resource descriptors for all memory accesses (except those that load descriptors from memory).

How does this pass affect shaders that use a lot of memory instructions but no pointers?

Can you give an example? What is a memory instruction without a pointer? As you may see, pass processes something which can cast to load, store, atomic or memory intrinsic. Everything else considered an ordinary instruction. For example if have an image in mind it is conservatively not considered memory instruction.

A memory instruction that doesn't use a pointer is an instruction that uses a resource descriptor (buffer or image). The majority of non-compute workloads use resource descriptors for all memory accesses (except those that load descriptors from memory).

Ok, that's what I meant, it is considered an ordinary instruction. It is just not covered by the pass and there is no impact. Primarily because no measurements were done, no workloads analyzed and no statistics collected to try to perform any optimizations of that kind on compute side. gfx side may have data collected, but the pass will do nothing to that kind of loads.

rampitec updated this revision to Diff 147856.May 21 2018, 1:33 PM
rampitec marked 2 inline comments as done.

Switched to use isLegalAddressingMode in GEP processing.

arsenm added inline comments.May 21 2018, 2:50 PM
lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
37

Naming convention for existing flags seem to all use the full word threshold (same for the rest)

160

Seems like a SmallSet?

241–243

Run clang-format

294

Move this up to avoid calling the same find twice? visit can also return the inserted iterator

316

I think there's a policy of generally avoiding FP computations

321

Ditto

lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.h
2

c++ mode comment

lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
182

Missing space

rampitec updated this revision to Diff 147886.May 21 2018, 3:10 PM
rampitec marked 8 inline comments as done.
rampitec added inline comments.
lib/Target/AMDGPU/AMDGPUPerfHintAnalysis.cpp
294

It is not the same, it is one after visit() call. But yes, changing to visit to return the iterator.

rampitec updated this revision to Diff 147937.May 21 2018, 7:43 PM

Small cleanup after last changes.

LGTM. Just a hint: whenever you use "auto X = ..." it's worth to specify explicitly if X is pointer or reference. It's not only saves you from accidental temp object by copy but also makes program easier to read.

lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
178

I would use "auto &Resolver" to emphasize Resolver is a reference not a temp object

rampitec marked an inline comment as done.May 25 2018, 9:56 AM
rampitec added inline comments.
lib/Target/AMDGPU/SIMachineFunctionInfo.cpp
178

Resolver is a pointer, not an object or reference.

rampitec updated this revision to Diff 148624.May 25 2018, 10:13 AM
rampitec marked an inline comment as done.

Rebase to master.
Moved info from SIMachineFunctionInfo into its parent AMDGPUMachineFunction since SI/R600 is not clearly separated in AMDGPUAsmPrinter anymore.

vpykhtin accepted this revision.May 25 2018, 10:16 AM
This revision is now accepted and ready to land.May 25 2018, 10:16 AM
This revision was automatically updated to reflect the committed changes.