This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU/MemOpsCluster] Implement new heuristic for computing max mem ops cluster size
ClosedPublic

Authored by hsmhsm on Jun 3 2020, 4:48 AM.

Details

Summary

Make use of both the - (1) clustered bytes and (2) cluster length, to decide on
the max number of mem ops that can be clustered. On an average, when loads
are dword or smaller, consider 5 as max threshold, otherwise 4. This heuristic
is purely based on different experimentation conducted, and there is no analytical
logic here.

Diff Detail

Event Timeline

hsmhsm created this revision.Jun 3 2020, 4:48 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 3 2020, 4:48 AM
hsmhsm edited the summary of this revision. (Show Details)Jun 3 2020, 4:50 AM
hsmhsm edited the summary of this revision. (Show Details)
hsmhsm edited the summary of this revision. (Show Details)
hsmhsm edited the summary of this revision. (Show Details)
hsmhsm updated this revision to Diff 268381.Jun 4 2020, 1:03 AM

Fixed LLVM lit test regressions.

foad added inline comments.Jun 5 2020, 1:21 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
480–481

I would suggest avoiding this division (because division is slow and introduces rounding). And the magic numbers need some kind of explanation. How did you come up with them?

So how about reformatting this as:

if (NumBytes <= 4 * NumLoads) {
  // Loads are dword or smaller (on average).
  MaxNumLoads = 5;
} else {
  // Loads are bigger than a dword (on average).
  MaxNumLoads = 4;
}

... plus some explanation of where the numbers 5 and 4 came from, and why they should be different cases.

hsmhsm marked an inline comment as done.Jun 5 2020, 2:09 AM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
480–481

Your suggestions make sense to me. And regarding those magic number 4 and 5, there is no analytical thinking here. It is purely based on experimentation to achieve below three goals.

(1) LLVM lit regressions should be as minimal as possible
(2) OpenCL shoc benchmark should not show any performance degradation (a kind of representative benchmark for compute)
(3) The performance issue in question (rocSPARSE benchmark) should show improvements.

After experimentation with different magic numbers, and different heuristics, this is what I could settle down with.

hsmhsm updated this revision to Diff 268726.Jun 5 2020, 2:55 AM

Have taken care of review comments by Jay.

foad accepted this revision.Jun 5 2020, 3:25 AM

Seems reasonable to me, but I would be happier if at least one other reviewer gave an explicit approval before this is committed.

This revision is now accepted and ready to land.Jun 5 2020, 3:25 AM
hsmhsm added a comment.Jun 5 2020, 3:33 AM

Hi @rampitec,

You already expressed that you have no objection with this patch in one of our internal email communications. As @foad expressed, it would be great if you also take a look at this patch, and officially give LGTM.

hsmhsm edited the summary of this revision. (Show Details)Jun 5 2020, 3:47 AM
hsmhsm updated this revision to Diff 268734.Jun 5 2020, 3:58 AM

Rebase to latest upstream

hsmhsm updated this revision to Diff 269105.Jun 7 2020, 10:35 PM

Rebase to latest upstream.

rampitec accepted this revision.Jun 8 2020, 11:11 AM
hsmhsm updated this revision to Diff 269416.Jun 8 2020, 11:28 PM

Rebased to upstream master to make sure Harbormaster build is fine before commiting this patch.

hsmhsm edited the summary of this revision. (Show Details)Jun 8 2020, 11:30 PM
hsmhsm added a comment.Jun 9 2020, 1:36 AM

Unit test failures in the latest Harbormaster build are spurious - llvm.amdgcn.image.nsa.ll actually passes in my local repo build, and other three failures are unsupported cases.

This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Jun 9 2020, 6:01 AM

Looks like this breaks check-llvm: http://45.33.8.238/linux/19836/step_12.txt

Please take a look and revert if it takes a while to fix.

hsmhsm added a comment.EditedJun 9 2020, 7:52 AM

Hi @thakis

This is surprizing, in my local repo the tests pass, but, it fails in pre-checkin build. I have locally built the patch from scratch with upstream rebased repo, and tested it as below:

mahesha@brego:[AMDGPU]$ ~/ROCm/github/llvm/rel-build/bin/llvm-lit -vv cluster_stores.ll llvm.amdgcn.image.nsa.ll unigine-liveness-crash.ll wqm.ll
-- Testing: 4 tests, 4 workers --
UNSUPPORTED: LLVM :: CodeGen/AMDGPU/cluster_stores.ll (1 of 4)
PASS: LLVM :: CodeGen/AMDGPU/unigine-liveness-crash.ll (2 of 4)
PASS: LLVM :: CodeGen/AMDGPU/llvm.amdgcn.image.nsa.ll (3 of 4)
PASS: LLVM :: CodeGen/AMDGPU/wqm.ll (4 of 4)

Testing Time: 0.31s
  Unsupported: 1
  Passed     : 3
mahesha@brego:[AMDGPU]$
foad added a comment.Jun 10 2020, 3:06 AM

Looks like this breaks check-llvm: http://45.33.8.238/linux/19836/step_12.txt

@hsmhsm you can reproduce the failures if you run the tests under valgrind with llvm-lit --vg. Adding an assert(Width > 0) just after MachineScheduler calls getMemOperandsWithOffsetWidth would make the failure even more obvious.

The problem is: first you committed D80946 which returns Width from all cases in SIInstrInfo::getMemOperandsWithOffsetWidth. Then I committed D74035, which was written earlier, which adds a new isMIMG case but does not set Width.

Looks like this breaks check-llvm: http://45.33.8.238/linux/19836/step_12.txt

@hsmhsm you can reproduce the failures if you run the tests under valgrind with llvm-lit --vg. Adding an assert(Width > 0) just after MachineScheduler calls getMemOperandsWithOffsetWidth would make the failure even more obvious.

The problem is: first you committed D80946 which returns Width from all cases in SIInstrInfo::getMemOperandsWithOffsetWidth. Then I committed D74035, which was written earlier, which adds a new isMIMG case but does not set Width.

Hi @foad,

Ok, this is interesting. Anyway, so, we need to make sure that SIInstrInfo::getMemOperandsWithOffsetWidth() will compute width for all cases. So, for the new case, that you have added, will you going to sumbit a patch which make sure that width is computed for this new case as well?

Looks like this breaks check-llvm: http://45.33.8.238/linux/19836/step_12.txt

@hsmhsm you can reproduce the failures if you run the tests under valgrind with llvm-lit --vg. Adding an assert(Width > 0) just after MachineScheduler calls getMemOperandsWithOffsetWidth would make the failure even more obvious.

The problem is: first you committed D80946 which returns Width from all cases in SIInstrInfo::getMemOperandsWithOffsetWidth. Then I committed D74035, which was written earlier, which adds a new isMIMG case but does not set Width.

Hi @foad,

Ok, this is interesting. Anyway, so, we need to make sure that SIInstrInfo::getMemOperandsWithOffsetWidth() will compute width for all cases. So, for the new case, that you have added, will you going to sumbit a patch which make sure that width is computed for this new case as well?