This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU/MemOpsCluster] Guard new mem ops clustering heuristic logic by a flag
Abandoned · Public

Authored by hsmhsm on Jul 11 2020, 11:16 AM.

Details

Reviewers
foad
arsenm
cfang
Summary

For the mem ops clustering logic, keep both the old and the new logic, and guard
the new logic behind a flag. By default, the flag is disabled and the old logic
remains in place. When the flag is enabled, the new logic is used instead. The flag to
enable the new logic is:
--amdgpu-enable-mem-ops-cluster-heuristic-based-on-clustered-bytes=true.
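As an illustration only (not the actual diff), a minimal, self-contained sketch of the pattern described above: a cl::opt that is off by default and selects the new heuristic when enabled. The variable name and the placeholder heuristic bodies below are hypothetical.

```cpp
#include "llvm/Support/CommandLine.h"

using namespace llvm;

// Hypothetical flag variable; off by default, so the old logic stays in place.
static cl::opt<bool> EnableMemOpsClusterHeuristicBasedOnClusteredBytes(
    "amdgpu-enable-mem-ops-cluster-heuristic-based-on-clustered-bytes",
    cl::desc("Use the clustered-bytes based mem ops clustering heuristic"),
    cl::init(false), cl::Hidden);

// Placeholder implementations; not the real heuristics from the patch.
static bool oldHeuristic(unsigned NumLoads, unsigned NumBytes) {
  (void)NumBytes;
  return NumLoads <= 4; // stand-in for the pre-existing load-count logic
}

static bool newHeuristic(unsigned NumLoads, unsigned NumBytes) {
  (void)NumLoads;
  return NumBytes <= 64; // stand-in for the clustered-bytes based logic
}

static bool shouldClusterMemOpsSketch(unsigned NumLoads, unsigned NumBytes) {
  // Flag off (the default): keep the old logic; flag on: use the new logic.
  return EnableMemOpsClusterHeuristicBasedOnClusteredBytes
             ? newHeuristic(NumLoads, NumBytes)
             : oldHeuristic(NumLoads, NumBytes);
}
```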

Diff Detail

Event Timeline

hsmhsm created this revision.Jul 11 2020, 11:16 AM
foad added inline comments.Jul 13 2020, 12:35 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
103

Is there a more useful description than "new heuristic"? If we're asking people to choose between two heuristics then the fact that one of them is currently "new" isn't really helpful.

hsmhsm updated this revision to Diff 277346.Jul 13 2020, 2:03 AM

Addressed the review comment from Jay.

hsmhsm updated this revision to Diff 277353.Jul 13 2020, 2:33 AM

Rebased to upstream master.

hsmhsm edited the summary of this revision. (Show Details)Jul 13 2020, 2:37 AM
hsmhsm edited the summary of this revision. (Show Details)
rampitec added inline comments.Jul 13 2020, 10:06 AM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
562

You have a problem with extremely wide loads. I am not sure what was in the regression case, but it was probably something like 8 longs or so. Isn't it better to tweak it instead and just clamp based on NumBytes, as it is supposed to be? You are saying you are checking NumBytes, but the return value is based solely on NumLoads.
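One hypothetical way to read "clamp based on NumBytes"; the threshold below is an arbitrary example value, not taken from any actual patch:

```cpp
// Hypothetical sketch: cap the total number of clustered bytes instead of
// the number of loads. The 32-byte limit is an arbitrary example value.
static bool clampByBytes(unsigned NumLoads, unsigned NumBytes) {
  (void)NumLoads; // deliberately unused: the decision is bytes-only here
  const unsigned MaxClusterBytes = 32;
  return NumBytes <= MaxClusterBytes;
}
```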

hsmhsm marked an inline comment as done.Jul 13 2020, 11:55 AM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
562

Let's consider the example from your reply to my email:

Stas' Reply:

Now say you have 4 incoming loads of <8 x i64>. According to your logic: this is 8*64*4/8 = 256 bytes, MaxNumLoads = 4. You would say it is OK to cluster, because 4 <= 4 == true.
Now say you have 8 loads of <8 x i32>. It is the same 256 bytes, but you inhibit clustering because 8 <= 4 == false.

It is not really based on the number of bytes, it is based on the number of loads.

As per my understanding, the crux of the logic is this: even when the clustered byte count is large, if it is shared by only a small number of load instructions, then consider them for clustering. That is where the clustered byte count really comes into play.

In that sense, your first example is fine for the heuristic, but the second one is not.
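
For reference, a small standalone reconstruction of the arithmetic in the two cases quoted above, assuming MaxNumLoads = 4 as stated; this only mirrors the numbers in the discussion, not the code under review:

```cpp
#include <cstdio>

int main() {
  const unsigned MaxNumLoads = 4; // as quoted in the example above

  // Case 1: 4 loads of <8 x i64> = 8 * 64 * 4 / 8 = 256 bytes.
  unsigned NumLoads1 = 4, NumBytes1 = 8 * 64 * 4 / 8;
  // Case 2: 8 loads of <8 x i32> = 8 * 32 * 8 / 8 = 256 bytes.
  unsigned NumLoads2 = 8, NumBytes2 = 8 * 32 * 8 / 8;

  // Both cases cluster the same 256 bytes, but the decision below looks
  // only at the load count, which is the objection being raised.
  std::printf("case 1: %u bytes, cluster = %d\n", NumBytes1,
              NumLoads1 <= MaxNumLoads ? 1 : 0);
  std::printf("case 2: %u bytes, cluster = %d\n", NumBytes2,
              NumLoads2 <= MaxNumLoads ? 1 : 0);
  return 0;
}
```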

hsmhsm marked an inline comment as done.Jul 13 2020, 12:07 PM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
562

And, in any case, I am not sure what you mean by "just clamp based on the NumBytes as it is supposed to be". I would welcome your ideas in a bit more detail, so that I understand them fully and can quickly experiment with them.

Also, reworking the heuristic probably means running all the must-run performance benchmarks again and making sure we are not regressing there, which means the patch will take its own time before it is committed to master.

rampitec added inline comments.Jul 13 2020, 12:12 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
562

I mean your logic can allow extra-wide loads, which is not desired. I suggest limiting it. Did you look at the IR from the regression?

Then, if you are afraid of further regressions and think more testing is needed, just revert the change that led to the regression.

hsmhsm marked an inline comment as done.Jul 13 2020, 12:26 PM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
562

My honest opinion is this: my hands are now tied with other urgent issues, and I do not have any time left to think about it. So the safe bet now is to just continue with the old logic as it is. Since we are not permanently reverting the change anyway, just disabling it, we can come back to it once I get some breathing room and rework it sooner rather than later.

rampitec added inline comments.Jul 13 2020, 12:32 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
562

It simply does not do what you wrote it does. You wrote that it makes the decision based on the number of bytes. It does not.

hsmhsm marked an inline comment as done.Jul 13 2020, 1:01 PM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
562

I had to come up with this heuristic in order to satisfy all the workloads and, more importantly, to make sure that the workload (rocSPARSE) for which the work was primarily undertaken performs as expected.

But, yes, I think I am missing some key insights here.

Any change now means re-evaluating everything from scratch. Also, it is not going to be a very straightforward change, considering all the workloads.

So, let's disable it for the time being.

If you are otherwise fine with this patch, let's merge it first, so that it won't block some key projects.

@foad @rampitec

Ping: please give an LGTM if the patch looks fine.

And, as I said, let's rework the heuristic in the background, as the issues cannot be fixed quickly. There are also other tickets reporting that the PSDB CI run for the patch is causing CEFFE2 tests to fail. So a quick fix is not possible here, and even if we attempted one, it is very likely that something else would pop up again.

I don't like the fact that we're solving problems with such switches.

Fundamentally, more clustering should never hurt from the perspective of the memory system, as long as we really properly cluster by locality. The way it could hurt is by increasing register pressure (and reducing occupancy). So the correct long-term solution is to make the scheduler better at avoiding the register pressure.

I think we can live with this ugliness for now, but I implore you to look at better solutions.

One must-fix inline.

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
103

The name of the cl::opt variable must match the name of the command-line option.
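
For illustration, a hypothetical declaration following that convention, where the variable name reads the same as the option string:

```cpp
#include "llvm/Support/CommandLine.h"

// Illustrative only, not the actual diff: the cl::opt variable name mirrors
// the command-line option string.
static llvm::cl::opt<bool>
    AMDGPUEnableMemOpsClusterHeuristicBasedOnClusteredBytes(
        "amdgpu-enable-mem-ops-cluster-heuristic-based-on-clustered-bytes",
        llvm::cl::desc("Use the clustered-bytes based clustering heuristic"),
        llvm::cl::init(false));
```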

Hi @nhaehnle

I agree with your comments (and also with the points @rampitec made earlier).

We ended up with this heuristic based on various facts, to satisfy all the workloads that we investigated. But now I am realizing that there is a serious loophole in this patch, as @rampitec pointed out earlier.

If you look carefully into the original (old) heuristic, it also claims to be based on clustered bytes, but the counted number of bytes is not very accurate because of the way it is implemented. There is also a FIXME comment about it. My initial journey with this patch started as: (1) remove that FIXME by adding the necessary support, and (2) then fix the performance issues that were raised internally.

But now I realize that working on both (1) and (2) simultaneously led to the current issues. So my goal now is to make sure that the system is as stable as before after doing (1). Once that is achieved, we have 1:1 parity between the old heuristic and its cleaned-up version. Only then can I think about (2).

I am working on (1) right now and will put it up for review once it is ready. From this angle, this patch is no longer required.

I appreciate all the comments made here so far.

Further, I reverted my original commit (cc9d69385659be32178506a38b4f2e112ed01ad4), which had introduced the faulty heuristic, since I am not finding any quick solution here. Now none of the issues should be blocked because of this faulty patch. I will start from scratch again to arrive at a heuristic that is hopefully based purely on the number of clustered bytes, without mixing in the number of clustered instructions.

The revert commit is: 4905536086ee47f26cd13d716eff8aa6424dfdd7

And I am closing this Phabricator review.

hsmhsm abandoned this revision.Jul 17 2020, 12:11 AM

This review request is no longer required. I am abandoning it.