For the memory-ops clustering logic, keep both the old and the new logic, and guard
the new logic behind a flag. By default the flag is disabled and the old logic remains
in place; when the flag is enabled, the new logic is triggered. The flag to
enable the new logic is:
--amdgpu-enable-mem-ops-cluster-heuristic-based-on-clustered-bytes=true.
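As a rough illustration of the gating described above (a minimal sketch; only the option string comes from this summary, while the helper functions, variable name, and thresholds are placeholders, not the reviewed code):

```cpp
// Minimal sketch only, not the patch itself: a cl::opt that is off by default,
// so the old clustering logic stays in place unless the flag is passed.
// Everything except the option string is hypothetical.
#include "llvm/Support/CommandLine.h"
#include <algorithm>

using namespace llvm;

static cl::opt<bool> EnableMemOpsClusterHeuristicBasedOnClusteredBytes(
    "amdgpu-enable-mem-ops-cluster-heuristic-based-on-clustered-bytes",
    cl::desc("Use the clustered-bytes based mem ops clustering heuristic"),
    cl::init(false)); // disabled by default: old logic remains active

// Hypothetical stand-ins for the two heuristics; thresholds are placeholders.
static bool clusterByOldHeuristic(unsigned NumLoads, unsigned NumBytes) {
  return NumLoads <= 4 && NumBytes <= 256; // placeholder thresholds
}
static bool clusterByNewHeuristic(unsigned NumLoads, unsigned NumBytes) {
  return NumBytes / std::max(NumLoads, 1u) <= 16; // placeholder thresholds
}

// Hypothetical dispatch point: the flag selects which heuristic runs.
static bool shouldClusterMemOpsSketch(unsigned NumLoads, unsigned NumBytes) {
  return EnableMemOpsClusterHeuristicBasedOnClusteredBytes
             ? clusterByNewHeuristic(NumLoads, NumBytes)
             : clusterByOldHeuristic(NumLoads, NumBytes);
}
```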
Diff Detail
Repository: rG LLVM Github Monorepo

Event Timeline
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:103
Is there a more useful description than "new heuristic"? If we're asking people to choose between two heuristics then the fact that one of them is currently "new" isn't really helpful.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:560
You have a problem with extremely wide loads. I am not sure what was in the regression case, but probably something like 8 longs or so. Isn't it better to tweak it instead and just clamp based on NumBytes, as it is supposed to be? You say you are checking NumBytes, but the return is based solely on NumLoads.
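For readers without the diff open, the shape being criticized is roughly the following; this is a simplified sketch based only on what is said in this thread, not the actual SIInstrInfo.cpp code:

```cpp
// Simplified sketch of the behaviour described in this thread: NumBytes is
// computed and passed around, but the final decision compares only NumLoads
// against a fixed budget, so a handful of extremely wide loads still passes.
static bool shouldClusterSketch(unsigned NumLoads, unsigned NumBytes) {
  const unsigned MaxNumLoads = 4; // budget quoted in the follow-up example
  (void)NumBytes;                 // mentioned, but never bounds the result
  return NumLoads <= MaxNumLoads;
}
```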
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:560
Let's consider the example from your reply to my email.

Stas' reply: "Now say you have 4 incoming loads of <8 x i64>. According to your logic this is 8*64*4/8 = 256 bytes, MaxNumLoads = 4. You would say it is OK to cluster, because 4 <= 4 == true. Now say you have 8 loads of <8 x i32>. It is the same 256 bytes, but you inhibit clustering because 8 <= 4 == false. It is not really based on the number of bytes, it is based on the number of loads."

As I understand it, the crux of the logic is this: when the clustered byte count is large but is shared by only a small number of load instructions, then consider them for clustering. That is where the clustered byte count really plays a role. In that sense, your first example is fine for the heuristic, but the second one is not.
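For clarity, the byte counts in the two examples can be checked in isolation (assuming 8-byte i64 and 4-byte i32 elements):

```cpp
// Standalone check of the byte counts in the two examples above.
#include <cstdio>

int main() {
  unsigned WideLoads   = 8 /*elts*/ * 8 /*bytes per i64*/ * 4 /*loads*/; // 256
  unsigned NarrowLoads = 8 /*elts*/ * 4 /*bytes per i32*/ * 8 /*loads*/; // 256
  std::printf("4 x <8 x i64>: %u bytes, 8 x <8 x i32>: %u bytes\n",
              WideLoads, NarrowLoads);
  // Same total bytes, yet "NumLoads <= 4" accepts the first case and rejects
  // the second, which is why the decision is effectively load-count based.
  return 0;
}
```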
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:560
And, in any case, I am not sure what you mean by "just clamp based on NumBytes, as it is supposed to be". I welcome your ideas in a bit more detail so that I understand them fully; that would help me experiment with them quickly. Also, reworking the heuristic probably means re-running all the must-run performance benchmarks and making sure we do not regress there, which means the patch will take its own time before it is committed to master.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:560
I mean that your logic can allow extra-wide loads, which is not desired. I suggest limiting it. Did you look at the regression IR? Then, if you are afraid of further regressions and think more testing is needed, just revert the change that led to the regression.
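One possible reading of "clamp based on NumBytes", shown only to illustrate the suggestion rather than as a reviewed fix:

```cpp
// Illustration of the suggested direction only: bound the clustered bytes
// directly, so a few extremely wide loads are rejected just like a long run
// of narrow ones. The threshold is a placeholder.
static bool shouldClusterByBytesSketch(unsigned NumBytes) {
  const unsigned MaxClusterBytes = 128; // placeholder limit
  return NumBytes <= MaxClusterBytes;
}
```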
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:560
My honest opinion is this: my hands are tied with other urgent issues right now, and I do not have any time left to think about it. So the safe bet for now is to just continue with the old logic as it is. Since we are not permanently reverting the change anyway, only disabling it, we can come back to it once I get some breathing room and rework it sooner rather than later.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:560
It simply does not do what you wrote it does. You wrote that it makes a decision based on the number of bytes. It does not.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:560
I had to come up with this heuristic in order to satisfy all the workloads and, more importantly, to make sure that the workload (rocSPARSE) for which the work was primarily undertaken behaves as expected. But yes, I think I am missing some important key insights here. Any change now means re-evaluating everything from scratch, and it is not going to be a very straightforward change, considering all the workloads. So let's disable it for the time being. If you are otherwise fine with this patch, let's merge it first so that it won't block some key projects.
Ping: please give an LGTM if the patch looks fine.
And, as I said, let's rework the heuristic in the background, as these issues cannot be fixed quickly. There are also other tickets reporting that the PSDB CI run for the patch is causing CEFFE2 tests to fail. So a quick fix is not possible here; even if we attempted one, it is very likely that something else would pop up again.
I don't like the fact that we're solving problems with such switches.
Fundamentally, more clustering should never hurt from the perspective of the memory system, as long as we really properly cluster by locality. The way it could hurt is by increasing register pressure (and reducing occupancy). So the correct long-term solution is to make the scheduler better at avoiding the register pressure.
I think we can live with this ugliness for now, but I implore you to look at better solutions.
One must-fix inline.
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp:103
The name of the cl::opt variable must match the name of the command-line option.
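A small example of the requested correspondence between the cl::opt variable name and the option string; the exact identifier here is a guess, not the reviewed code:

```cpp
// Illustration of the naming request: the C++ identifier mirrors the option
// string (the exact casing/prefix chosen here is hypothetical).
#include "llvm/Support/CommandLine.h"
using namespace llvm;

static cl::opt<bool> AMDGPUEnableMemOpsClusterHeuristicBasedOnClusteredBytes(
    "amdgpu-enable-mem-ops-cluster-heuristic-based-on-clustered-bytes",
    cl::desc("Use the clustered-bytes based mem ops clustering heuristic"),
    cl::init(false));
```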
Hi @nhaehnle
I agree with your comments (and also with the points @rampitec made earlier).
We ended up with this heuristic based on various considerations, to satisfy all the workloads that we investigated. But now I am realizing that there is a serious loophole in this patch, as @rampitec pointed out earlier.
If you look carefully at the original (old) heuristic, it also claims to be based on clustered bytes, but the counted number of bytes is not very accurate because of the way it is implemented; there is also a FIXME comment about it. My journey with this patch started with (1) removing that FIXME by adding the necessary support, and (2) then fixing the performance issues that were raised internally.
But now I realize that working on both (1) and (2) simultaneously led to the current issues. So my goal now is to make sure that the system is as stable as before after (1). Once that is achieved, we will have 1:1 parity between the old heuristic and its cleaned-up version. Only then can I think about (2).
I am working on (1) right now and will put it up for review once it is ready. From this angle, this patch is no longer required.
I appreciate all the comments made here so far.
Further, I have reverted my original commit (cc9d69385659be32178506a38b4f2e112ed01ad4), which had introduced the faulty heuristic, since I am not finding any quick solution here. Now none of the issues should be blocked by this faulty patch. I will start from scratch again to arrive at a heuristic that is, hopefully, purely based on the number of clustered bytes without mixing it up with the number of clustered instructions.
Reverted commit is: 4905536086ee47f26cd13d716eff8aa6424dfdd7