Make use of both the - (1) clustered bytes and (2) cluster length, to decide on
the max number of mem ops that can be clustered. On an average, when loads
are dword or smaller, consider 5 as max threshold, otherwise 4. This heuristic
is purely based on different experimentation conducted, and there is no analytical
logic here.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
---|---|---|
480–481 | I would suggest avoiding this division (because division is slow and introduces rounding). And the magic numbers need some kind of explanation. How did you come up with them? So how about reformatting this as: if (NumBytes <= 4 * NumLoads) { // Loads are dword or smaller (on average). MaxNumLoads = 5; } else { // Loads are bigger than a dword (on average). MaxNumLoads = 4; } ... plus some explanation of where the numbers 5 and 4 came from, and why they should be different cases. |
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp | ||
---|---|---|
480–481 | Your suggestions make sense to me. And regarding those magic number 4 and 5, there is no analytical thinking here. It is purely based on experimentation to achieve below three goals. (1) LLVM lit regressions should be as minimal as possible After experimentation with different magic numbers, and different heuristics, this is what I could settle down with. |
Seems reasonable to me, but I would be happier if at least one other reviewer gave an explicit approval before this is committed.
Rebased to upstream master to make sure Harbormaster build is fine before commiting this patch.
Unit test failures in the latest Harbormaster build are spurious - llvm.amdgcn.image.nsa.ll actually passes in my local repo build, and other three failures are unsupported cases.
Looks like this breaks check-llvm: http://45.33.8.238/linux/19836/step_12.txt
Please take a look and revert if it takes a while to fix.
Hi @thakis
This is surprizing, in my local repo the tests pass, but, it fails in pre-checkin build. I have locally built the patch from scratch with upstream rebased repo, and tested it as below:
mahesha@brego:[AMDGPU]$ ~/ROCm/github/llvm/rel-build/bin/llvm-lit -vv cluster_stores.ll llvm.amdgcn.image.nsa.ll unigine-liveness-crash.ll wqm.ll -- Testing: 4 tests, 4 workers -- UNSUPPORTED: LLVM :: CodeGen/AMDGPU/cluster_stores.ll (1 of 4) PASS: LLVM :: CodeGen/AMDGPU/unigine-liveness-crash.ll (2 of 4) PASS: LLVM :: CodeGen/AMDGPU/llvm.amdgcn.image.nsa.ll (3 of 4) PASS: LLVM :: CodeGen/AMDGPU/wqm.ll (4 of 4) Testing Time: 0.31s Unsupported: 1 Passed : 3 mahesha@brego:[AMDGPU]$
Looks like this breaks check-llvm: http://45.33.8.238/linux/19836/step_12.txt
@hsmhsm you can reproduce the failures if you run the tests under valgrind with llvm-lit --vg. Adding an assert(Width > 0) just after MachineScheduler calls getMemOperandsWithOffsetWidth would make the failure even more obvious.
The problem is: first you committed D80946 which returns Width from all cases in SIInstrInfo::getMemOperandsWithOffsetWidth. Then I committed D74035, which was written earlier, which adds a new isMIMG case but does not set Width.
Hi @foad,
Ok, this is interesting. Anyway, so, we need to make sure that SIInstrInfo::getMemOperandsWithOffsetWidth() will compute width for all cases. So, for the new case, that you have added, will you going to sumbit a patch which make sure that width is computed for this new case as well?
Hi @foad,
Ok, this is interesting. Anyway, so, we need to make sure that SIInstrInfo::getMemOperandsWithOffsetWidth() will compute width for all cases. So, for the new case, that you have added, will you going to sumbit a patch which make sure that width is computed for this new case as well?
I would suggest avoiding this division (because division is slow and introduces rounding). And the magic numbers need some kind of explanation. How did you come up with them?
So how about reformatting this as:
... plus some explanation of where the numbers 5 and 4 came from, and why they should be different cases.