This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU/MemOpsCluster] Clean-up fixme's around mem ops clustering logic
ClosedPublic

Authored by hsmhsm on Jul 22 2020, 12:42 PM.

Details

Summary

Get rid of all fixmes and base heuristic on num-clustered-dwords. The main intuition behind this is as
follows. The existing heuristic roughly summarizes as below:

  • Assume, all the mem ops instructions participating in the clustering process, loads/stores same num bytes
  • If num bytes loaded by each mem op is 4 bytes, then cluster at max 5 mem ops, that is at max 20 bytes
  • If num bytes loaded by each mem op is 8 bytes, then cluster at max 3 mem ops, that is at max 24 bytes
  • If num bytes loaded by each mem op is 16 bytes, then cluster at max 2 mem ops, that is at max 32 bytes

So, we need to make sure that the new heuristic do not completey deviate away from the above one, and it
properly handles both the sub-word loads and the wide loads.

Diff Detail

Event Timeline

hsmhsm created this revision.Jul 22 2020, 12:42 PM
hsmhsm edited the summary of this revision. (Show Details)Jul 22 2020, 12:45 PM
arsenm added inline comments.Jul 22 2020, 4:15 PM
llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
477

Reads badly with the first comma

485–499

saying total max is redundant

486

drop is set

hsmhsm updated this revision to Diff 280026.Jul 22 2020, 9:57 PM

Take care of review comments by Matt.

arsenm accepted this revision.Jul 23 2020, 1:18 PM

Test changes don't seem obviously worse

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
477

s/cannot/should not/, this isn't mandatory

This revision is now accepted and ready to land.Jul 23 2020, 1:18 PM
hsmhsm updated this revision to Diff 280339.Jul 23 2020, 11:52 PM

Fixed second review comment by Matt (though he said it is not mandatory)

rampitec requested changes to this revision.Jul 28 2020, 4:21 PM

This is fine in principle, but will create too big clusters for sub-dword loads. In case of byte loads you could cluster 32 instructions and that will for sure stall because you would overflow a 4-bit counter. Moreover, a sub-dword load will likely occupy a whole VGPR anyway, so in that scenario you will even consume 32 registers, while you are looking into consuming not more than 8.

In addition to the byte size you also need to limit a number of loads. Our previous experience showed visible benefits up to 5-6 loads. Also note the case of 16 bit loads. Assume you have limited num-loads to 6. Now you would likely need 12 registers to hold it which is still more than 8 you had in mind. How about this:

LoadSize = NumBytes / NumLoads;
NumDWORDs = (LoadSize + 3) / 4;
return NumDWORDs <= 8;

You would still have the same logic for wide loads but limit it to 8 loads for sub-dword case.

This revision now requires changes to proceed.Jul 28 2020, 4:21 PM

This is fine in principle, but will create too big clusters for sub-dword loads. In case of byte loads you could cluster 32 instructions and that will for sure stall because you would overflow a 4-bit counter. Moreover, a sub-dword load will likely occupy a whole VGPR anyway, so in that scenario you will even consume 32 registers, while you are looking into consuming not more than 8.

In addition to the byte size you also need to limit a number of loads. Our previous experience showed visible benefits up to 5-6 loads. Also note the case of 16 bit loads. Assume you have limited num-loads to 6. Now you would likely need 12 registers to hold it which is still more than 8 you had in mind. How about this:

LoadSize = NumBytes / NumLoads;
NumDWORDs = (LoadSize + 3) / 4;
return NumDWORDs <= 8;

You would still have the same logic for wide loads but limit it to 8 loads for sub-dword case.

Thanks for the further insights here, yes, I agree with you. However, I think, the proposed above heuristic still have some problems. Let's take below example. Let's assume there are N load operations, each loads 16 bytes, are put into clustering process. Then below is the scenario:

NumLoadsNumBytesLoadSizeNumDWORDsNumDWORDs <= 8Decision
2321644 <= 8cluster
3481644 <= 8cluster
4641644 <= 8cluster
N64 * N1644 <= 8cluster

So, we land-up clustering all the N loads. Is it that you meant NumDWORDs = ((LoadSize + 3) / 4) * NumLoads instead of NumDWORDs = (LoadSize + 3) / 4?

This is fine in principle, but will create too big clusters for sub-dword loads. In case of byte loads you could cluster 32 instructions and that will for sure stall because you would overflow a 4-bit counter. Moreover, a sub-dword load will likely occupy a whole VGPR anyway, so in that scenario you will even consume 32 registers, while you are looking into consuming not more than 8.

In addition to the byte size you also need to limit a number of loads. Our previous experience showed visible benefits up to 5-6 loads. Also note the case of 16 bit loads. Assume you have limited num-loads to 6. Now you would likely need 12 registers to hold it which is still more than 8 you had in mind. How about this:

LoadSize = NumBytes / NumLoads;
NumDWORDs = (LoadSize + 3) / 4;
return NumDWORDs <= 8;

You would still have the same logic for wide loads but limit it to 8 loads for sub-dword case.

Thanks for the further insights here, yes, I agree with you. However, I think, the proposed above heuristic still have some problems. Let's take below example. Let's assume there are N load operations, each loads 16 bytes, are put into clustering process. Then below is the scenario:

NumLoadsNumBytesLoadSizeNumDWORDsNumDWORDs <= 8Decision
2321644 <= 8cluster
3481644 <= 8cluster
4641644 <= 8cluster
N64 * N1644 <= 8cluster

So, we land-up clustering all the N loads. Is it that you meant NumDWORDs = ((LoadSize + 3) / 4) * NumLoads instead of NumDWORDs = (LoadSize + 3) / 4?

Yep, you are right! My NumDWORDs is in a single load, it should be multiplied by the NumLoads.

hsmhsm updated this revision to Diff 281809.Jul 30 2020, 12:19 AM

Implemented new heuristic as suggested by Stas.

I tested the SHOC benchmark and rocSPARSE testsuite for this new heuristic as suggested by Stas. The numbers are almost same as in case of previous heuristic (NumBytes <= 32).

hsmhsm edited the summary of this revision. (Show Details)Jul 30 2020, 12:30 AM
hsmhsm edited the summary of this revision. (Show Details)
hsmhsm edited the summary of this revision. (Show Details)Jul 30 2020, 12:32 AM
This revision is now accepted and ready to land.Jul 30 2020, 8:52 AM