The algorithm that finds load clobbering in a function is quadratic, on the order of O(N^2).
Compilation becomes very slow if there are too many basic blocks (~3000).
To limit the compile time, we introduce a threshold (default 2500) of the
number of basic blocks.
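A minimal sketch of the cut-off idea, written against hypothetical stand-in types rather than the real LLVM classes. The names `Block`, `Function`, `findClobber`, and `kDefaultBlockThreshold` are assumptions for illustration only, and returning a conservative "unknown" answer above the threshold is likewise an assumption about how such a limit would behave, not a description of the patch itself.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for IR types; these are not LLVM APIs.
struct Block {
  bool writesMemory = false;
};
struct Function {
  std::vector<Block> blocks;
};

// Default taken from the patch description (2500 basic blocks).
constexpr std::size_t kDefaultBlockThreshold = 2500;

enum class ClobberResult { NotClobbered, Clobbered, Unknown };

// Skip the expensive scan entirely when the function is too large.
// Callers are assumed to treat Unknown the same as Clobbered.
ClobberResult findClobber(const Function &F,
                          std::size_t Threshold = kDefaultBlockThreshold) {
  if (F.blocks.size() > Threshold)
    return ClobberResult::Unknown;
  for (const Block &B : F.blocks)
    if (B.writesMemory)
      return ClobberResult::Clobbered;
  return ClobberResult::NotClobbered;
}
```

The trade-off is that large functions lose the optimization the analysis would have enabled, in exchange for bounded compile time.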
llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp | ||
---|---|---|
145–146 | The logic here should be fixed first. This is checking if the load was clobbered, before the trivial check for isGlobalLoad. The expensive check should be reordered last. | |
145–146 | Actually it can go even deeper, under the isa<Argument> || GlobalValue check |
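The reordering suggested in these comments can be sketched as follows. This is an illustration with hypothetical types (`LoadInfo`, `Stats`, `canMarkNoClobber`), not the pass's actual code, and the exact conditions in the real pass may differ. The point is only that the cheap predicates are evaluated first, so the expensive clobber walk runs solely for loads that pass them.

```cpp
// Hypothetical model of a load; not an LLVM type.
struct LoadInfo {
  bool isGlobalLoad;          // cheap address-space check
  bool ptrIsArgumentOrGlobal; // cheap isa<Argument> || isa<GlobalValue> check
  bool clobbered;             // what the expensive walk would discover
};

struct Stats {
  int expensiveWalks = 0; // counts how often the costly analysis runs
};

// Stand-in for the expensive clobber search.
bool isClobbered(const LoadInfo &L, Stats &S) {
  ++S.expensiveWalks;
  return L.clobbered;
}

// Cheap checks first; the expensive check is reached last.
bool canMarkNoClobber(const LoadInfo &L, Stats &S) {
  if (!L.isGlobalLoad)
    return false;
  if (!L.ptrIsArgumentOrGlobal)
    return false;
  return !isClobbered(L, S);
}
```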
Disclaimer: I know nothing about this pass or the purpose of this patch, just trying to answer this question.
MemorySSA has its own internal threshold limiting the number of memory instructions that are traversed upwards. It does not care how many blocks those memory instructions are spread over.
llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp | ||
---|---|---|
145–146 | Done in https://reviews.llvm.org/D84890. |
Thanks for the comments! I see that MemorySSA scans up to 100 memory instructions upwards to find whether a load is clobbered.
In our case, we essentially check every basic block, and also up to 100 instructions in each block, to find the pointer dependence:
MDR->getPointerDependencyFrom(MemoryLocation(Ptr), true, StartIt, BB, Load);
At the moment, it is not clear to me how we can use the existing functionality in MemorySSA for our purpose.
For the test case I have, D84890 did not show a measurable improvement,
while this patch reduces the compile time from 2m30s to 13 seconds.
llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp | ||
---|---|---|
118–126 | This seems like a pretty stupid way of using this analysis. This is going to be re-scanning the same instructions many times. My quick look at MemoryDependenceAnalysis suggests the way you should use it is to use a combination of getDependency and getNonLocalPointerDependency, which has a cache and internally calls getPointerDependencyFrom. You would then have to walk up the chain of dependencies until you find no clobbers? |
llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp | ||
---|---|---|
118–126 | You are right. We need to come up with a better memory dependence analysis algorithm to avoid redundant searching. Until then, we should live with the current approach, which is correct. |
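To illustrate why a cache removes the redundant searching discussed above, here is a small standalone sketch. It does not use MemoryDependenceAnalysis; `BlockScanner` and its members are hypothetical, and the model is simplified to a single pointer whose per-block answer never changes. Each block is scanned at most once no matter how many loads ask about it.

```cpp
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical model: blockWrites[i] says whether block i holds a
// clobbering write for the pointer of interest.
struct BlockScanner {
  std::vector<bool> blockWrites;
  std::unordered_map<std::size_t, bool> cache; // block index -> cached answer
  std::size_t scans = 0;                       // number of real scans performed

  // Scan a block only if its answer is not cached yet.
  bool blockClobbers(std::size_t Idx) {
    auto It = cache.find(Idx);
    if (It != cache.end())
      return It->second;
    ++scans; // stands in for walking the block's instructions
    bool Result = blockWrites[Idx];
    cache.emplace(Idx, Result);
    return Result;
  }

  // A load is clobbered if any block it depends on clobbers the pointer.
  bool anyClobber(const std::vector<std::size_t> &Blocks) {
    for (std::size_t Idx : Blocks)
      if (blockClobbers(Idx))
        return true;
    return false;
  }
};
```

With this shape, two loads that reach the same three blocks cost three scans in total rather than six. A real implementation would also have to key the cache on the memory location and invalidate it when the IR changes.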
Ping!
Should we commit this patch to fix the compilation time for now? Then we can look at the possibility of replacing
MemoryDependenceAnalysis in the AnnotateUniform pass.
Do you mean we need to open a bug (new task) to redesign load clobbering in the AnnotateUniform pass?
Given the current implementation, I think this proposal is an effective cut-off for an expensive search (without caching).
The issue has been worked around by https://reviews.llvm.org/D94107
So I am abandoning this one.