This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: In determining load clobbering in AnnotateUniform, don't scan if there are too many blocks.
AbandonedPublic

Authored by cfang on Jul 29 2020, 10:18 AM.

Details

Reviewers
arsenm
rampitec
Summary

The algorithm to find load clobbering in function is in the order of O^2.
The compilation becomes very slow if there are too many blocks ( ~3000).
To limit the compile time, we introduce a threshold (default 2500) of the
number of basic blocks.

Diff Detail

Event Timeline

cfang created this revision.Jul 29 2020, 10:18 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 29 2020, 10:18 AM
cfang requested review of this revision.Jul 29 2020, 10:18 AM

Does MemorySSA have the same problem? Could we just switch this to use MemorySSA?

arsenm added inline comments.Jul 29 2020, 12:01 PM
llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
145–146

The logic here should be fixed first. This is checking if the load was clobbered, before the trivial check for isGlobalLoad. The expensive check should be reordered last

145–146

Actually it can go even deep,r under the isa<Argument> || GlobalValue check

Does MemorySSA have the same problem? Could we just switch this to use MemorySSA?

Disclaimer: I know nothing about this pass or the purpose of this patch, just trying to answer this question.
MemorySSA has its own internal threshold limiting the number of memory instructions that are traversed upwards. It does not care at how many blocks those memory instructions are spread over.

cfang marked 2 inline comments as done.Jul 29 2020, 2:41 PM
cfang added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
145–146

Done in https://reviews.llvm.org/D84890.
Actually we have to do the expensive function call for the case PitI != NULL anyway.
So it won't resolve the issue we encountered (this this current patch is still needed).

cfang marked an inline comment as done.Jul 29 2020, 2:49 PM

Does MemorySSA have the same problem? Could we just switch this to use MemorySSA?

Disclaimer: I know nothing about this pass or the purpose of this patch, just trying to answer this question.
MemorySSA has its own internal threshold limiting the number of memory instructions that are traversed upwards. It does not care at how many blocks those memory instructions are spread over.

Thanks for the comments! I see that in MemorySSA, it scans 100 memory instructions upwards to find whether it is clobbered.
In our case, we essentially check every basic block, and a max of also 100 instructions in each block to find the pointer dependence:
MDR->getPointerDependencyFrom(MemoryLocation(Ptr), true, StartIt, BB, Load);
At this moment, I am not clear how can we use the existing functionality in MemorySSA for our purpose.

How much improvement does D84890 give vs. this?

cfang added a comment.Jul 30 2020, 3:21 PM

How much improvement does D84890 give vs. this?

For the test case I have, D84890 did not show a measurable improvement,
while this can reduce the time from 2m30s to 13 seconds.

arsenm added inline comments.Jul 30 2020, 3:41 PM
llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
118–126

This seems like a pretty stupid way of using this analysis. This is going to be re-scanning the same instructions many times. My quick look at MemoryDependenceAnalysis suggests the way you should use it is to use a combination of getDependency and getNonLocalPointeDependency, which has a cache and internally calls getPointerDependencyFrom. You would then have to walk up the chain of dependencies until you find no clobbers?

cfang added inline comments.Aug 4 2020, 4:02 PM
llvm/lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
118–126

You are right. We need to come out with a better memory dependence analysis algorithm to avoid redundant searching. Before that, we should live with the current approach, which is a correct one.
As a result, we have to restrict the search to control the compile time.

cfang added a comment.Oct 27 2020, 2:10 PM

Ping!

Should we commit this patch to fix the compilation time for now? Then we may look at the possibility to replace
MemoryDependenceAnaysis in AnnotateUniform pass?

Ping!

Should we commit this patch to fix the compilation time for now? Then we may look at the possibility to replace
MemoryDependenceAnaysis in AnnotateUniform pass?

This doesn't sound like a commitment to me

cfang added a comment.Dec 1 2020, 4:03 PM

Ping!

Should we commit this patch to fix the compilation time for now? Then we may look at the possibility to replace
MemoryDependenceAnaysis in AnnotateUniform pass?

This doesn't sound like a commitment to me

Do you mean we need to open a bug (new task) to redesign load clobbering in AnnotateUniform pass?
Given the current implementation, I think this proposal is an effective cut-off to an expensive searching (without caching).

arsenm added a comment.Dec 2 2020, 7:41 PM

Ping!

Should we commit this patch to fix the compilation time for now? Then we may look at the possibility to replace
MemoryDependenceAnaysis in AnnotateUniform pass?

This doesn't sound like a commitment to me

Do you mean we need to open a bug (new task) to redesign load clobbering in AnnotateUniform pass?
Given the current implementation, I think this proposal is an effective cut-off to an expensive searching (without caching).

Yes, I don't like that this is just putting off a real fix

cfang abandoned this revision.Jan 12 2021, 2:44 PM

The issue has been workarounded by https://reviews.llvm.org/D94107
So abandon this one.