LLC can currently select a scalar load for uniform memory access based only on the read-only memory address space. This restriction originated from the fact that in hardware prior to VI the vector and scalar caches are not coherent. With MemoryDependenceAnalysis we can check that the memory location corresponding to the memory operand of the load is not clobbered along any path from the function entry.
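As a sketch of the kind of case this enables (my own example, not taken from the patch's tests): a uniform load through a global (addrspace(1)) kernel-argument pointer that is never stored to inside the kernel could be selected as a scalar load (s_load_dword) instead of a flat/global vector load:

```llvm
define amdgpu_kernel void @uniform_load(i32 addrspace(1)* %in, i32 addrspace(1)* %out) {
  ; %in is uniform across the wave, and no store in this kernel clobbers it,
  ; so the load may be selected as s_load_dword rather than flat_load_dword.
  %v = load i32, i32 addrspace(1)* %in
  store i32 %v, i32 addrspace(1)* %out
  ret void
}
```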
Diff Detail
Repository: rL LLVM
Event Timeline
Related patch that takes a slightly different approach: https://reviews.llvm.org/D19493
This is not valid to run on just any function. The value must not be written to memory anywhere starting from the kernel entry. We currently inline everything, but once we have calls this will be an error.
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
94 ↗ (On Diff #78729): What is wrong with Def?

lib/Target/AMDGPU/SIISelLowering.cpp
535 ↗ (On Diff #78729): I see it already exists in SITargetLowering::isMemOpUniform(), but we need a better way to identify a kernarg than UndefValue, especially because later the very same logic will be used for users' non-inlined functions.
536 ↗ (On Diff #78729): I do not follow the logic. Why are GlobalValue and Constant pointers always considered no-clobber?

lib/Target/AMDGPU/SMInstructions.td
226 ↗ (On Diff #78729): It also has to be uniform.

test/CodeGen/AMDGPU/global_smrd.ll
1 ↗ (On Diff #78729): Please add -verify-machineinstrs
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
94 ↗ (On Diff #78729): Since I could not come up with an example where the instruction that defines the pointer may be of vector (non-uniform) kind, I just deleted isDef() as you suggested.

lib/Target/AMDGPU/SIISelLowering.cpp
535 ↗ (On Diff #78729): I don't understand this either. From my observation a kernarg is always an llvm::Argument. The code itself is obviously copied from SITargetLowering::isMemOpUniform(). BTW, what check would you suggest for specifically kernel arguments?
536 ↗ (On Diff #78729): This is not "logic", just copy-paste. :)
I prefer the approach of changing the address space of the pointer, rather than adding an additional metadata node that the backend needs to check.
Also, this needs more tests. You can borrow the ones from the patch I mentioned earlier.
Tom, your patch is cool. The only thing I don't like about it is that you have to change the address space of "not-clobberable" pointers. I cannot take into account all possible passes that may (or may not) rely on the correct (unchanged) address space. Let's imagine that somebody later invents a very cool optimization that is legal for distinctly read-only memory but not legal for global memory (even not-clobberable global memory).
The problem with this patch is that I have to change a huge amount of tests.
I looked into several of the failed lit tests.
The reason is as follows:
- Most of the tests are intended to be as simple as possible; that is why they don't use a divergent intrinsic unless it is necessary for the test. As a result they use uniform loads to retrieve the data.
- Any arithmetic instructions taking this uniform data as operands become uniform as well. Since we use ISel to deduce the scalar/vector form of an operation, we end up with most of the instruction flow scalar.
For example, this simple input:

%b_ptr = getelementptr <2 x i32>, <2 x i32> addrspace(1)* %in, i32 1
%a = load <2 x i32>, <2 x i32> addrspace(1)* %in
%b = load <2 x i32>, <2 x i32> addrspace(1)* %b_ptr
%result = and <2 x i32> %a, %b
store <2 x i32> %result, <2 x i32> addrspace(1)* %out
will produce mostly scalar flow:
s_load_dwordx2 s[2:3], s[4:5], 0x8
s_load_dwordx2 s[0:1], s[4:5], 0x0
s_nop 0
s_waitcnt lgkmcnt(0)
s_load_dwordx2 s[4:5], s[2:3], 0x0
v_mov_b32_e32 v3, s1
s_load_dwordx2 s[2:3], s[2:3], 0x8
v_mov_b32_e32 v2, s0
s_waitcnt lgkmcnt(0)
s_and_b32 s3, s5, s3
s_and_b32 s2, s4, s2
v_mov_b32_e32 v0, s2
v_mov_b32_e32 v1, s3
flat_store_dwordx2 v[2:3], v[0:1]
> I prefer the approach of changing the address space of the pointer, rather than adding an additional metadata node that the backend needs to check.
I just meant that the latter adds information to IR but the former loses information from IR.
The problem with an address space cast from global to constant is that it is against the memory model. We have adopted the HSA memory model, and constant does not alias with global. In fact, it does not even alias with flat.
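For illustration (my own sketch, assuming the amdgcn address-space numbering of the time, where constant was addrspace(2)): after such a cast, alias analysis would be entitled to assume the two accesses below cannot alias and reorder them freely, even though they touch the same memory.

```llvm
; Casting a global pointer to constant violates the adopted HSA model:
; constant is assumed not to alias global (or even flat), so AA may
; freely reorder the store relative to the load through %c.
%c = addrspacecast i32 addrspace(1)* %p to i32 addrspace(2)*
store i32 1, i32 addrspace(1)* %p
%v = load i32, i32 addrspace(2)* %c
```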
This seems right to me, but it should only run on kernel functions.
lib/Target/AMDGPU/SIISelLowering.cpp
535 ↗ (On Diff #78729): Since it already exists, let's keep it for now. I was thinking about a PseudoSourceValue.
Ok, then I don't have any objections to this approach.
lib/Target/AMDGPU/SIISelLowering.cpp
536 ↗ (On Diff #78729): Do we actually need all this code here? Isn't it enough just to check for the metadata?
There is one serious drawback in my approach: metadata cannot be set on an Argument. So even a trivial example like "load i32, i32 addrspace(1)* %arg" won't be scalarized. To pass any metadata to ISel I need an Instruction (i.e. a GEP). So I'd have to transform
load i32, i32 addrspace(1)* %arg
to
%gep = getelementptr i32, i32 addrspace(1)* %arg, i32 0
load i32, i32 addrspace(1)* %gep
to set "noclobber" on GEP.
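Concretely, the rewritten IR would look something like the following (the metadata spelling !amdgpu.noclobber is my assumption of what the pass attaches; the zero-index GEP exists purely to provide an Instruction to carry it):

```llvm
; Metadata cannot hang off the Argument %arg itself, so the pass inserts
; a no-op GEP (offset 0) and attaches the no-clobber marker to it.
%gep = getelementptr i32, i32 addrspace(1)* %arg, i32 0, !amdgpu.noclobber !0
%val = load i32, i32 addrspace(1)* %gep

!0 = !{}
```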
I do not think this is an issue. We used this approach in HSAIL for a long time and saw no problems. Moreover, with call support the same will be needed for the uniformity metadata as well.
This is an improved implementation of global memory scalarization. It checks whether the memory location is clobbered along the CFG up to the function boundary. This approach is restricted by the FunctionPass capabilities and is not allowed to go beyond the current function, so we cannot check accesses to module-level variables outside the function. That is why the analysis is restricted to kernels only, given that any function calls (when implemented) will be considered clobbers for any memory location.
This implementation relies on the existing divergence analysis and does not attempt to improve its results.
Global load scalarization is SWITCHED OFF by default.
To enable it, use the "-amdgpu-scalarize-global-loads=true" LLC option.
Further work is planned to improve the current implementation, namely:
- Support for constant expressions as pointer operands.
- Caching the results of the DFS along the CFG for clobbering memory accesses, to shorten the search path and improve compile time on large CFGs.
There is currently no DFS depth limit, since I have relevant experience from the HSAIL backend and did not observe serious compile-time impact even on large source files. If somebody feels it is necessary, I can add a depth limitation.
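To illustrate what the DFS over the CFG decides (a hypothetical example of my own, not from the patch's tests): a load is a no-clobber candidate only if no store to its location reaches it along any path from the kernel entry.

```llvm
define amdgpu_kernel void @example(i32 addrspace(1)* %in, i32 addrspace(1)* %out, i1 %cond) {
entry:
  ; No store can reach this load from the entry: a no-clobber candidate,
  ; eligible for scalar (s_load) selection.
  %a = load i32, i32 addrspace(1)* %in
  br i1 %cond, label %then, label %merge

then:
  store i32 0, i32 addrspace(1)* %in   ; clobbers %in on this path
  br label %merge

merge:
  ; A store to %in reaches this load along one path, so the DFS reports a
  ; clobber and the load must stay a vector (flat/global) load.
  %b = load i32, i32 addrspace(1)* %in
  %sum = add i32 %a, %b
  store i32 %sum, i32 addrspace(1)* %out
  ret void
}
```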
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
25 ↗ (On Diff #80098): Includes should be alphabetically sorted.
81 ↗ (On Diff #80098): Could you please move the brace in line with the for loop, or remove the braces?
95 ↗ (On Diff #80098): Could you please capitalize the names "checklist" and "load"? Same for other variables.
99 ↗ (On Diff #80098): It would be nice to have the spacing around "*" consistent.
112 ↗ (On Diff #80098): Can you avoid const_cast and use const_iterator?
115 ↗ (On Diff #80098): Inconsistent indent.
143 ↗ (On Diff #80098): Inconsistent indent.
149 ↗ (On Diff #80098): I would suggest checking for kernel only once, in runOnFunction.
154 ↗ (On Diff #80098): I suppose this condition can never happen, as you have replaced all uses of this Ptr below in the else block.
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
154 ↗ (On Diff #80098): Please disregard this comment, the code is indeed reachable.
Style fixed
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
25 ↗ (On Diff #80098): Which symbol of the full file path should be used as the sorting key?
112 ↗ (On Diff #80098): As you probably know, there is no consistent strategy regarding "const" in LLVM. MemoryDependenceAnalysis::getSimplePointerDependencyFrom, as well as other MDA interface methods, accepts a non-const iterator even though it does not change anything. That's why the only way is to make all the parameters and methods in the whole call stack non-const. I removed the "const" modifier everywhere along the code; no const_casts any longer.
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
183 ↗ (On Diff #80260): Really? What about loads from read-only memory? Aren't they still valid even in a non-kernel function?
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
183 ↗ (On Diff #80260): OK, I see your point. But then check for isKernelFunc first in the condition, before doing the expensive DFS.
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
20–21 ↗ (On Diff #80302): Alphabetize
37 ↗ (On Diff #80302): *LI
88–93 ↗ (On Diff #80302): C++ style comments

lib/Target/AMDGPU/SIISelLowering.cpp
2620 ↗ (On Diff #80302): Previous line

test/CodeGen/AMDGPU/global_smrd_cfg.ll
1 ↗ (On Diff #80302): You don't need the -O2 since that's the default. Can you also change the check prefixes to GCN, and also run instnamer (same for the other tests).