This is an archive of the discontinued LLVM Phabricator instance.

Scalarization for global uniform loads
ClosedPublic

Authored by alex-t on Nov 21 2016, 8:55 AM.

Details

Summary

LC can currently select scalar load for uniform memory access basing on readonly memory address space only. This restriction originated from the fact that in HW prior to VI vector and scalar caches are not coherent. With MemoryDependenceAnalysis we can check that the memory location corresponding to the memory operand of the LOAD is not clobbered along the all paths from the function entry.

Diff Detail

Event Timeline

alex-t updated this revision to Diff 78729.Nov 21 2016, 8:55 AM
alex-t retitled this revision from to Scalarization for global uniform loads.
alex-t updated this object.

Related patch that takes a slightly different approach: https://reviews.llvm.org/D19493

rampitec edited edge metadata.Nov 21 2016, 9:59 AM

This is not valid to run on just any function. A value shall not be written to memory starting from the kernel. We currently inline everything, but when we have calls that will be an error.

rampitec added inline comments.Nov 21 2016, 10:50 AM
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
173

What is wrong with Def?

lib/Target/AMDGPU/SIISelLowering.cpp
535

I see it already exists in SITargetLowering::isMemOpUniform(), but we need a better way to identify a kernarg than UndefValue. Especially because later the very same logic will be used for user's non-inlined functions.

536

I do not follow the logic. Why do GlobalValue and Constant pointers are always no clobber?

lib/Target/AMDGPU/SMInstructions.td
227

It also has to be uniform.

test/CodeGen/AMDGPU/global_smrd.ll
2

Please add -verify-machineinstrs

alex-t updated this revision to Diff 78872.Nov 22 2016, 7:32 AM
alex-t edited edge metadata.

fixes according the reviewers requests

alex-t added inline comments.Nov 22 2016, 7:33 AM
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
173

Since I could not invent the example where the instructions that defines pointer may be of vector (non-uniform) kind I just deleted "isDef()" as you suggested.

lib/Target/AMDGPU/SIISelLowering.cpp
535

I don't understand this either. From my observation "kernarg" is always "llvm::Argument". The code itself obviously copied from the SITargetLowering::isMemOpUniform()

BTW, what check would you suggest for specifically kernel arguments?

536

This is not a "logic" - just "copy-paste" )

I prefer the approach of changing the address space of the pointer, rather than adding an additional metadata node that the backend needs to check.

Also, this needs more tests. You can borrow the ones from the patch I mentioned earlier.

Related patch that takes a slightly different approach: https://reviews.llvm.org/D19493

Tom, your patch is cool. The only thing I don't like about it is the fact that you have to change address space of "not-clobberable" pointers. I cannot take into account all possible passes that may (or may not) leverage on the correct (unchanged) address space. Let's imagine that further somebody invent a very cool optimization that is legal for distinctly read only but not legal for the global (even not-clobberable).

The problem with this patch is that I have to change a huge amount of tests.
I looked into the several failed lit tests.

The reason is as follows:

  1. most of the tests are intended to be as simple as possible that's why they don't use divergent intrinsic if this is not necessary for the test. As a result they use uniform loads to retrieve the data.
  2. any arithmetic instructions taking this uniform data as operands become uniform as well. Since we use ISel to deduce the scalar/vector form of operation, we'll have most of the instruction flow scalar.

For example this simple input

%b_ptr = getelementptr <2 x i32>, <2 x i32> addrspace(1)* %in, i32 1
%a = load <2 x i32>, <2 x i32> addrspace(1) * %in
%b = load <2 x i32>, <2 x i32> addrspace(1) * %b_ptr
%result = and <2 x i32> %a, %b
store <2 x i32> %result, <2 x i32> addrspace(1)* %out

will produce mostly scalar flow:

s_load_dwordx2 s[2:3], s[4:5], 0x8
s_load_dwordx2 s[0:1], s[4:5], 0x0
s_nop 0
s_waitcnt lgkmcnt(0)
s_load_dwordx2 s[4:5], s[2:3], 0x0
v_mov_b32_e32 v3, s1
s_load_dwordx2 s[2:3], s[2:3], 0x8
v_mov_b32_e32 v2, s0
s_waitcnt lgkmcnt(0)
s_and_b32 s3, s5, s3
s_and_b32 s2, s4, s2
v_mov_b32_e32 v0, s2
v_mov_b32_e32 v1, s3
flat_store_dwordx2 v[2:3], v[0:1]

approach of changing the address space of the pointer, rather than adding an additional metadata node that the backend needs to check.

I just meant that the latter adds information to IR but the former loses information from IR.

I prefer the approach of changing the address space of the pointer, rather than adding an additional metadata node that the backend needs to check.

The problem with address space cast from global to constant is that it is against memory model. We have adopted HSA memory model and constant does not alias with global. In fact it does not alias even with flat.

This seems right to me, but it shall only run on kernel functions.

lib/Target/AMDGPU/SIISelLowering.cpp
535

Since it already exists let's keep it for now. I was thinking about a PseudoSourceValue.

I prefer the approach of changing the address space of the pointer, rather than adding an additional metadata node that the backend needs to check.

The problem with address space cast from global to constant is that it is against memory model. We have adopted HSA memory model and constant does not alias with global. In fact it does not alias even with flat.

Ok, then I don't have any objections to this approach.

lib/Target/AMDGPU/SIISelLowering.cpp
536

Do we actually need all this code here? Isn't it enough just to check for the metadata?

alex-t marked 3 inline comments as done.Nov 30 2016, 6:44 AM

I prefer the approach of changing the address space of the pointer, rather than adding an additional metadata node that the backend needs to check.

The problem with address space cast from global to constant is that it is against memory model. We have adopted HSA memory model and constant does not alias with global. In fact it does not alias even with flat.

Ok, then I don't have any objections to this approach.

There is one serious drawback in my approach: metadata cannot be set on Argument. So even trivial example like this "load i32, i32 addrspace(1)* %arg" won't be scalarized. To pass any metadata to ISel I need Instruction (i.e. GEP). So I'd have to transform

load i32, i32 addrspace(1)* %arg

to

%gep = getelementptr i32, i32 addrspace(1)* %arg, i32 0
load i32, i32 addrspace(1)* %gep

to set "noclobber" on GEP.

There is one serious drawback in my approach: metadata cannot be set on Argument. So even trivial example like this "load i32, i32 addrspace(1)* %arg" won't be scalarized. To pass any metadata to ISel I need Instruction (i.e. GEP). So I'd have to transform

load i32, i32 addrspace(1)* %arg

to

%gep = getelementptr i32, i32 addrspace(1)* %arg, i32 0
load i32, i32 addrspace(1)* %gep

to set "noclobber" on GEP.

I do not think this is an issue. We used this with HSAIL for a long time and seen no problems. Moreover, with call support the same will be needed for the uniformness metadata as well.

alex-t updated this revision to Diff 80098.EditedDec 2 2016, 10:55 AM

This is improved implementation of the global memory scalarization. It checks if the memory location is clobbered along the CFG to the Function boundary. This approach is restricted by the FunctionPass capabilities and is not allowed to go beyond the current Function. So we cannot check accesses to Module level variables outside the Function. That's why analysis is restricted to kernel only given that any function calls (when implemented) will be considered as clobbers for any memory location.
This implementation relies on the existing Divergence analysis and does not attempt to improve it's results.

Global loads scalarization is SWITCHED OFF by default.
To enable use: "-amdgpu-scalarize-global-loads=true" LLC option.

Further work is planned to improve current implementation.
Namely:
Constant expression as a pointer operand support
Caching the results of the DFG along the CFG for clobbering memory accesses to shorten the search path and improve compile time on the large CFGs.
There is no currently any DFS depth limit since I had relevant experience in HSAIL backend and did not observe serious compile time impact even on large source files. If somebody feel it is necessary I can add depth limitation.

rampitec added inline comments.Dec 2 2016, 2:27 PM
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
26

Includes should be alphabetically sorted.

82

Could you please move brace in line with the for loop or remove braces?

96

Could you please capitalize names "checklist" and "load"? Same for other variables.

100

Would be nice to have spacing around "*" consistent.

113

Can you avoid const_cast and use const_iterator?

116

Inconsistent indent.

141

Inconsistent indent.

147

I would suggest checking for kernel only once in runOnFunction.

152

I suppose this condition can never happen, as you have replaced all uses of this Ptr below in the else block.

rampitec added inline comments.Dec 2 2016, 2:33 PM
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
152

Please disregard this comment, the code is indeed reachable.

alex-t updated this revision to Diff 80260.Dec 5 2016, 5:47 AM
alex-t marked 7 inline comments as done.

Style fixed

lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
26

Which symbol of the full file path should be used as a sorting key?

113

As you probably know there is no consistent strategy regarding "const" in LLVM. MemoryDependenceAnalysis::getSimplePointerDependencyFrom, as welll as other MDA interface methods, accepts non-const iterator while it does not change anything. That's why the only way is to make all the parameters and methods in whole call stack non-const. I removed "const" modifier every where along the code. No const_casts any longer.
Also I changed getSimplePointerDependencyFrom to getPointerDependencyFrom that queries invariant.load metadata stuff and thus potentially provides better alias granularity.

rampitec added inline comments.Dec 5 2016, 9:49 AM
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
26

You did it right now ;) The key is a full string.

183

That is really better just to return right here if it is not kernel.

alex-t added inline comments.Dec 5 2016, 11:01 AM
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
183

Really? What's about the loads from the readonly memory? Aren't they still valid even in non-kernel?

rampitec added inline comments.Dec 5 2016, 11:08 AM
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
183

OK, I see your point. But then check for isKernelFunc first in the condition before doing the expensive DFS.

alex-t updated this revision to Diff 80302.Dec 5 2016, 11:46 AM
alex-t marked an inline comment as done.
rampitec accepted this revision.Dec 5 2016, 12:02 PM
rampitec edited edge metadata.

LGTM

This revision is now accepted and ready to land.Dec 5 2016, 12:02 PM
arsenm added inline comments.Dec 5 2016, 3:31 PM
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
20–21

Alphabetize

37

*LI

88–93

C++ style comments

lib/Target/AMDGPU/SIISelLowering.cpp
2620

Previous line

test/CodeGen/AMDGPU/global_smrd_cfg.ll
2

You don't need the -O2 since that's the default.

Can you also change the check prefixes to GCN, and also run instnamer (same for the other tests)

arsenm added inline comments.Dec 5 2016, 3:32 PM
lib/Target/AMDGPU/AMDGPUAnnotateUniformValues.cpp
136–137

Weird formatting

150–151

else on same line as previous }

153–154

Asterisks to right

alex-t updated this revision to Diff 80390.Dec 6 2016, 2:23 AM
alex-t edited edge metadata.
This revision was automatically updated to reflect the committed changes.