Updated the annotate-kernel-features pass to support propagation of the uniform-work-group attribute. The attribute propagates top-down, from callers to callees. The pass maintains a list of nodes in increasing order of uses (caller to callee) and, for each node, propagates the uniform-work-group attribute from the calling function to its callees. After this pass, any function that does not have the attribute gets it added with a "false" value.
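To make the intended behaviour concrete, here is a minimal standalone sketch of the top-down rule described above. It is not the code from the patch: `ToyFunction`, `isUniform`, and `propagateTopDown` are hypothetical names, functions are modelled as plain structs instead of `llvm::Function`, and the edge list is assumed to be ordered so that every edge into a function appears before any edge out of it.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an IR function: a name plus string attributes.
struct ToyFunction {
  std::string Name;
  std::map<std::string, std::string> Attrs;
};

static const char *const UniformAttr = "uniform-work-group-size";

// True only when the attribute is present with the value "true".
bool isUniform(const ToyFunction &F) {
  auto It = F.Attrs.find(UniformAttr);
  return It != F.Attrs.end() && It->second == "true";
}

// Edges are (caller index, callee index). Assumption: all edges into a
// function are listed before any edge out of it (caller-first order).
void propagateTopDown(
    std::vector<ToyFunction> &Funcs,
    const std::vector<std::pair<std::size_t, std::size_t>> &Edges) {
  for (const auto &E : Edges) {
    assert(E.first < Funcs.size() && E.second < Funcs.size());
    ToyFunction &Caller = Funcs[E.first];
    ToyFunction &Callee = Funcs[E.second];
    if (!isUniform(Caller))
      Callee.Attrs[UniformAttr] = "false"; // A non-uniform caller always wins.
    else if (!Callee.Attrs.count(UniformAttr))
      Callee.Attrs[UniformAttr] = "true"; // Inherit only if not yet decided.
  }
  // Anything still unmarked ends up explicitly "false".
  for (ToyFunction &F : Funcs)
    if (!F.Attrs.count(UniformAttr))
      F.Attrs[UniformAttr] = "false";
}
```

With a uniform kernel calling `f`, and `f` calling `g`, both callees end up "true" in this model. Adding a second caller of `f` that lacks the attribute turns `f`, and then `g`, into "false".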
lib/Target/AMDGPU/AMDGPUCodeGenPrepare.cpp | ||
---|---|---|
807 ↗ | (On Diff #158825) | We cannot rely on function names, or introduce functions such as this. This is also not necessary for optimizing this |
A more correct way to optimize this would be to have a CallGraphSCC pass that propagates the uniform-work-group-size attribute to callees only reachable from kernels with uniform-work-group-size
lib/Target/AMDGPU/AMDGPUAnnotateKernelFeatures.cpp | ||
---|---|---|
280 | OS check isn't necessary | |
283–284 | I don't see how this prevents propagating the attribute if other callers do not have it | |
286–288 | Should only be looking for == "true" | |
lib/Target/AMDGPU/AMDGPUTargetMachine.cpp | ||
684 | Extra space at end | |
771–772 | Delete | |
test/CodeGen/AMDGPU/uniform-workgroup-test1.ll | ||
1 ↗ | (On Diff #164518) | Tests are missing run lines |
25–30 ↗ | (On Diff #164518) | This includes attributes added by the pass |
lib/Target/AMDGPU/AMDGPUAnnotateKernelFeatures.cpp | ||
---|---|---|
283–284 | By other callers, do you mean other kernel functions? |
Added support to ensure the attribute propagates from the caller to the function even within nested function calls.
Updated the patch to prevent propagating the attribute if other callers do not have it.
Updated Test 5 to show the update.
I think you need to split this into a separate loop before the propagate attributes function instead of adding the recursive call at the same time. This is different from the other attributes because it is inferred top down. You should have a first loop over the CallGraphSCC that adds this. Since the CallGraphSCC should have all of the nodes reachable from each other, this should be some set building / checks from there. You shouldn't need to be looking at the instructions inside the functions and looking for specific call sites
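One way to read the "set building / checks" suggestion is sketched below. This is only an illustration, not the reviewer's or the patch's implementation: the call graph is modelled as a `std::map` from function name to callee names, and `reachableFrom` and `uniformOnly` are hypothetical helpers. The walk uses an explicit worklist instead of recursion, so it also terminates when the call graph contains a cycle, and it never inspects instructions or call sites.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical call graph model: function name -> names of its callees.
using CallGraph = std::map<std::string, std::vector<std::string>>;

// Collect every function reachable from Roots (including the roots) using
// an explicit worklist, so recursive calls cannot cause unbounded recursion.
std::set<std::string> reachableFrom(const CallGraph &CG,
                                    const std::vector<std::string> &Roots) {
  std::set<std::string> Seen;
  std::vector<std::string> Worklist(Roots.begin(), Roots.end());
  while (!Worklist.empty()) {
    std::string F = Worklist.back();
    Worklist.pop_back();
    if (!Seen.insert(F).second)
      continue; // Already visited; this is what makes cycles terminate.
    auto It = CG.find(F);
    if (It == CG.end())
      continue; // No known callees.
    for (const std::string &Callee : It->second)
      Worklist.push_back(Callee);
  }
  return Seen;
}

// Functions reachable from uniform kernels and from no other kernel.
std::set<std::string> uniformOnly(const CallGraph &CG,
                                  const std::vector<std::string> &UniformKernels,
                                  const std::vector<std::string> &OtherKernels) {
  std::set<std::string> FromUniform = reachableFrom(CG, UniformKernels);
  std::set<std::string> FromOther = reachableFrom(CG, OtherKernels);
  std::set<std::string> Result;
  for (const std::string &F : FromUniform)
    if (!FromOther.count(F))
      Result.insert(F);
  assert(Result.size() <= FromUniform.size());
  return Result;
}
```

In this model a function shared between a uniform kernel and a non-uniform kernel is excluded from the result, which matches the requirement that the attribute must not propagate when another caller lacks it.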
lib/Target/AMDGPU/AMDGPUAnnotateKernelFeatures.cpp | ||
---|---|---|
227–228 | You don't need to check if it has the attribute and check the value == false. Just check the value == true | |
238 | This shouldn't use recursion. This will fail when there's a recursive call. It also shouldn't be necessary, because callees should be called before callers in an SCC pass | |
test/CodeGen/AMDGPU/uniform-work-group-test1.ll | ||
1 ↗ | (On Diff #166333) | GCN check prefix is usually used for ISA check lines. It doesn't really matter, but would require more changes if for some reason in the future an llc run line were added |
4–5 ↗ | (On Diff #166333) | Functions and attribute group variables could use better names (as well as the test file). It would also help to add a comment at the top of each test file for what case this is supposed to be |
test/CodeGen/AMDGPU/uniform-work-group-test5.ll | ||
1 ↗ | (On Diff #166333) | Triple is broken |
23 ↗ | (On Diff #166333) | This is accepted by the IR parser? I would remove all of these empty attribute groups |
lib/Target/AMDGPU/AMDGPUAnnotateKernelFeatures.cpp | ||
---|---|---|
83–85 | No globals | |
217 | I don't understand the point of anything this function is doing. You're just copying the same information that's already present in the SCC into a slightly different structure | |
220 | Range loop | |
223 | This leaks and should also be unnecessary | |
230 | std::make_pair | |
239 | Typo propagte | |
337 | Comment should be capitalized, have a space after // and be punctuated | |
test/CodeGen/AMDGPU/uniform-work-group-nested-function-calls.ll | ||
22 | Should also test a recursive loop |
lib/Target/AMDGPU/AMDGPUAnnotateKernelFeatures.cpp | ||
---|---|---|
49 | llvm:: unnecessary | |
189 | Random whitespace change | |
220 | Space before ( | |
221 | You aren't using this as a stack, so this is weird. You could just do a range loop on reverse(SortedNodeList) | |
338 | Comment is misleading since you aren't really sorting it. Also still not clear why the number of uses matters. Is it just to find unused functions? | |
351 | Extra whitespace change | |
test/CodeGen/AMDGPU/uniform-work-group-resursion-test.ll | ||
1 | Missing amdhsa in triple. Test name spelling: resursion | |
7 | Better check variable name | |
test/CodeGen/AMDGPU/uniform-work-group-test.ll | ||
33 | A testcase with the attribute on an external function might also be useful |
lib/Target/AMDGPU/AMDGPUAnnotateKernelFeatures.cpp | ||
---|---|---|
338 | I didn't need to explicitly sort it as push_back is doing it for me. The result is a list ranging from the most number of uses to the least. |
lib/Target/AMDGPU/AMDGPUAnnotateKernelFeatures.cpp | ||
---|---|---|
221 | I used pop_back since I wanted the last element first. I'll change it to a range loop. |
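For reference, a small sketch of what the pop_back-to-range-loop change could look like. In-tree code would normally use `llvm::reverse` from `llvm/ADT/STLExtras.h`; to keep this example standalone it defines a hypothetical `reversed` adaptor instead, and the node list is modelled as a vector of names, not `CallGraphNode` pointers.

```cpp
#include <cassert>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical minimal adaptor standing in for llvm::reverse.
template <typename Container> struct ReversedRange {
  Container &C;
  auto begin() const { return std::rbegin(C); }
  auto end() const { return std::rend(C); }
};

template <typename Container> ReversedRange<Container> reversed(Container &C) {
  return {C};
}

// Visit the list last-element-first without mutating it, which is what a
// pop_back loop achieved while also destroying the list.
std::vector<std::string>
visitLastFirst(const std::vector<std::string> &NodeList) {
  std::vector<std::string> Order;
  for (const std::string &Name : reversed(NodeList))
    Order.push_back(Name);
  assert(Order.size() == NodeList.size());
  return Order;
}
```

The range loop leaves the list intact, so it can still be inspected afterwards, and it makes clear that the container is not being used as a stack.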
-Updated a test case to include an external function.
-Uploaded the correct version of the recursion test.
-Fixed other small errors.
lib/Target/AMDGPU/AMDGPUAnnotateKernelFeatures.cpp | ||
---|---|---|
226–228 | Did you mean Callee->mayBeDerefined() ? |
lib/Target/AMDGPU/AMDGPUAnnotateKernelFeatures.cpp | ||
---|---|---|
49 | Remove Sorted from the name now? | |
226–228 | I think so? I'm not sure what the difference is from isInterposable. This also should check F.hasAddressTaken. | |
234 | You don't need this, which was the point of putting this into a function | |
242–243 | You can just return true directly | |
253 | You can just return true | |
257 | You can just return true | |
260–262 | Can you avoid adding the attribute if not originally present? |
lib/Target/AMDGPU/AMDGPUAnnotateKernelFeatures.cpp | ||
---|---|---|
260–262 | If I don't add the attribute if not originally present, it can create a discrepancy in certain scenarios. |
-Directly returning "true" from the propagateAttribute function as per the comments.
-Renamed the list to NodeList, dropped "Sorted" from the name
LGTM with the long line fixed
lib/Target/AMDGPU/AMDGPUAnnotateKernelFeatures.cpp | ||
---|---|---|
233 | I think this goes over the column limit |
Hi,
This patch breaks RADV (and probably RadeonSI as well). Here's a backtrace of the problem:
$ gdb --args ./deqp-vk --deqp-case=dEQP-VK.spirv_assembly.instruction.graphics.selection_block_order.out_of_order_frag
GNU gdb (GDB) 8.2
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./deqp-vk...(no debugging symbols found)...done.
(gdb) r
Starting program: /home/hakzsam/programming/VK-GL-CTS/build/external/vulkancts/modules/vulkan/deqp-vk --deqp-case=dEQP-VK.spirv_assembly.instruction.graphics.selection_block_order.out_of_order_frag
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
Writing test log into TestResults.qpa
dEQP Core git-12aa347f43c85df3a0daf930739551d3f53d3d48 (0x12aa347f) starting..
target implementation = 'Default'
[New Thread 0x7ffff1968700 (LWP 4723)]
[Thread 0x7ffff1968700 (LWP 4723) exited]
[New Thread 0x7ffff1968700 (LWP 4724)]
Thread 1 "deqp-vk" received signal SIGSEGV, Segmentation fault.
0x00007ffff447a81a in llvm::ValueHandleBase::AddToExistingUseList (this=this@entry=0x7fffffffbc50, List=0x7ffff5fdab80 <vtable for llvm::AAResults::Model<llvm::ScopedNoAliasAAResult>+16>) at ../lib/IR/Value.cpp:745
745 *List = this;
(gdb) bt
#0 0x00007ffff447a81a in llvm::ValueHandleBase::AddToExistingUseList (this=this@entry=0x7fffffffbc50, List=0x7ffff5fdab80 <vtable for llvm::AAResults::Model<llvm::ScopedNoAliasAAResult>+16>) at ../lib/IR/Value.cpp:745
#1 0x00007ffff572fd87 in llvm::ValueHandleBase::ValueHandleBase (RHS=..., Kind=llvm::ValueHandleBase::WeakTracking, this=0x7fffffffbc50) at ../include/llvm/ADT/PointerIntPair.h:150
#2 llvm::WeakTrackingVH::WeakTrackingVH (RHS=..., this=0x7fffffffbc50) at ../include/llvm/IR/ValueHandle.h:187
#3 std::pair<llvm::WeakTrackingVH, llvm::CallGraphNode*>::pair (this=0x7fffffffbc50) at /usr/include/c++/8.2.1/bits/stl_pair.h:303
#4 (anonymous namespace)::AMDGPUAnnotateKernelFeatures::processUniformWorkGroupAttribute (this=0x555557e87570) at ../lib/Target/AMDGPU/AMDGPUAnnotateKernelFeatures.cpp:224
#5 (anonymous namespace)::AMDGPUAnnotateKernelFeatures::runOnSCC (this=<optimized out>, SCC=...) at ../lib/Target/AMDGPU/AMDGPUAnnotateKernelFeatures.cpp:355
#6 0x00007ffff51d3d47 in (anonymous namespace)::CGPassManager::RunPassOnSCC (DevirtualizedCall=<synthetic pointer>: <optimized out>, CallGraphUpToDate=<synthetic pointer>: <optimized out>, CG=..., CurSCC=..., P=
0x555557e87570, this=0x555557e87650) at ../lib/Analysis/CallGraphSCCPass.cpp:141
#7 (anonymous namespace)::CGPassManager::RunAllPassesOnSCC (DevirtualizedCall=<synthetic pointer>: <optimized out>, CG=..., CurSCC=..., this=0x555557e87650) at ../lib/Analysis/CallGraphSCCPass.cpp:442
#8 (anonymous namespace)::CGPassManager::runOnModule (this=0x555557e87650, M=...) at ../lib/Analysis/CallGraphSCCPass.cpp:498
#9 0x00007ffff4428fba in (anonymous namespace)::MPPassManager::runOnModule (M=..., this=0x555557e65c20) at ../lib/IR/LegacyPassManager.cpp:1744
#10 llvm::legacy::PassManagerImpl::run (this=0x555557e657b0, M=...) at ../lib/IR/LegacyPassManager.cpp:1857
#11 0x00007ffff61fa7a6 in ac_compile_module_to_binary (p=0x555557e65750, module=module@entry=0x555557eb65a0, binary=binary@entry=0x7fffffffc080) at /home/hakzsam/install/llvm/debug/master/include/llvm/IR/Module.h:889
#12 0x00007ffff61b6e2b in radv_llvm_per_thread_info::compile_to_memory_buffer (this=<optimized out>, binary=0x7fffffffc080, module=0x555557eb65a0) at radv_llvm_helper.cpp:97
#13 radv_compile_to_binary (info=info@entry=0x7fffffffc050, module=module@entry=0x7ffff61fa7a6 <ac_compile_module_to_binary(ac_compiler_passes*, LLVMModuleRef, ac_shader_binary*)+22>, binary=binary@entry=0x7fffffffc080)
at radv_llvm_helper.cpp:97
#14 0x00007ffff61b0d81 in ac_llvm_compile (ac_llvm=0x7fffffffc050, binary=0x7fffffffc080, M=0x7ffff61fa7a6 <ac_compile_module_to_binary(ac_compiler_passes*, LLVMModuleRef, ac_shader_binary*)+22>) at radv_nir_to_llvm.c:3660
#15 ac_compile_llvm_module (ac_llvm=ac_llvm@entry=0x7fffffffc050, llvm_module=0x7ffff61fa7a6 <ac_compile_module_to_binary(ac_compiler_passes*, LLVMModuleRef, ac_shader_binary*)+22>, binary=binary@entry=0x7fffffffc080,
config=0x7fffffffc080, config@entry=0x555557f71778, stage=MESA_SHADER_COMPUTE, options=0x7fffffffc140, shader_info=<optimized out>, shader_info=<optimized out>) at radv_nir_to_llvm.c:3684
#16 0x00007ffff61b65d0 in radv_compile_nir_shader (ac_llvm=ac_llvm@entry=0x7fffffffc050, binary=binary@entry=0x7fffffffc080, config=config@entry=0x555557f71778, shader_info=shader_info@entry=0x555557f717a0,
nir=nir@entry=0x7fffffffc388, nir_count=nir_count@entry=1, options=0x7fffffffc140) at radv_nir_to_llvm.c:3808
#17 0x00007ffff61c56db in shader_variant_create (device=device@entry=0x555557e4d920, module=0x7fffffffcd40, shaders=shaders@entry=0x7fffffffc388, shader_count=shader_count@entry=1, stage=MESA_SHADER_COMPUTE,
options=options@entry=0x7fffffffc140, gs_copy_shader=false, code_out=0x7fffffffc3b8, code_size_out=0x7fffffffc2e4) at radv_shader.c:612
#18 0x00007ffff61c5b04 in radv_shader_variant_create (device=device@entry=0x555557e4d920, module=<optimized out>, shaders=shaders@entry=0x7fffffffc388, shader_count=shader_count@entry=1, layout=<optimized out>,
key=key@entry=0x7fffffffc810, code_out=0x7fffffffc3b8, code_size_out=0x7fffffffc2e4) at radv_shader.c:666
#19 0x00007ffff61b8aa6 in radv_create_shaders (pipeline=0x555557ef5ca0, device=<optimized out>, cache=0x555557e4d998, key=<optimized out>, pStages=<optimized out>, flags=<optimized out>) at radv_pipeline.c:2151
#20 0x00007ffff61bf7eb in radv_compute_pipeline_create (pPipeline=0x555557e4ef70, pAllocator=<optimized out>, pCreateInfo=0x7fffffffcbd0, _cache=<optimized out>, _device=0x555557e4d920) at radv_pipeline.c:3787
#21 radv_CreateComputePipelines (_device=_device@entry=0x555557e4d920, pipelineCache=pipelineCache@entry=0x555557e4d998, count=count@entry=1, pCreateInfos=pCreateInfos@entry=0x7fffffffcbd0, pAllocator=pAllocator@entry=0x0,
pPipelines=pPipelines@entry=0x555557e4ef70) at radv_pipeline.c:3817
#22 0x00007ffff619653a in radv_device_init_meta_itob_state (device=0x555557e4d920) at radv_private.h:1986
#23 radv_device_init_meta_bufimage_state (device=device@entry=0x555557e4d920) at radv_meta_bufimage.c:1489
#24 0x00007ffff6175a4a in radv_device_init_meta (device=device@entry=0x555557e4d920) at radv_meta.c:365
#25 0x00007ffff61680d0 in radv_CreateDevice (physicalDevice=0x555557d7c0e0, pCreateInfo=0x7fffffffd0d0, pAllocator=<optimized out>, pDevice=0x555557d82ec0) at radv_device.c:1702
#26 0x00007ffff640c574 in ?? () from /usr/lib/libvulkan.so.1
#27 0x00007ffff641599b in ?? () from /usr/lib/libvulkan.so.1
#28 0x00007ffff6419b29 in vkCreateDevice () from /usr/lib/libvulkan.so.1
#29 0x0000555556942f7d in vk::createDevice(vk::PlatformInterface const&, vk::VkInstance_s*, vk::InstanceInterface const&, vk::VkPhysicalDevice_s*, vk::VkDeviceCreateInfo const*, vk::VkAllocationCallbacks const*) ()
#30 0x00005555558a3384 in vkt::DefaultDevice::DefaultDevice(vk::PlatformInterface const&, tcu::CommandLine const&) ()
#31 0x00005555558a40e5 in vkt::Context::Context(tcu::TestContext&, vk::PlatformInterface const&, vk::ProgramCollection<vk::ProgramBinary, vk::BinaryBuildOptions>&) ()
#32 0x000055555588c3e2 in vkt::TestCaseExecutor::TestCaseExecutor(tcu::TestContext&) ()
#33 0x000055555588c552 in vkt::TestPackage::createExecutor() const ()
#34 0x0000555556e04964 in tcu::TestSessionExecutor::iterate() ()
#35 0x0000555556dd89a9 in tcu::App::iterate() ()
#36 0x000055555587e4e8 in main ()
Can you look into it?
Thanks!
Would you be able to provide LLVM IR? I am trying to reproduce this issue locally but it is taking some time as I haven't worked with RADV yet.
Thanks.
Unfortunately, I can't get any LLVM IR because it crashes too early in the process. You can also reproduce the problem with RadeonSI btw.
Thanks for the revert!