In case of indirect calls or address taken functions,
skip propagating any attributes to them. We just
propagate features to such functions.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp | ||
---|---|---|
240–245 | Propagating the subtarget features is broken. The cases that need this both should be set up front, or shouldn't be subtarget features | |
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll | ||
41 | CHECK-LABEL, and don't need the arguments/attribute group | |
44–64 | This can be simplified. You shouldn't need allocas or addrspacecasts |
With this patch you would set features on an address-taken function and ignore whole call stack below it. I.e. its own callees will not be processed. I think you need to continue traversal, just skip actual setting of attributes on such a function. Setting these attributes on a functions it may call in turn shall be fine.
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll | ||
---|---|---|
58 | There should be no numbered variables in tests. |
Decided to skip both features and attributes to address taken functions as discussed offline.
llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp | ||
---|---|---|
242 | I think you still need to add it to NewRoots to propagate from F down the stack. |
llvm/test/CodeGen/AMDGPU/propagate-attributes-common-callees.ll | ||
---|---|---|
77–78 ↗ | (On Diff #316862) | Don't need all this, most of this is noise |
Bump!
llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp | ||
---|---|---|
242 | I don't think that is needed. The address taken functions will neither have features not attributes to no need to propagate them to their callees. If a function is called from both address taken function and a non-address taken function then the traversal would address that. |
llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp | ||
---|---|---|
242 | I'd say never say never. You had a proper and simple code to add it the roots in this if() block. |
llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp | ||
---|---|---|
242 | NewRoots is a set, there is no need to check for the NewRoots.count(F), just insert. | |
244 | It is not really changed yet. | |
271 | Do you still need this part? It was needed when you was doing partial update of the properties, which you do not do now. | |
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll | ||
78 | This can be cleaned a lot. |
llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp | ||
---|---|---|
244 | Ok | |
271 | Yes, it is still needed. Traversal may go into infinite loop, for example, the common-callees test depicts one scenario where we need this condition for convergence. | |
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll | ||
78 | How? These are features being propagated from direct callees. |
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll | ||
---|---|---|
78 | These are not features we check for or propagate. You can reduce the list to only the features we really need to propagate. |
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll | ||
---|---|---|
78 | Not sure if I understood. These are the features propagated by the "early" version to the function. If you mean just checking the attribute ID i.e. "#0" then I am not sure what value it adds. |
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll | ||
---|---|---|
78 | You do not need all that long list: +16-bit-insts,+add-no-carry-insts,+aperture-regs,+ci-insts,+dl-insts,+dot1-insts,+dot2-insts,+dpp,+ds-src2-insts,+enable-ds128,+enable-prt-strict-null,+fast-denormal-f32,+fast-fmaf,+flat-address-space,+flat-for-global,+flat-global-insts,+flat-inst-offsets,+flat-scratch-insts,+fma-mix-insts,+fp64,+gcn3-encoding,+gfx7-gfx8-gfx9-insts,+gfx8-insts,+gfx9,+gfx9-insts,+half-rate-64-ops,+image-gather4-d16-bug,+int-clamp-insts,+inv-2pi-inline-imm,+ldsbankcount32,+load-store-opt,+localmemorysize65536,+mad-mac-f32-insts,+no-xnack-support,+promote-alloca,+r128-a16,+s-memrealtime,+s-memtime-inst,+scalar-atomics,+scalar-flat-scratch-insts,+scalar-stores,+sdwa,+sdwa-omod,+sdwa-scalar,+sdwa-sdst,+sram-ecc,+trap-handler,+unaligned-access-mode,+unaligned-buffer-access,+unaligned-ds-access,+vgpr-index-mode,+vop3p |
llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp | ||
---|---|---|
242 | Just if (!Roots.count(&F)) NewRoots.insert(&F); NewRoots is a set. | |
llvm/test/CodeGen/AMDGPU/propagate-attributes-common-callees.ll | ||
11 ↗ | (On Diff #317918) | 95% of the code can be removed, it is not needed for the test. Here and in other places. You only need empty functions and calls. |
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll | ||
78 | It still can be cleaned. +xnack is not needed. Most of the stuff in #1 is not needed too. Here and in other places. |
LGTM, just remove "target datalayout" from all tests before you submit. I believe it is not needed.
This change is causing infinite loop when compiling rocThrust and hipCUB. Can you take a look? Thanks.
FWIW, after testing rocThrust and hipCUB, it turned out that, it is not an infinite loop but a rise in compile-time which crossed timing threshold of the internal infra. The apps eventually compiled with 1-2% increment in the compile-time. I figured out the cause behind this. This patch makes two additional calls to Function::hasAddressTaken() and hasAddressTaken() is not optimal. Each time Function::hasAddressTaken() is called, it traverses over all uses of the function to deduce the required info. The calls made by this patch are itself in the loop which effectively made the suboptimality of Function::hasAddressTaken() conspicuous. In the new patch I am going to remove one call to Function::hasAddressTaken() which still preserves the intention of this patch. The new patch would basically be Diff3 of this patch.
We should think about optimizing Function::hasAddressTaken() by introducing a bool in Function class which would cache the result. In addition to this, Function::hasAddressTaken() can accept an additional bool parameter and based on its truthness, the function can return the cached value. The default value of parameter bool should be false to preserve today's behavior but client can set it to "true" if they want the cached value. Latter would be useful for this patch as the information is unlikely to change in a module (which is static naturally). This optimization would significantly improve the run-time of hasAddressTaken().
Without this change, internal ci takes ~10 minutes on average to compile rocThrust, ~9 minutes on average to compile hipCUB.
With this change, internal ci timed out after compiling rocThrust for 1.5 hours, timed out after compiling hipCUB after 1 hour.
When bisecting on my local machine, I gave 1 hour on top of what ci gave before terminating...
It did *not* look like a 1-2% increase in time.
Well, when I tried locally in the CI's container, it did not take more than 20mins to compile so I am not sure why it was stuck for more 1-2 hours.
This is what I got from -time-passes of llc.
W/o my change:
8.1618 ( 1.1%) 0.0000 ( 0.0%) 8.1618 ( 1.1%) 8.1637 ( 1.1%) Early propagate attributes from kernels to functions
W/ my change:
9.8957 ( 1.4%) 0.0130 ( 0.3%) 9.9088 ( 1.4%) 9.9124 ( 1.4%) Early propagate attributes from kernels to functions
Before we continue to solutions could you please clarify what in the current rocThrust has address taken? It may not have indirect function calls and without it this change should be a no-op.
I think you still need to add it to NewRoots to propagate from F down the stack.