This is an archive of the discontinued LLVM Phabricator instance.

[IndirectFunctions] Skip propagating attributes to address taken functions
ClosedPublic

Authored by madhur13490 on Jan 13 2021, 1:59 AM.

Details

Summary

In case of indirect calls or address taken functions,
skip propagating any attributes to them. We just
propagate features to such functions.

Diff Detail

Event Timeline

madhur13490 created this revision.Jan 13 2021, 1:59 AM
madhur13490 requested review of this revision.Jan 13 2021, 1:59 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 13 2021, 1:59 AM
arsenm added inline comments.Jan 13 2021, 6:17 AM
llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp
240–245

Propagating the subtarget features is broken. The cases that need this both should be set up front, or shouldn't be subtarget features

llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll
42

CHECK-LABEL, and don't need the arguments/attribute group

45–65

This can be simplified. You shouldn't need allocas or addrspacecasts

With this patch you would set features on an address-taken function and ignore whole call stack below it. I.e. its own callees will not be processed. I think you need to continue traversal, just skip actual setting of attributes on such a function. Setting these attributes on a functions it may call in turn shall be fine.

llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll
59

There should be no numbered variables in tests.

rebase + address comments + fix test

With this patch you would set features on an address-taken function and ignore whole call stack below it. I.e. its own callees will not be processed. I think you need to continue traversal, just skip actual setting of attributes on such a function. Setting these attributes on a functions it may call in turn shall be fine.

Decided to skip both features and attributes to address taken functions as discussed offline.

madhur13490 marked 3 inline comments as done.Jan 15 2021, 12:57 AM
rampitec added inline comments.Jan 15 2021, 10:47 AM
llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp
242

I think you still need to add it to NewRoots to propagate from F down the stack.

arsenm added inline comments.Jan 15 2021, 11:27 AM
llvm/test/CodeGen/AMDGPU/propagate-attributes-common-callees.ll
78–79

Don't need all this, most of this is noise

Address Matt's comments

Bump!

llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp
242

I don't think that is needed. The address taken functions will neither have features not attributes to no need to propagate them to their callees. If a function is called from both address taken function and a non-address taken function then the traversal would address that.

rampitec added inline comments.Jan 19 2021, 10:17 AM
llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp
242

I'd say never say never. You had a proper and simple code to add it the roots in this if() block.

Address Stas's comment

rampitec added inline comments.Jan 20 2021, 9:23 AM
llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp
242

NewRoots is a set, there is no need to check for the NewRoots.count(F), just insert.

244

It is not really changed yet.

270

Do you still need this part? It was needed when you was doing partial update of the properties, which you do not do now.

llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll
78

This can be cleaned a lot.

madhur13490 added inline comments.Jan 20 2021, 9:28 AM
llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp
244

Ok

270

Yes, it is still needed. Traversal may go into infinite loop, for example, the common-callees test depicts one scenario where we need this condition for convergence.

llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll
78

How? These are features being propagated from direct callees.

rampitec added inline comments.Jan 20 2021, 9:32 AM
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll
78

These are not features we check for or propagate. You can reduce the list to only the features we really need to propagate.

Remove changed

madhur13490 added inline comments.Jan 20 2021, 9:39 AM
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll
78

Not sure if I understood. These are the features propagated by the "early" version to the function. If you mean just checking the attribute ID i.e. "#0" then I am not sure what value it adds.

rampitec added inline comments.Jan 20 2021, 9:46 AM
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll
78

You do not need all that long list: +16-bit-insts,+add-no-carry-insts,+aperture-regs,+ci-insts,+dl-insts,+dot1-insts,+dot2-insts,+dpp,+ds-src2-insts,+enable-ds128,+enable-prt-strict-null,+fast-denormal-f32,+fast-fmaf,+flat-address-space,+flat-for-global,+flat-global-insts,+flat-inst-offsets,+flat-scratch-insts,+fma-mix-insts,+fp64,+gcn3-encoding,+gfx7-gfx8-gfx9-insts,+gfx8-insts,+gfx9,+gfx9-insts,+half-rate-64-ops,+image-gather4-d16-bug,+int-clamp-insts,+inv-2pi-inline-imm,+ldsbankcount32,+load-store-opt,+localmemorysize65536,+mad-mac-f32-insts,+no-xnack-support,+promote-alloca,+r128-a16,+s-memrealtime,+s-memtime-inst,+scalar-atomics,+scalar-flat-scratch-insts,+scalar-stores,+sdwa,+sdwa-omod,+sdwa-scalar,+sdwa-sdst,+sram-ecc,+trap-handler,+unaligned-access-mode,+unaligned-buffer-access,+unaligned-ds-access,+vgpr-index-mode,+vop3p

Remove unnecessary things from feature list

rampitec added inline comments.Jan 20 2021, 10:21 AM
llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp
242

Just

if (!Roots.count(&F))
  NewRoots.insert(&F);

NewRoots is a set.

llvm/test/CodeGen/AMDGPU/propagate-attributes-common-callees.ll
12

95% of the code can be removed, it is not needed for the test. Here and in other places. You only need empty functions and calls.

llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll
78

It still can be cleaned. +xnack is not needed. Most of the stuff in #1 is not needed too. Here and in other places.

simplify tests; remove align, add dummy body; remove input param

rampitec accepted this revision.Jan 20 2021, 11:56 AM

LGTM, just remove "target datalayout" from all tests before you submit. I believe it is not needed.

This revision is now accepted and ready to land.Jan 20 2021, 11:56 AM

Remove target layout and addrspace

update summary

This revision was landed with ongoing or failed builds.Jan 20 2021, 11:04 PM
This revision was automatically updated to reflect the committed changes.

This change is causing infinite loop when compiling rocThrust and hipCUB. Can you take a look? Thanks.

This change is causing infinite loop when compiling rocThrust and hipCUB. Can you take a look? Thanks.

FWIW, after testing rocThrust and hipCUB, it turned out that, it is not an infinite loop but a rise in compile-time which crossed timing threshold of the internal infra. The apps eventually compiled with 1-2% increment in the compile-time. I figured out the cause behind this. This patch makes two additional calls to Function::hasAddressTaken() and hasAddressTaken() is not optimal. Each time Function::hasAddressTaken() is called, it traverses over all uses of the function to deduce the required info. The calls made by this patch are itself in the loop which effectively made the suboptimality of Function::hasAddressTaken() conspicuous. In the new patch I am going to remove one call to Function::hasAddressTaken() which still preserves the intention of this patch. The new patch would basically be Diff3 of this patch.

We should think about optimizing Function::hasAddressTaken() by introducing a bool in Function class which would cache the result. In addition to this, Function::hasAddressTaken() can accept an additional bool parameter and based on its truthness, the function can return the cached value. The default value of parameter bool should be false to preserve today's behavior but client can set it to "true" if they want the cached value. Latter would be useful for this patch as the information is unlikely to change in a module (which is static naturally). This optimization would significantly improve the run-time of hasAddressTaken().

This change is causing infinite loop when compiling rocThrust and hipCUB. Can you take a look? Thanks.

FWIW, after testing rocThrust and hipCUB, it turned out that, it is not an infinite loop but a rise in compile-time which crossed timing threshold of the internal infra. The apps eventually compiled with 1-2% increment in the compile-time. I figured out the cause behind this. This patch makes two additional calls to Function::hasAddressTaken() and hasAddressTaken() is not optimal. Each time Function::hasAddressTaken() is called, it traverses over all uses of the function to deduce the required info. The calls made by this patch are itself in the loop which effectively made the suboptimality of Function::hasAddressTaken() conspicuous. In the new patch I am going to remove one call to Function::hasAddressTaken() which still preserves the intention of this patch. The new patch would basically be Diff3 of this patch.

We should think about optimizing Function::hasAddressTaken() by introducing a bool in Function class which would cache the result. In addition to this, Function::hasAddressTaken() can accept an additional bool parameter and based on its truthness, the function can return the cached value. The default value of parameter bool should be false to preserve today's behavior but client can set it to "true" if they want the cached value. Latter would be useful for this patch as the information is unlikely to change in a module (which is static naturally). This optimization would significantly improve the run-time of hasAddressTaken().

Without this change, internal ci takes ~10 minutes on average to compile rocThrust, ~9 minutes on average to compile hipCUB.

With this change, internal ci timed out after compiling rocThrust for 1.5 hours, timed out after compiling hipCUB after 1 hour.

When bisecting on my local machine, I gave 1 hour on top of what ci gave before terminating...

It did *not* look like a 1-2% increase in time.

This change is causing infinite loop when compiling rocThrust and hipCUB. Can you take a look? Thanks.

FWIW, after testing rocThrust and hipCUB, it turned out that, it is not an infinite loop but a rise in compile-time which crossed timing threshold of the internal infra. The apps eventually compiled with 1-2% increment in the compile-time. I figured out the cause behind this. This patch makes two additional calls to Function::hasAddressTaken() and hasAddressTaken() is not optimal. Each time Function::hasAddressTaken() is called, it traverses over all uses of the function to deduce the required info. The calls made by this patch are itself in the loop which effectively made the suboptimality of Function::hasAddressTaken() conspicuous. In the new patch I am going to remove one call to Function::hasAddressTaken() which still preserves the intention of this patch. The new patch would basically be Diff3 of this patch.

We should think about optimizing Function::hasAddressTaken() by introducing a bool in Function class which would cache the result. In addition to this, Function::hasAddressTaken() can accept an additional bool parameter and based on its truthness, the function can return the cached value. The default value of parameter bool should be false to preserve today's behavior but client can set it to "true" if they want the cached value. Latter would be useful for this patch as the information is unlikely to change in a module (which is static naturally). This optimization would significantly improve the run-time of hasAddressTaken().

Without this change, internal ci takes ~10 minutes on average to compile rocThrust, ~9 minutes on average to compile hipCUB.

With this change, internal ci timed out after compiling rocThrust for 1.5 hours, timed out after compiling hipCUB after 1 hour.

When bisecting on my local machine, I gave 1 hour on top of what ci gave before terminating...

It did *not* look like a 1-2% increase in time.

Well, when I tried locally in the CI's container, it did not take more than 20mins to compile so I am not sure why it was stuck for more 1-2 hours.

This is what I got from -time-passes of llc.

W/o my change:
8.1618 ( 1.1%) 0.0000 ( 0.0%) 8.1618 ( 1.1%) 8.1637 ( 1.1%) Early propagate attributes from kernels to functions

W/ my change:
9.8957 ( 1.4%) 0.0130 ( 0.3%) 9.9088 ( 1.4%) 9.9124 ( 1.4%) Early propagate attributes from kernels to functions

Before we continue to solutions could you please clarify what in the current rocThrust has address taken? It may not have indirect function calls and without it this change should be a no-op.