This is an archive of the discontinued LLVM Phabricator instance.

[IndirectFunctions] Skip propagating attributes to address taken functions
ClosedPublic

Authored by madhur13490 on Jan 13 2021, 1:59 AM.

Details

Reviewers

arsenm
rampitec

Commits

rGdd8ae42674b4: [IndirectFunctions] Skip propagating attributes to address taken functions

Summary

In case of indirect calls or address taken functions,
skip propagating any attributes to them. We just
propagate features to such functions.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

madhur13490 created this revision. Jan 13 2021, 1:59 AM

Herald added subscribers: jdoerfert, kerbowa, jfb and 3 others. Jan 13 2021, 1:59 AM

madhur13490 requested review of this revision. Jan 13 2021, 1:59 AM

Herald added a project: Restricted Project. Jan 13 2021, 1:59 AM

Herald added subscribers: llvm-commits, wdng.

Harbormaster completed remote builds in B84988: Diff 316350. Jan 13 2021, 2:34 AM

arsenm added inline comments. Jan 13 2021, 6:17 AM

llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp
240–245	Propagating the subtarget features is broken. The cases that need this both should be set up front, or shouldn't be subtarget features
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll
42	CHECK-LABEL, and don't need the arguments/attribute group
45–65	This can be simplified. You shouldn't need allocas or addrspacecasts

With this patch you would set features on an address-taken function and ignore the whole call stack below it, i.e. its own callees will not be processed. I think you need to continue the traversal and just skip actually setting attributes on such a function. Setting these attributes on the functions it may call in turn should be fine.

llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll
59	There should be no numbered variables in tests.
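
The traversal suggestion in the comment above (keep walking the call graph, but do not modify address-taken functions themselves) can be sketched on a toy call graph. This is only an illustration under stated assumptions, not the pass's code: ToyFunction, ToyModule, propagate and makeExample are hypothetical names, and a std::map of named functions stands in for the LLVM module.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy model of a function; not an LLVM type.
struct ToyFunction {
  std::vector<std::string> Callees; // names of directly called functions
  bool AddressTaken;                // true if the function's address escapes
  bool IsKernel;                    // true for entry points (traversal roots)
};

using ToyModule = std::map<std::string, ToyFunction>;

// Walks the call graph from the kernels and returns the names of functions
// that would receive propagated attributes. Address-taken functions are
// never added to the result, but their callees are still visited.
std::set<std::string> propagate(const ToyModule &M) {
  std::set<std::string> Visited, Updated;
  std::vector<std::string> Worklist;
  for (const auto &Entry : M) {
    if (Entry.second.IsKernel) {
      Visited.insert(Entry.first);
      Worklist.push_back(Entry.first);
    }
  }
  while (!Worklist.empty()) {
    std::string Name = Worklist.back();
    Worklist.pop_back();
    for (const std::string &Callee : M.at(Name).Callees) {
      auto It = M.find(Callee);
      if (It == M.end())
        continue; // treated like a declaration: nothing to do
      if (!It->second.AddressTaken)
        Updated.insert(Callee); // only non-address-taken callees change
      if (Visited.insert(Callee).second)
        Worklist.push_back(Callee); // keep traversing in either case
    }
  }
  return Updated;
}

// kernel -> target (address taken) -> helper
ToyModule makeExample() {
  ToyModule M;
  M["kernel"] = ToyFunction{{"target"}, false, true};
  M["target"] = ToyFunction{{"helper"}, true, false};
  M["helper"] = ToyFunction{{}, false, false};
  return M;
}
```

In the example graph, propagate(makeExample()) contains "helper" but not "target": the address-taken function is left untouched while the function below it is still reached.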

rebase + address comments + fix test

In D94585#2496288, @rampitec wrote:

With this patch you would set features on an address-taken function and ignore the whole call stack below it, i.e. its own callees will not be processed. I think you need to continue the traversal and just skip actually setting attributes on such a function. Setting these attributes on the functions it may call in turn should be fine.

Decided to skip both features and attributes to address taken functions as discussed offline.

madhur13490 marked 3 inline comments as done. Jan 15 2021, 12:57 AM

Harbormaster completed remote builds in B85304: Diff 316862. Jan 15 2021, 1:40 AM

rampitec added inline comments. Jan 15 2021, 10:47 AM

llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp
242	I think you still need to add it to NewRoots to propagate from F down the stack.

arsenm added inline comments. Jan 15 2021, 11:27 AM

llvm/test/CodeGen/AMDGPU/propagate-attributes-common-callees.ll
78–79	Don't need all this, most of this is noise

Address Matt's comments

Harbormaster completed remote builds in B85605: Diff 317366. Jan 18 2021, 9:02 AM

Bump!

llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp
242	I don't think that is needed. Address-taken functions will have neither features nor attributes, so there is no need to propagate them to their callees. If a function is called from both an address-taken function and a non-address-taken function, the traversal would handle that.

rampitec added inline comments. Jan 19 2021, 10:17 AM

llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp
242	I'd say never say never. You had proper and simple code to add it to the roots in this if() block.

Address Stas's comment

rebase

rampitec added inline comments. Jan 20 2021, 9:23 AM

llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp
242	NewRoots is a set, there is no need to check for the NewRoots.count(F), just insert.
244	It is not really changed yet.
271	Do you still need this part? It was needed when you were doing a partial update of the properties, which you do not do now.
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll
78	This can be cleaned a lot.

madhur13490 added inline comments. Jan 20 2021, 9:28 AM

llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp
244	Ok
271	Yes, it is still needed. The traversal may go into an infinite loop; for example, the common-callees test depicts one scenario where we need this condition for convergence.
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll
78	How? These are features being propagated from direct callees.
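
The convergence concern in the comment on line 271 can be illustrated with a toy version of the Roots/NewRoots loop. This is a sketch under assumptions, not the pass itself: functions are plain integers, iterateToFixedPoint and CallersOf are hypothetical names, and the loop is assumed to repeat until a round adds no new roots. It terminates even on a cyclic call graph because a function only enters NewRoots when it is not already in Roots, so Roots grows strictly in every round that continues.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <vector>

// Hypothetical toy: functions are ints, CallersOf maps callee -> callers.
// Roots is extended in place; the return value is the number of rounds
// executed before a round produced no new roots.
unsigned iterateToFixedPoint(const std::map<int, std::vector<int>> &CallersOf,
                             std::set<int> &Roots, std::set<int> NewRoots) {
  unsigned Rounds = 0;
  do {
    Roots.insert(NewRoots.begin(), NewRoots.end());
    NewRoots.clear();

    for (const auto &Entry : CallersOf) {
      int Callee = Entry.first;
      for (int Caller : Entry.second) {
        // Only callers that are already roots can make the callee a root.
        if (!Roots.count(Caller) && !NewRoots.count(Caller))
          continue;
        // Never re-add a known root; this is what guarantees progress.
        if (!Roots.count(Callee))
          NewRoots.insert(Callee);
      }
    }
    ++Rounds;
  } while (!NewRoots.empty());
  return Rounds;
}
```

With a kernel 0 calling 1, and 1 and 2 calling each other, the loop stops after the round in which nothing new is discovered, despite the cycle between 1 and 2.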

rampitec added inline comments. Jan 20 2021, 9:32 AM

llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll
78	These are not features we check for or propagate. You can reduce the list to only the features we really need to propagate.

Remove changed

madhur13490 added inline comments. Jan 20 2021, 9:39 AM

llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll
78	Not sure if I understood. These are the features propagated by the "early" version to the function. If you mean just checking the attribute ID i.e. "#0" then I am not sure what value it adds.

rampitec added inline comments. Jan 20 2021, 9:46 AM

llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll
78	You do not need all that long list: +16-bit-insts,+add-no-carry-insts,+aperture-regs,+ci-insts,+dl-insts,+dot1-insts,+dot2-insts,+dpp,+ds-src2-insts,+enable-ds128,+enable-prt-strict-null,+fast-denormal-f32,+fast-fmaf,+flat-address-space,+flat-for-global,+flat-global-insts,+flat-inst-offsets,+flat-scratch-insts,+fma-mix-insts,+fp64,+gcn3-encoding,+gfx7-gfx8-gfx9-insts,+gfx8-insts,+gfx9,+gfx9-insts,+half-rate-64-ops,+image-gather4-d16-bug,+int-clamp-insts,+inv-2pi-inline-imm,+ldsbankcount32,+load-store-opt,+localmemorysize65536,+mad-mac-f32-insts,+no-xnack-support,+promote-alloca,+r128-a16,+s-memrealtime,+s-memtime-inst,+scalar-atomics,+scalar-flat-scratch-insts,+scalar-stores,+sdwa,+sdwa-omod,+sdwa-scalar,+sdwa-sdst,+sram-ecc,+trap-handler,+unaligned-access-mode,+unaligned-buffer-access,+unaligned-ds-access,+vgpr-index-mode,+vop3p

Remove unnecessary things from feature list

Harbormaster completed remote builds in B85902: Diff 317897. Jan 20 2021, 10:20 AM

rampitec added inline comments. Jan 20 2021, 10:21 AM

llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp
242	Just write: if (!Roots.count(&F)) NewRoots.insert(&F); NewRoots is a set.
llvm/test/CodeGen/AMDGPU/propagate-attributes-common-callees.ll
12	95% of the code can be removed, it is not needed for the test. Here and in other places. You only need empty functions and calls.
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll
78	It still can be cleaned. +xnack is not needed. Most of the stuff in #1 is not needed too. Here and in other places.
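
The point about NewRoots being a set can be shown in isolation: inserting an element that is already present is a no-op, so only the Roots check is needed before the insert. The sketch below uses std::set<int> as a stand-in for the pass's root containers; addRoot is a hypothetical helper, not a function from the patch.

```cpp
#include <cassert>
#include <set>

// Hypothetical helper: adds F to NewRoots unless it is already a root.
// Returns true only if NewRoots actually grew. No NewRoots.count(F) check
// is required, because std::set::insert ignores duplicates.
bool addRoot(const std::set<int> &Roots, std::set<int> &NewRoots, int F) {
  if (Roots.count(F))
    return false;
  return NewRoots.insert(F).second;
}
```

Calling addRoot twice with the same value leaves NewRoots with a single element; the second call simply reports that nothing changed.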

Harbormaster completed remote builds in B85905: Diff 317901. Jan 20 2021, 10:39 AM

Harbormaster completed remote builds in B85909: Diff 317905. Jan 20 2021, 11:12 AM

simplify tests; remove align, add dummy body; remove input param

Harbormaster completed remote builds in B85918: Diff 317918. Jan 20 2021, 11:54 AM

LGTM, just remove "target datalayout" from all tests before you submit. I believe it is not needed.

This revision is now accepted and ready to land. Jan 20 2021, 11:56 AM

Harbormaster completed remote builds in B85939: Diff 317962. Jan 20 2021, 1:44 PM

Remove target layout and addrspace

update summary

rebase

This revision was landed with ongoing or failed builds. Jan 20 2021, 11:04 PM

Closed by commit rGdd8ae42674b4: [IndirectFunctions] Skip propagating attributes to address taken functions (authored by madhur13490).

This revision was automatically updated to reflect the committed changes.

madhur13490 added a commit: rGdd8ae42674b4: [IndirectFunctions] Skip propagating attributes to address taken functions.

Harbormaster completed remote builds in B86042: Diff 318114. Jan 21 2021, 12:40 AM

Harbormaster completed remote builds in B86043: Diff 318116. Jan 21 2021, 1:01 AM

Harbormaster completed remote builds in B86046: Diff 318118. Jan 21 2021, 1:34 AM

kzhuravl added a reverting change: D95389: AMDGPU: Revert "[IndirectFunctions] Skip propagating attributes to address taken functions". Jan 25 2021, 12:54 PM

kzhuravl added a reverting change: rG2cdb34efdac5: Revert "[IndirectFunctions] Skip propagating attributes to address taken…. Jan 25 2021, 12:58 PM

This change is causing infinite loop when compiling rocThrust and hipCUB. Can you take a look? Thanks.

In D94585#2520966, @kzhuravl wrote:

This change is causing infinite loop when compiling rocThrust and hipCUB. Can you take a look? Thanks.

FWIW, after testing rocThrust and hipCUB, it turned out that this is not an infinite loop but a rise in compile time that crossed the timing threshold of the internal infra. The apps eventually compiled with a 1-2% increase in compile time. I figured out the cause. This patch makes two additional calls to Function::hasAddressTaken(), and hasAddressTaken() is not optimal: each time it is called, it traverses all uses of the function to deduce the required info. The calls made by this patch are themselves inside a loop, which made the suboptimality of Function::hasAddressTaken() conspicuous. In the new patch I am going to remove one call to Function::hasAddressTaken(), which still preserves the intention of this patch. The new patch would basically be Diff3 of this patch.

We should think about optimizing Function::hasAddressTaken() by introducing a bool in the Function class that caches the result. In addition, Function::hasAddressTaken() could accept an additional bool parameter and, when it is true, return the cached value. The parameter should default to false to preserve today's behavior, but clients can set it to true if they want the cached value. The latter would be useful for this patch, as the information is unlikely to change within a module (which is naturally static). This optimization would significantly improve the run time of hasAddressTaken().
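
The caching idea proposed above can be sketched on a toy class. Everything here is hypothetical and only mirrors the shape of the proposal: ToyFunction is not LLVM's Function, the UseCache parameter and the cached flag do not exist in LLVM as far as this thread shows, and a vector of UseKind values stands in for the real use list.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical classification of a use of a function.
enum class UseKind { DirectCall, AddressEscape };

// Hypothetical toy model; not LLVM's Function class.
class ToyFunction {
  std::vector<UseKind> Uses;
  mutable bool Cached = false;       // has a result been stored yet?
  mutable bool CachedResult = false; // last computed answer
  mutable unsigned Scans = 0;        // number of walks over Uses

public:
  explicit ToyFunction(std::vector<UseKind> U) : Uses(std::move(U)) {}

  // With UseCache == false (the default) the use list is always walked,
  // matching the uncached behavior. With UseCache == true a previously
  // stored answer is returned without walking the uses again.
  bool hasAddressTaken(bool UseCache = false) const {
    if (UseCache && Cached)
      return CachedResult;
    ++Scans;
    bool Result = false;
    for (UseKind K : Uses) {
      if (K == UseKind::AddressEscape) {
        Result = true;
        break;
      }
    }
    Cached = true;
    CachedResult = Result;
    return Result;
  }

  unsigned scans() const { return Scans; }
};
```

Repeated calls with UseCache set to true walk the uses only once. The cost of this design is that the stored answer goes stale if the uses change afterwards, which is why the proposal keeps the uncached path as the default.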

Without this change, internal ci takes ~10 minutes on average to compile rocThrust, ~9 minutes on average to compile hipCUB.

With this change, internal ci timed out after compiling rocThrust for 1.5 hours, timed out after compiling hipCUB after 1 hour.

When bisecting on my local machine, I gave 1 hour on top of what ci gave before terminating...

It did *not* look like a 1-2% increase in time.

Well, when I tried locally in the CI's container, it did not take more than 20 minutes to compile, so I am not sure why it was stuck for more than 1-2 hours.

This is what I got from -time-passes of llc.

W/o my change:
8.1618 ( 1.1%) 0.0000 ( 0.0%) 8.1618 ( 1.1%) 8.1637 ( 1.1%) Early propagate attributes from kernels to functions

W/ my change:
9.8957 ( 1.4%) 0.0130 ( 0.3%) 9.9088 ( 1.4%) 9.9124 ( 1.4%) Early propagate attributes from kernels to functions
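
For reading the two rows above, a small helper makes the relative change explicit. This is an illustrative calculation only: it assumes the fourth numeric column is wall time in seconds (the usual -time-passes layout), and relativeIncrease, WallBefore and WallAfter are names introduced here.

```cpp
#include <cassert>

// Relative change between two timings; 0.25 would mean 25% slower.
double relativeIncrease(double Before, double After) {
  return (After - Before) / Before;
}

// Wall-time figures for the pass as quoted in the two rows (seconds),
// assuming the fourth column is wall time.
const double WallBefore = 8.1637;
const double WallAfter = 9.9124;
```

relativeIncrease(WallBefore, WallAfter) comes out at roughly 0.21, i.e. the pass itself became about 21% slower, while its share of the total moved from 1.1% to 1.4%.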

Before we move on to solutions, could you please clarify what in the current rocThrust has its address taken? It may not have indirect function calls, and without them this change should be a no-op.

madhur13490 mentioned this in D103138: [AMDGPU] [IndirectCalls] Don't propagate attributes to address taken functions and their callees. May 25 2021, 10:50 PM

madhur13490 mentioned this in rG6a3beb1f68d6: [AMDGPU] [IndirectCalls] Don't propagate attributes to address taken functions…. Jun 3 2021, 11:07 PM

Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp	15 lines
llvm/test/CodeGen/AMDGPU/propagate-attributes-common-callees.ll	79 lines
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect-common-callee.ll	69 lines
llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll	79 lines
llvm/test/CodeGen/AMDGPU/propagate-attributes-indirect.ll	67 lines

Diff 317905

llvm/lib/Target/AMDGPU/AMDGPUPropagateAttributes.cpp

@@ bool AMDGPUPropagateAttributes::process() {
   do {
     Roots.insert(NewRoots.begin(), NewRoots.end());
     NewRoots.clear();

     for (auto &F : M.functions()) {
       if (F.isDeclaration())
         continue;

+      // Skip propagating attributes and features to
+      // address taken functions.
+      if (F.hasAddressTaken()) {
+        if (!Roots.count(&F) && !NewRoots.count(&F)) {
+          NewRoots.insert(&F);
+        }
+        continue;
+      }
+
       const FnProperties CalleeProps(*TM, F);
       SmallVector<std::pair<CallBase *, Function *>, 32> ToReplace;
       SmallSet<CallBase *, 32> Visited;

       for (User *U : F.users()) {
         Instruction *I = dyn_cast<Instruction>(U);
         if (!I)
           continue;
         CallBase *CI = dyn_cast<CallBase>(I);
         if (!CI)
           continue;
         Function *Caller = CI->getCaller();
         if (!Caller || !Visited.insert(CI).second)
           continue;
         if (!Roots.count(Caller) && !NewRoots.count(Caller))
           continue;

         const FnProperties CallerProps(*TM, *Caller);

-        if (CalleeProps == CallerProps) {
+        // Convergence is allowed if the caller has its
+        // address taken because all callee's (attributes + features)
+        // may not agree as the callee may be the target of
+        // more than one function (called directly or indirectly).
+        if (Caller->hasAddressTaken() || CalleeProps == CallerProps) {
           if (!Roots.count(&F))
             NewRoots.insert(&F);
           continue;
         }

         Function *NewF = findFunction(CallerProps, &F);
         if (!NewF) {
           const FnProperties NewProps = CalleeProps.adjustToCaller(CallerProps);

llvm/test/CodeGen/AMDGPU/propagate-attributes-common-callees.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -amdgpu-propagate-attributes-early %s | FileCheck %s

				target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"

				; Complicated call graph where a function is called
				; directly from a kernel and also from a function
				; whose address is taken.

				; CHECK-LABEL: define float @common_callee.gc(i32 %a) #0 {
				define float @common_callee.gc(i32 %a) {
				%add = add i32 %a, 6
				%mul = mul nsw i32 %add, 9
				%div = sdiv i32 %mul, 8
				%f = sitofp i32 %div to float
				ret float %f
				}

				; CHECK-LABEL: define float @foo(i32 %a) {
				define float @foo(i32 %a) {
				entry:
				%mul = mul nsw i32 %a, 5
				%cast = sitofp i32 %mul to float

				ret float %cast
				}

				; CHECK-LABEL: define float @bar(i32 %a) {
				define float @bar(i32 %a) {
				entry:
				%div = sdiv i32 %a, 7
				%direct_call = call contract float @common_callee.gc(i32 5)
				ret float %direct_call
				}

				; CHECK-LABEL: define float @baz(i32 %a) {
				define float @baz(i32 %a) {
				entry:
				%mul = mul nsw i32 %a, 6
				%div = sdiv i32 %mul, 7
				%conv = sitofp i32 %div to float
				ret float %conv
				}

				define amdgpu_kernel void @switch_indirect_kernel(float *%result, i32 %type) #1 {
				entry:
				%fn = alloca float (i32)*, align 8, addrspace(5)
				switch i32 %type, label %sw.default [
				i32 1, label %sw.bb
				i32 2, label %sw.bb2
				i32 3, label %sw.bb3
				]

				sw.bb:
				store float (i32)* @foo, float (i32)* addrspace(5)* %fn, align 8
				br label %sw.epilog

				sw.bb2:
				store float (i32)* @bar, float (i32)* addrspace(5)* %fn, align 8
				br label %sw.epilog

				sw.bb3:
				store float (i32)* @baz, float (i32)* addrspace(5)* %fn, align 8
				br label %sw.epilog

				sw.default:
				br label %sw.epilog

				sw.epilog:
				%fp = load float (i32)*, float (i32)* addrspace(5)* %fn, align 8
				%direct_call = call contract float @common_callee.gc(i32 4)
				%conv = fptosi float %direct_call to i32
				%call4 = call contract float %fp(i32 %conv)
				store float %call4, float* %result, align 4
				ret void
				}

				attributes #0 = { "amdgpu-flat-work-group-size"="1,256" "target-features"="+16-bit-insts,+add-no-carry-insts,+aperture-regs,+ci-insts,+dl-insts,+dot1-insts,+dot2-insts,+dpp,+ds-src2-insts,+enable-ds128,+enable-prt-strict-null,+fast-denormal-f32,+fast-fmaf,+flat-address-space,+flat-for-global,+flat-global-insts,+flat-inst-offsets,+flat-scratch-insts,+fma-mix-insts,+fp64,+gcn3-encoding,+gfx7-gfx8-gfx9-insts,+gfx8-insts,+gfx9,+gfx9-insts,+half-rate-64-ops,+image-gather4-d16-bug,+int-clamp-insts,+inv-2pi-inline-imm,+ldsbankcount32,+load-store-opt,+localmemorysize65536,+mad-mac-f32-insts,+no-xnack-support,+promote-alloca,+r128-a16,+s-memrealtime,+s-memtime-inst,+scalar-atomics,+scalar-flat-scratch-insts,+scalar-stores,+sdwa,+sdwa-omod,+sdwa-scalar,+sdwa-sdst,+sram-ecc,+trap-handler,+unaligned-access-mode,+unaligned-buffer-access,+unaligned-ds-access,+vgpr-index-mode,+vop3p,-wavefrontsize16,-wavefrontsize32,+wavefrontsize64,+xnack" }
				attributes #1 = { convergent norecurse nounwind mustprogress
				"amdgpu-flat-work-group-size"="1,256"}

llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect-common-callee.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -amdgpu-propagate-attributes-early %s | FileCheck %s

				target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"

				; Test to check if we skip propagating attributes even if
				; a function is called directly as well as
				; indirectly. "baz" is called directly as well as indirectly.

				; CHECK-LABEL: define float @foo(i32 %a) {
				define float @foo(i32 %a) {
				entry:
				%mul = mul nsw i32 %a, 5
				%cast = sitofp i32 %mul to float

				ret float %cast
				}

				; CHECK-LABEL: define float @bar(i32 %a) {
				define float @bar(i32 %a) {
				entry:
				%div = sdiv i32 %a, 7
				%conv = sitofp i32 %div to float
				ret float %conv
				}

				; CHECK-LABEL: define float @baz(i32 %a) {
				define float @baz(i32 %a) {
				entry:
				%mul = mul nsw i32 %a, 6
				%div = sdiv i32 %mul, 7
				%conv = sitofp i32 %div to float
				ret float %conv
				}

				define amdgpu_kernel void @switch_indirect_kernel(float *%result, i32 %type) #1 {
				entry:
				%fn = alloca float (i32)*, align 8, addrspace(5)
				switch i32 %type, label %sw.default [
				i32 1, label %sw.bb
				i32 2, label %sw.bb2
				i32 3, label %sw.bb3
				]

				sw.bb:
				store float (i32)* @foo, float (i32)* addrspace(5)* %fn, align 8
				br label %sw.epilog

				sw.bb2:
				store float (i32)* @bar, float (i32)* addrspace(5)* %fn, align 8
				br label %sw.epilog

				sw.bb3:
				store float (i32)* @baz, float (i32)* addrspace(5)* %fn, align 8
				br label %sw.epilog

				sw.default:
				br label %sw.epilog

				sw.epilog:
				%fp = load float (i32)*, float (i32)* addrspace(5)* %fn, align 8
				%direct_call = call contract float @baz(i32 4)
				%conv = fptosi float %direct_call to i32
				%call4 = call contract float %fp(i32 %conv)
				store float %call4, float* %result, align 4
				ret void
				}

				attributes #1 = { convergent norecurse nounwind mustprogress
				"amdgpu-flat-work-group-size"="1,256" }

llvm/test/CodeGen/AMDGPU/propagate-attributes-direct-indirect.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -amdgpu-propagate-attributes-early %s | FileCheck %s

				target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"

				; Test to check if we skip attributes on address
				; taken functions but pass to direct callees.

				; CHECK-LABEL: define float @foo(i32 %a) {
				define float @foo(i32 %a) {
				entry:
				%mul = mul nsw i32 %a, 5
				%cast = sitofp i32 %mul to float

				ret float %cast
				}

				; CHECK-LABEL: define float @bar(i32 %a) {
				define float @bar(i32 %a) {
				entry:
				%div = sdiv i32 %a, 7
				%conv = sitofp i32 %div to float
				ret float %conv
				}

				; CHECK-LABEL: define float @baz(i32 %a) {
				define float @baz(i32 %a) {
				entry:
				%mul = mul nsw i32 %a, 6
				%div = sdiv i32 %mul, 7
				%conv = sitofp i32 %div to float
				ret float %conv
				}

				; CHECK-LABEL: define float @baz2(i32 %a) #0 {
				define float @baz2(i32 %a) {
				%mul = mul nsw i32 %a, 6
				%div = sdiv i32 %mul, 8
				%add = add i32 %div , 12
				%conv = sitofp i32 %add to float
				ret float %conv
				}

				define amdgpu_kernel void @switch_indirect_kernel(float *%result, i32 %type) #1 {
				entry:
				%fn = alloca float (i32)*, align 8, addrspace(5)
				switch i32 %type, label %sw.default [
				i32 1, label %sw.bb
				i32 2, label %sw.bb2
				i32 3, label %sw.bb3
				]

				sw.bb:
				store float (i32)* @foo, float (i32)* addrspace(5)* %fn, align 8
				br label %sw.epilog

				sw.bb2:
				store float (i32)* @bar, float (i32)* addrspace(5)* %fn, align 8
				br label %sw.epilog

				sw.bb3:
				store float (i32)* @baz, float (i32)* addrspace(5)* %fn, align 8
				br label %sw.epilog

				sw.default:
				br label %sw.epilog

				sw.epilog:
				%fp = load float (i32)*, float (i32)* addrspace(5)* %fn, align 8
				%direct_call = call contract float @baz2(i32 5)
				%conv = fptosi float %direct_call to i32
				%call4 = call contract float %fp(i32 %conv)
				store float %call4, float* %result, align 4
				ret void
				}


				attributes #0 = { "amdgpu-flat-work-group-size"="1,256" "target-features"="+16-bit-insts,+add-no-carry-insts,+aperture-regs,+ci-insts,+dl-insts,+dot1-insts,+dot2-insts,+dpp,+ds-src2-insts,+enable-ds128,+enable-prt-strict-null,+fast-denormal-f32,+fast-fmaf,+flat-address-space,+flat-for-global,+flat-global-insts,+flat-inst-offsets,+flat-scratch-insts,+fma-mix-insts,+fp64,+gcn3-encoding,+gfx7-gfx8-gfx9-insts,+gfx8-insts,+gfx9,+gfx9-insts,+half-rate-64-ops,+image-gather4-d16-bug,+int-clamp-insts,+inv-2pi-inline-imm,+ldsbankcount32,+load-store-opt,+localmemorysize65536,+mad-mac-f32-insts,+no-xnack-support,+promote-alloca,+r128-a16,+s-memrealtime,+s-memtime-inst,+scalar-atomics,+scalar-flat-scratch-insts,+scalar-stores,+sdwa,+sdwa-omod,+sdwa-scalar,+sdwa-sdst,+sram-ecc,+trap-handler,+unaligned-access-mode,+unaligned-buffer-access,+unaligned-ds-access,+vgpr-index-mode,+vop3p,-wavefrontsize16,-wavefrontsize32,+wavefrontsize64,+xnack" }
				attributes #1 = { convergent norecurse nounwind mustprogress
				"amdgpu-flat-work-group-size"="1,256"}

llvm/test/CodeGen/AMDGPU/propagate-attributes-indirect.ll

This file was added.

				; RUN: opt -S -mtriple=amdgcn-amd-amdhsa -amdgpu-propagate-attributes-early %s | FileCheck %s

				target datalayout = "e-p:64:64-p1:64:64-p2:32:32-p3:32:32-p4:64:64-p5:32:32-p6:32:32-i64:64-v16:16-v24:32-v32:32-v48:64-v96:128-v192:256-v256:256-v512:512-v1024:1024-v2048:2048-n32:64-S32-A5-G1-ni:7"

				; Test to check if we skip attributes on address
				; taken functions in a simple call graph.

				; CHECK-LABEL: define float @foo(i32 %a) {
				define float @foo(i32 %a) {
				entry:
				%mul = mul nsw i32 %a, 5
				%cast = sitofp i32 %mul to float

				ret float %cast
				}

				; CHECK-LABEL: define float @bar(i32 %a) {
				define float @bar(i32 %a) {
				entry:
				%div = sdiv i32 %a, 7
				%conv = sitofp i32 %div to float
				ret float %conv
				}

				; CHECK-LABEL: define float @baz(i32 %a) {
				define float @baz(i32 %a) {
				entry:
				%mul = mul nsw i32 %a, 6
				%div = sdiv i32 %mul, 7
				%conv = sitofp i32 %div to float
				ret float %conv
				}

				define amdgpu_kernel void @switch_indirect_kernel(float *%result, i32 %type) #1 {
				entry:
				%fn = alloca float (i32)*, align 8, addrspace(5)
				switch i32 %type, label %sw.default [
				i32 1, label %sw.bb
				i32 2, label %sw.bb2
				i32 3, label %sw.bb3
				]

				sw.bb:
				store float (i32)* @foo, float (i32)* addrspace(5)* %fn, align 8
				br label %sw.epilog

				sw.bb2:
				store float (i32)* @bar, float (i32)* addrspace(5)* %fn, align 8
				br label %sw.epilog

				sw.bb3:
				store float (i32)* @baz, float (i32)* addrspace(5)* %fn, align 8
				br label %sw.epilog

				sw.default:
				br label %sw.epilog

				sw.epilog:
				%fp = load float (i32)*, float (i32)* addrspace(5)* %fn, align 8
				%call4 = call contract float %fp(i32 7)
				store float %call4, float* %result, align 4
				ret void
				}

				attributes #1 = { convergent norecurse nounwind mustprogress
				"amdgpu-flat-work-group-size"="1,256"}