This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Enable the pass "amdgpu-replace-lds-use-with-pointer"
AbandonedPublic

Authored by hsmhsm on Sep 15 2021, 10:34 PM.

Download Raw Diff

Details

Reviewers

ronlieb
arsenm
foad
t-tye
b-sumner
JonChesterfield
lebedev.ri
nhaehnle
jdoerfert
rampitec

Summary

Reland below reverted commits with some changes:

As an optimization, split the entry block of kernel K just after top most static
alloca cluster.

From the better code transformation/optimization perspective, it is expected
that all static alloca appear as a single contiguous cluster at the start of the
entry block. If this canonical form is *not* maintained, then few static alloca
may become dynamic after the entry block split.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

hsmhsm created this revision.Sep 15 2021, 10:34 PM

Herald added subscribers: kerbowa, hiraditya, tpr and 5 others. · View Herald TranscriptSep 15 2021, 10:34 PM

hsmhsm requested review of this revision.Sep 15 2021, 10:34 PM

Herald added a project: Restricted Project. · View Herald TranscriptSep 15 2021, 10:34 PM

Herald added subscribers: llvm-commits, wdng. · View Herald Transcript

hsmhsm mentioned this in D109594: [AMDGPU] Initialize LDS pointers after alloca, but before call..Sep 15 2021, 10:36 PM

Harbormaster completed remote builds in B124144: Diff 372867.Sep 16 2021, 12:07 AM

arsenm added inline comments.Sep 16 2021, 7:56 PM

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230	Why is this using a reverse iterator? Why not just forward iterate since that's where the allocas are?

hsmhsm added inline comments.Sep 16 2021, 8:05 PM

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230	(At least in the failed openmp test case), what I noticed is that alloca at the beginning are not contiguous one after other, but there are in between addressspace casts like below (I am not sure, if there is a possibility of other in-between instructions too other than addressspace). So I went with the reverse iteration. But, actually if we are able to successfully forward iterate, that would very neat. alloca addresss-space-cast alloca addresss-space-cast ....

hsmhsm marked an inline comment as not done.Sep 16 2021, 9:00 PM

hsmhsm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230	Another possibility is as below: Since we only care about allocas inserted at the beginning of the block, and since we need to split the block before any call (since called functions is where the initialized pointers are accessed), and since it is safe to assume that all entry point allocas do appear before any call, we can forward traverse until first call, and split the block at that point.

arsenm added inline comments.Sep 17 2021, 6:08 AM

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230	Can we fix the insertion of these allocas to not intersperse addrspacecasts?

hsmhsm marked an inline comment as not done.Sep 17 2021, 7:11 AM

hsmhsm added inline comments.Sep 17 2021, 7:11 AM

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230	Let me check if we can handle it.

Cluster all allocas within the entry block of the kernel at the beginning of the block
and split the block just after the cluster.

Harbormaster completed remote builds in B124528: Diff 373400.Sep 18 2021, 3:08 AM

Rebase.

Harbormaster completed remote builds in B124558: Diff 373439.Sep 18 2021, 9:42 PM

Consider only static allocas, we don't care about dynamic allocas.

Harbormaster completed remote builds in B124582: Diff 373468.Sep 19 2021, 10:12 AM

hsmhsm edited the summary of this revision. (Show Details)Sep 19 2021, 8:16 PM

Rebase.

Harbormaster completed remote builds in B124668: Diff 373596.Sep 20 2021, 8:53 AM

rampitec added inline comments.Sep 20 2021, 4:04 PM

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230	I though the idea was to fix the alloca insertion, not to move them. Even if we want to cluster them it does not seem to be a right place to do it here as it is a more general thing. Really I'd prefer to fix FE and produce a canonical IR.

hsmhsm added inline comments.Sep 20 2021, 9:40 PM

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230	Let me clarify and re-iterate few points w.r.t alloca (which you probably know), and to clarify the requirements of this patch w.r.t alloca, and then come to a common conclusion based on it. ALLOCA: LLVM has a notion of static and dynamic allocas. Static allocas are those which appear in the entry basic block and their size is known at compile time. All other are dynamic allocas, including those which appear in the entry basic block but their size is not known. Both front-end and llvm opt passes usually (almost always) make sure that static allocas are inserted at the beginning or near beginning of the entry block, but their contiguity (continuous sequence of static allocas) is not guaranteed. For example, FE may insert in-between addressspace casts, there could be gap between the FE inserted static allocas and inlined static allocas. Nevertheless, they do appear before any call in the entry block, at the beginning or near beginning of the entry block. Above is the general expected static alloca placement for better LLVM optimization, and all LLVM passes should respect this static alloca placement and they do. ALLOCA HANDLING IN CURRENT PATCH: Naturally this patch should respect the placement of static allocas in the entry block and should split the entry block after all static allocas in the entry block. Note that this patch does not care about dynamic allocas. Now, the question is how to find the last static alloca in the entry block so that entry block split happens after it. There are two approaches. Forward iteration of entry block Backward iteration of entry block Since the contiguity of static allocas are not guaranteed even though they are placed at the beginning or near beginning of the entry block, there is no neat way of forward iteration, however, backward iteration is kind of neat. By noticing above, I had implemented the backward iteration first. But, Matt has asked me - why not forward iteration. Hence I ended-up with the forward iterated static alloca clustering business as implemented in this patch. WHAT IS NEXT?: Go with backward iteration of entry block until we find very first static alloca in the backward direction. Go with forward iteration. If we want to do this, we need a clustering work-around as implemented here. However, this work-around implementation should go as a separate work-around patch which we can revert later (as me and Matt discussed offline) when it is ensured that all static allocas are contiguous. PS: Fixing the contiguity of static allocas in the entry block by changing FE or elsewhere OR handling AMDGPU backend issues (correctness issues) due to dynamic allocas is outside the scope of this patch, that we should take them as separate tasks. And, I do not have a luxury of keep on holding this patch until these issues are being investigated and fixed.

hsmhsm marked 3 inline comments as done.Sep 20 2021, 9:58 PM

hsmhsm added inline comments.Sep 20 2021, 10:01 PM

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230	With the above clarification, I will again stick to backward iteration of entry block and get rid of entering into unnecessary business of alloca clustering (from this patch perspective).

Find entry block split point (just after last static alloca) via backward iteration
of the block and get rid of entering into unnecessary workaround business of alloca
clustering (from this patch perspective) when we do forward iteration.

hsmhsm edited the summary of this revision. (Show Details)Sep 20 2021, 11:03 PM

Harbormaster completed remote builds in B124811: Diff 373783.Sep 20 2021, 11:39 PM

JonChesterfield added inline comments.Sep 20 2021, 11:57 PM

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230	Both front-end and llvm opt passes usually (almost always) make sure that static allocas are inserted at the beginning or near beginning of the entry block, but their contiguity (continuous sequence of static allocas) is not guaranteed. For example, FE may insert in-between addressspace casts, there could be gap between the FE inserted static allocas and inlined static allocas. Nevertheless, they do appear before any call in the entry block, at the beginning or near beginning of the entry block. Above is the general expected static alloca placement for better LLVM optimization, and all LLVM passes should respect this static alloca placement and they do. If this is an invariant that passes maintain, which essentially means the IR verifier rejects other forms and preferably it is documented, then we can rely on it here. If however it is merely a feature of the test cases you've looked at then we cannot. If unsure, please verify (by walking over the entry basic block and looking) that your required precondition is met before applying the transform. Given this is not a lowering pass, it's a transform that trades off cycles for memory, it is not justified to risk breaking applications in exchange for not dynamically checking the assumptions.

hsmhsm marked an inline comment as done.Sep 21 2021, 12:14 AM

hsmhsm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230	If this is an invariant that passes maintain, which essentially means the IR verifier rejects other forms and preferably it is documented, then we can rely on it here. If however it is merely a feature of the test cases you've looked at then we cannot. This is not something which is required from correctness perspective, which is something required from better optimization perspective. So, IR verifier cannot reject other forms as you mention above. And it is documented here https://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas after related discussion happened here https://lists.llvm.org/pipermail/llvm-dev/2015-September/090168.html If unsure, please verify (by walking over the entry basic block and looking) that your required precondition is met before applying the transform. Given this is not a lowering pass, it's a transform that trades off cycles for memory, it is not justified to risk breaking applications in exchange for not dynamically checking the assumptions. As I clarified in my previous lengthy comment, all I need to do here is just to respect static alloca placement (from better optimization perspective). I do not understand why I need to do what you tell above. If application breaks because of dynamic allocas or something else, then it is altogether a different issue and not related to breaking of functionality due to this patch. Let's root cause and fix them when we encounter them. Why should I overcomplicate this patch by adding unrelated stuff into this patch?

hsmhsm marked an inline comment as done.Sep 21 2021, 12:24 AM

hsmhsm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230	Also [if you have not yet read it], please go through the reply's to the related alloca question that I had recently posted to llvm-dev. Search for the message - Alloca, Inlining and Entry Basic Block. Though, I have not yet replied to it.

JonChesterfield added inline comments.Sep 21 2021, 12:54 AM

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230	You are assuming that all entry block alloca with fixed size occur before all calls in the entry block, which you agree is not an invariant on the IR. So you know your pass is unsound, but you want to ship it anyway?

hsmhsm marked an inline comment as done.Sep 21 2021, 2:19 AM

hsmhsm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230	As I already clarified, it is LLVM expectation that "all static allocas should be placed before any call" (from better optimization perspective). And, as I understand it, community expect that all passes (and whatever implementation which touches it including openmp) should adhere to this assumption, even though it is not a strictly required invariant property from correctness perspective, and even expect to fix it if in case any pass breaks it. Within this pass, I am piggy-backing on above LLVM expectation and assumption. If this assumption is broken by some other pass P then this pass is also broken until P fixes it. That said, if we want to be extra cautious about the above assumption within this pass, we can actually check whether entry blocks of kernels satisfy above assumption or not, and skip running the pass even if single kernel violates it. In-fact, I even tried it in the patch https://reviews.llvm.org/D109594 (line 433). But, it was rejected.

Transforming legal IR to broken IR cannot be justified by "the input probably doesn't look like that". That you disagree with that axiom is wasting a lot of development time.

This revision now requires changes to proceed.Sep 21 2021, 3:25 AM

In D109870#3011810, @JonChesterfield wrote:

Transforming legal IR to broken IR cannot be justified by "the input probably doesn't look like that". That you disagree with that axiom is wasting a lot of development time.

Show me an input legal example which is broken after this pass.

Interleave alloca with calls to functions that use LDS in the entry block. You'll either fail to initialise the LDS variable before use or convert static alloca to dynamic, depending on the iteration order you choose. Iterating forwards will work for LDS and pessimise alloca. Iterating backwards will fail to initialise LDS variables before use.

Edit: and at time of comment you're iterating backwards

In D109870#3011849, @JonChesterfield wrote:

Interleave alloca with calls to functions that use LDS in the entry block. You'll either fail to initialise the LDS variable before use or convert static alloca to dynamic, depending on the iteration order you choose. Iterating forwards will work for LDS and pessimise alloca. Iterating backwards will fail to initialise LDS variables before use.

Edit: and at time of comment you're iterating backwards

As I clarified in my previous comments, If there exist Interleave alloca with calls to functions that use LDS in the entry block in the input ir, then input ir is already broken before coming to this pass, and this pass is also broken because it works on broken input ir. So, the pass which first broke the input should be fixed (if openmp is breaking it then openmp should fix it). You cannot ask and expect all the broken things to be fixed as part of this patch.

Nothing is clear here. I have zero comprehension of what you are trying to do. This pass will take IR that describes a meaningful computation and translate it to one that uses uninitialised memory. You'd rather argue that this is fine than write an 'if' statement to skip a function containing unexpected IR.

I suggest you escalate this to whatever authority you consider appropriate.

In D109870#3011994, @JonChesterfield wrote:

Nothing is clear here. I have zero comprehension of what you are trying to do. This pass will take IR that describes a meaningful computation and translate it to one that uses uninitialised memory. You'd rather argue that this is fine than write an 'if' statement to skip a function containing unexpected IR.

I suggest you escalate this to whatever authority you consider appropriate.

No, actually it is not clear to me your line of argument - Please checkout with community if an llvm ir with "function call with interleaving static allocas" is a meaningful ir from llvm transformation and optimization perspective. As I understand it from replies to my email, community thinks that it is broken ir, it should be fixed, even though there is nothing wrong in the ir from correctness perspective, and community expect to place all static allocas before any call.

And, as I also clarified, if we think that in-spite of above expectation, if we want to be cautious about it in this pass, then we can add IF check as I implemented at https://reviews.llvm.org/D109594 and skip running the pass if there is "function call with interleaving static allocas". Note that you cannot just skip kernels which have this pattern, and continue with other kernels. We should skip the pass, even if one kernel has this pattern. The reason is, consider this situation - Kernel K1 and K2 calls function FOO, FOO uses LDS, K1 has this pattern, but K2 does not. In this case, if we skip K1, and go with K2, then transformation itself will be broken.

Reviewers need to agree on above.

The point I am making here is - I cannot fix whole universe of issues as part of this patch within in a short frame of time. We need to go with a solution which works for us now, and come back it once the pass settles down.

Yes, we need to have internal meeting for better clarification, we cannot go endlessly like this without common conclusion.

What's the subject of the email to llvm-dev you are referring to here?

In D109870#3012073, @JonChesterfield wrote:

What's the subject of the email to llvm-dev you are referring to here?

"Alloca, Inlining and Entry Basic Block"

Here is the link - https://groups.google.com/g/llvm-dev/c/-E6cKEQXTIw

Hi Roman, Nicolai,

This is an IR transform which assumes alloca are contiguous within the entry block. Mahesha is presenting your replies to the linked thread as community approval for miscompiling code where that assumption does not hold. Somewhat indirectly, this pass will replace load/store of a addrspace(3) global variable with load/store from an uninitialised address if there happens to be a call to a function using said variable in the middle of the alloca in the entry block.

I won't speak for either of you here. I will say that I consider using your replies to a tangentially related question as authority to fast track a known broken IR pass to be inconceivable behaviour.

In D109870#3012287, @JonChesterfield wrote:

This is an IR transform which assumes alloca are contiguous within the entry block. Mahesha is presenting your replies to the linked thread as community approval for miscompiling code where that assumption does not hold.

FWIW: LLVM-IR, in general and especially when coming from Clang, has all but VLA allocas in the entry block. That said, there is no rule in the IR that it needs to have them in the entry block. The IR is not broken.
If you encounter IR with allocas not in the entry block it is worth to look at that and to do canonicalization. However, from the tests and patch it looks like there is a second assumption here that does not hold at all:
Allocas do not need to be clustered in any way in the entry block.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
188	There is no reason to assume allocas are clustered in any way in the entry block. That is certainly never the case.

In D109870#3012287, @JonChesterfield wrote:

Hi Roman, Nicolai,

This is an IR transform which assumes alloca are contiguous within the entry block. Mahesha is presenting your replies to the linked thread as community approval for miscompiling code where that assumption does not hold. Somewhat indirectly, this pass will replace load/store of a addrspace(3) global variable with load/store from an uninitialised address if there happens to be a call to a function using said variable in the middle of the alloca in the entry block.

I won't speak for either of you here. I will say that I consider using your replies to a tangentially related question as authority to fast track a known broken IR pass to be inconceivable behaviour.

This pass is an optimization. We do not need anything here for correctness and can do what's simplest. The fact that out of entry block allocas may be broken is totally unrelated to this patch, and all discussion related to correctness is a distraction here

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
188	It's not required, but it's good enough for optimization purposes. Nothing should be trying to produce IR with allocas in any way

In D109870#3012287, @JonChesterfield wrote:

Hi Roman, Nicolai,

This is an IR transform which assumes alloca are contiguous within the entry block. Mahesha is presenting your replies to the linked thread as community approval for miscompiling code where that assumption does not hold. Somewhat indirectly, this pass will replace load/store of a addrspace(3) global variable with load/store from an uninitialised address if there happens to be a call to a function using said variable in the middle of the alloca in the entry block.

I won't speak for either of you here. I will say that I consider using your replies to a tangentially related question as authority to fast track a known broken IR pass to be inconceivable behaviour.

Hi Jon,

Small correction: I am not claiming that "alloca are contiguous within the entry block". please re-read my replies.

Hi Roman, Nicolai,

Here is my understanding and claim about static allocas (not dynamic allocas) - All static allocas should appear in the entry basic block before any function call for better optimization opportunities. If there are interleaved static allocas with function call in-between, such an ir is considered broken, even though the ir is valid from correctness perspective. And if any pass is not adhering to the requirement that all static allocas should be placed in the entry block before any function call, then such a pass is considered broken since it may leads to surprising result in general.

Let me know if my above understanding is correct or not.

PS: I also sent an email to llvm-dev with the subject line - "Placement of static allocas" for wider visibility, and probably we can take this discussion to llvm-dev.

In D109870#3012515, @jdoerfert wrote:

In D109870#3012287, @JonChesterfield wrote:

This is an IR transform which assumes alloca are contiguous within the entry block. Mahesha is presenting your replies to the linked thread as community approval for miscompiling code where that assumption does not hold.

FWIW: LLVM-IR, in general and especially when coming from Clang, has all but VLA allocas in the entry block. That said, there is no rule in the IR that it needs to have them in the entry block. The IR is not broken.
If you encounter IR with allocas not in the entry block it is worth to look at that and to do canonicalization. However, from the tests and patch it looks like there is a second assumption here that does not hold at all:
Allocas do not need to be clustered in any way in the entry block.

Please note that I am talking *only* about static allocas here, and not dynamic allocas. Also note that I am talking only about requirement of placing all static allocas within the entry block before any function call, not about their contiguity or cluster.

At the moment, it is not *very clear* if LLVM IR with static alloca after call
is legal or not.

In this pass, since we need to split the entry block before any call, any static
alloca after the call will be converted into dynamic alloca which we want to avoid.

So, for now, we skip running this pass, if there a kernel which has static
alloca after call, and we will revisit this code later once we have clear clarity
on the placement of static alloca.

Harbormaster completed remote builds in B125030: Diff 374091.Sep 21 2021, 7:39 PM

In D109870#3014083, @hsmhsm wrote:

At the moment, it is not *very clear* if LLVM IR with static alloca after call
is legal or not.

Calls have absolutely nothing to do with allocas. Allocas may legally appear anywhere, this isn't in question. Allocas are strongly recommended to be placed in the entry block to enable optimizations. Further, clustering them at the top of the entry block is nicer IR. This is in no way a requirement, but an optimization pass that doesn't strictly need to touch every alloca doesn't need to find every alloca

In this pass, since we need to split the entry block before any call, any static
alloca after the call will be converted into dynamic alloca which we want to avoid.

So, for now, we skip running this pass, if there a kernel which has static
alloca after call, and we will revisit this code later once we have clear clarity
on the placement of static alloca.

This pass has no reason to inspect calls. What you care about is splitting the entry block to insert some initialization code. In the process of doing this, you want to avoid moving allocas out of the entry block. This is trivial to do if all the allocas are clustered at the top of the block. If you ignore allocas beyond the top cluster, there is nothing wrong with the IR. If you happen to sink some out of the entry block, it's still correct, just suboptimal.

In D109870#3015473, @arsenm wrote:

In D109870#3014083, @hsmhsm wrote:

At the moment, it is not *very clear* if LLVM IR with static alloca after call
is legal or not.

Calls have absolutely nothing to do with allocas. Allocas may legally appear anywhere, this isn't in question. Allocas are strongly recommended to be placed in the entry block to enable optimizations. Further, clustering them at the top of the entry block is nicer IR. This is in no way a requirement, but an optimization pass that doesn't strictly need to touch every alloca doesn't need to find every alloca

In this pass, since we need to split the entry block before any call, any static
alloca after the call will be converted into dynamic alloca which we want to avoid.

So, for now, we skip running this pass, if there a kernel which has static
alloca after call, and we will revisit this code later once we have clear clarity
on the placement of static alloca.

This pass has no reason to inspect calls. What you care about is splitting the entry block to insert some initialization code. In the process of doing this, you want to avoid moving allocas out of the entry block. This is trivial to do if all the allocas are clustered at the top of the block. If you ignore allocas beyond the top cluster, there is nothing wrong with the IR. If you happen to sink some out of the entry block, it's still correct, just suboptimal.

Agreed cent percent.

Further, I have pushed CFE patch at https://reviews.llvm.org/D110257 which fixes the problem where the contiguity of static allocas at the very top of the entry block was broken. Once this patch makes it's way to trunk, all I need to do here in this patch is to just scan the very top static alloca cluster (as an optimization), and split the block just after the cluster.

If any other static alloca which is not part of the top cluster (for some in-efficient reason) happens to move out of entry block, and which exposes hidden bug in AMDGPU back-end, then it is the *not* the responsibility of this patch. It is a different issue to be fixed, and blaming this patch for it is being exposing the hidden bug is not at all acceptable anymore.

As an optimization, split the entry block of kernel K just after top most
static alloca cluster.

From the better code transformation/optimization perspective, it is expected
that all static alloca appear as a single contiguous cluster at the start of
the entry block. If this canonical form is *not* maintained, then few static
alloca may become dynamic after the entry block split here.

hsmhsm added a parent revision: D110257: [CFE][Codegen] Make sure to maintain the contiguity of all the static allocas.Sep 29 2021, 2:16 AM

hsmhsm edited the summary of this revision. (Show Details)Sep 29 2021, 2:24 AM

Harbormaster completed remote builds in B126275: Diff 375812.Sep 29 2021, 2:42 AM

Rebase and update while loop for better readability.

Harbormaster completed remote builds in B126489: Diff 376097.Sep 29 2021, 10:03 PM

Rebase.

Harbormaster completed remote builds in B127925: Diff 378436.Oct 9 2021, 5:44 AM

hsmhsm marked 2 inline comments as done.Oct 9 2021, 11:02 PM

Rebase.

Harbormaster completed remote builds in B128262: Diff 378877.Oct 11 2021, 10:23 PM

Rebase.

Hi @arsenm, The pre-requisite front-end patch (https://reviews.llvm.org/D110257) has landed to llvm trunk. Now, this patch needs an action to be taken. Please have a look at this patch during your free time.

Hi @JonChesterfield and @jdoerfert, you both had rejected this patch initially asking for the changes. Your review and acceptance of this patch is must, otherwise this patch will not land even if some other reviewer accept it. So, kind request to take a look at this patch.

Minor changes to comment.

Harbormaster completed remote builds in B133410: Diff 386049.Nov 9 2021, 8:32 PM

hsmhsm removed reviewers: JonChesterfield, lebedev.ri, nhaehnle, jdoerfert.Nov 15 2021, 7:52 PM

Rebase.

hsmhsm edited the summary of this revision. (Show Details)Nov 15 2021, 7:57 PM

Harbormaster completed remote builds in B134419: Diff 387482.Nov 15 2021, 8:36 PM

Hi @JonChesterfield and @jdoerfert, you both had rejected this patch initially asking for the changes. Your review and acceptance of this patch is must, otherwise this patch will not land even if some other reviewer accept it. So, kind request to take a look at this patch.

I haven't seen someone remove reviewers from their patch when said reviewers marked it 'requested changes' before. Can you indicate why you did so? Reinstating my requested changes flag in the mean time.

This revision now requires changes to proceed.Nov 15 2021, 11:56 PM

Resigning from the review as it should get approval from people blocking it. This is THE process.

In D109870#3133844, @JonChesterfield wrote:

Hi @JonChesterfield and @jdoerfert, you both had rejected this patch initially asking for the changes. Your review and acceptance of this patch is must, otherwise this patch will not land even if some other reviewer accept it. So, kind request to take a look at this patch.

I haven't seen someone remove reviewers from their patch when said reviewers marked it 'requested changes' before. Can you indicate why you did so? Reinstating my requested changes flag in the mean time.

I removed you from the reviewer because, I personally think that you just want to avoid this patch from getting approved, though I do not know the real reason. And, I think we have discussed enough w.r.t this patch. There is nothing that I can make any changes to this patch as you insist here. I am going to abandon this patch.

I do not think this patch makes any progress. Enough is enough, I am abandoning it.

I personally think that you just want to avoid this patch from getting approved, though I do not know the real reason

I'm mistrusting of this patch because it has been landed repeatedly and broke amdgpu openmp every time. That's the 'real reason', every time you land this is breaks openmp and wastes a bunch of my time picking up the pieces. I think you required D110257 to land before it might work this time around and that only went in a few hours ago.

What I'd have liked to see as part of this patch is runtime tests that show that it works now, given at all points in time we ran into tests crashing with 'error: memory access' or similar.

Abandoning this optimisation also works from my perspective, but leaving the transform as dead code in trunk indefinitely isn't OK. We need to either make the pass work (for which IR-only tests have repeatedly proven insufficient) or delete it from the tree.

I can't find anything on https://llvm.org/docs/CodeReview.html that explicitly rules out your strategy of removing reviewers when they request changes. Maybe it hasn't come up before, I guess I'll ping llvm-dev for clarification.

Patch was abandoned while writing the above reply - do you intend to delete the dead pass from llvm trunk shortly?

In D109870#3133898, @JonChesterfield wrote:

I personally think that you just want to avoid this patch from getting approved, though I do not know the real reason

I'm mistrusting of this patch because it has been landed repeatedly and broke amdgpu openmp every time. That's the 'real reason', every time you land this is breaks openmp and wastes a bunch of my time picking up the pieces.

This is not really true, openmp test fails because this patch exposes some other bug - either in openmp or in amdgpu backend. We should fix that bug instead of blaming this patch.

I think you required D110257 to land before it might work this time around and that only went in a few hours ago.

D110257 landed on Nov 10th, not few hours ago, Please cross-check.

What I'd have liked to see as part of this patch is runtime tests that show that it works now, given at all points in time we ran into tests crashing with 'error: memory access' or similar.

If you really wanted this, we could have collaborated internally in a meaningful manner. By the way, if you just wanted thorough openmp testing before approving this patch, why did you again mark it as "needs revision"? After my ping here (which was last week on 10th), you could have approached me to help performing thorough openmp testing.

Abandoning this optimisation also works from my perspective,
but leaving the transform as dead code in trunk indefinitely isn't OK. We need to either make the pass work (for which IR-only tests have repeatedly proven insufficient) or delete it from the tree.

Inform it to AMD technical team, and take further action accordingly.

I can't find anything on https://llvm.org/docs/CodeReview.html that explicitly rules out your strategy of removing reviewers when they request changes. Maybe it hasn't come up before, I guess I'll ping llvm-dev for clarification.

No comment.

Revision Contents

Path

Size

llvm/

lib/

Target/

AMDGPU/

AMDGPUReplaceLDSUseWithPointer.cpp

36 lines

AMDGPUTargetMachine.cpp

2 lines

test/

CodeGen/

AMDGPU/

llc-pipeline.ll

5 lines

replace-lds-by-ptr-call-diamond-shape.ll

2 lines

replace-lds-by-ptr-call-selected_functions.ll

2 lines

replace-lds-by-ptr-call-to-declare-only-func.ll

2 lines

replace-lds-by-ptr-ignore-global-scope-use.ll

2 lines

replace-lds-by-ptr-ignore-inline-asm-call.ll

2 lines

replace-lds-by-ptr-ignore-kernel-only-used-lds.ll

2 lines

replace-lds-by-ptr-ignore-not-reachable-lds.ll

2 lines

replace-lds-by-ptr-ignore-small-lds.ll

2 lines

replace-lds-by-ptr-indirect-call-diamond-shape.ll

2 lines

replace-lds-by-ptr-indirect-call-selected_functions.ll

2 lines

replace-lds-by-ptr-indirect-call-signature-match.ll

2 lines

	replace-lds-by-ptr-split-entry-bb-after-top-static-alloca-cluster.ll
	replace-lds-by-ptr-use-within-const-expr1.ll

31 lines

replace-lds-by-ptr-use-multiple-lds.ll

2 lines

replace-lds-by-ptr-use-same-lds.ll

2 lines

replace-lds-by-ptr-use-within-const-expr1.ll

2 lines

replace-lds-by-ptr-use-within-const-expr2.ll

2 lines

replace-lds-by-ptr-use-within-phi-inst.ll

2 lines

Diff 378877

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp

Show First 20 Lines • Show All 176 Lines • ▼ Show 20 Lines	GlobalVariable createLDSPointer(GlobalVariable GV) {
LDSPointer->setAlignment(AMDGPU::getAlign(DL, LDSPointer));		LDSPointer->setAlignment(AMDGPU::getAlign(DL, LDSPointer));

// Mark that an associated LDS pointer is created for LDS.		// Mark that an associated LDS pointer is created for LDS.
LDSToPointer[GV] = LDSPointer;		LDSToPointer[GV] = LDSPointer;

return LDSPointer;		return LDSPointer;
}		}

		// As an optimization, split the entry block of kernel K just after top most
		// static alloca cluster.
		//
		// From the better code transformation/optimization perspective, it is
		jdoerfertUnsubmitted Done Reply Inline Actions There is no reason to assume allocas are clustered in any way in the entry block. That is certainly never the case. jdoerfert: There is no reason to assume allocas are clustered in any way in the entry block. That is…
		arsenmUnsubmitted Done Reply Inline Actions It's not required, but it's good enough for optimization purposes. Nothing should be trying to produce IR with allocas in any way arsenm: It's not required, but it's good enough for optimization purposes. Nothing should be trying to…
		// expected that all static alloca appear as a single contiguous cluster at
		// the start of the entry block. If this canonical form is not maintained,
		// then few static alloca may become dynamic after the entry block split here.
		Instruction findEntryBlockSplitPoint(Function K) {
		auto &EBB = K->getEntryBlock();
		auto Iter = EBB.getFirstInsertionPt();
		auto EI = &Iter;
		auto *TI = EBB.getTerminator();

		// Locate the first static alloca.
		while ((&Iter != TI) && !isa<CallBase>(&Iter) && !isa<AllocaInst>(&*Iter))
		++Iter;

		// Top most static alloca cluster not found? Split the block at the first
		// insertion point.
		if (!isa<AllocaInst>(&*Iter))
		return EI;

		// We have found (at least one) static alloca at the top of the entry block.
		// Skip them.
		while (isa<AllocaInst>(&*Iter))
		++Iter;

		// Split the block just after top most static alloca cluster.
		return &*Iter;
		}

// Split entry basic block in such a way that only lane 0 of each wave does		// Split entry basic block in such a way that only lane 0 of each wave does
// the LDS pointer initialization, and return newly created basic block.		// the LDS pointer initialization, and return newly created basic block.
BasicBlock activateLaneZero(Function K) {		BasicBlock activateLaneZero(Function K) {
// If the entry basic block of kernel K is already split, then return		// If the entry basic block of kernel K is already split, then return
// newly created basic block.		// newly created basic block.
auto BasicBlockEntry = KernelToInitBB.insert(std::make_pair(K, nullptr));		auto BasicBlockEntry = KernelToInitBB.insert(std::make_pair(K, nullptr));
if (!BasicBlockEntry.second)		if (!BasicBlockEntry.second)
return BasicBlockEntry.first->second;		return BasicBlockEntry.first->second;

// Split entry basic block of kernel K.		// Split entry basic block of kernel K just after all static allocas.
auto EI = &((K->getEntryBlock().getFirstInsertionPt()));		IRBuilder<> Builder(findEntryBlockSplitPoint(K));
IRBuilder<> Builder(EI);

Value *Mbcnt =		Value *Mbcnt =
Builder.CreateIntrinsic(Intrinsic::amdgcn_mbcnt_lo, {},		Builder.CreateIntrinsic(Intrinsic::amdgcn_mbcnt_lo, {},
{Builder.getInt32(-1), Builder.getInt32(0)});		{Builder.getInt32(-1), Builder.getInt32(0)});
		arsenmUnsubmitted Done Reply Inline Actions Why is this using a reverse iterator? Why not just forward iterate since that's where the allocas are? arsenm: Why is this using a reverse iterator? Why not just forward iterate since that's where the…
		hsmhsmAuthorUnsubmitted Done Reply Inline Actions (At least in the failed openmp test case), what I noticed is that alloca at the beginning are not contiguous one after other, but there are in between addressspace casts like below (I am not sure, if there is a possibility of other in-between instructions too other than addressspace). So I went with the reverse iteration. But, actually if we are able to successfully forward iterate, that would very neat. alloca addresss-space-cast alloca addresss-space-cast .... hsmhsm: (At least in the failed openmp test case), what I noticed is that alloca at the beginning are…
		hsmhsmAuthorUnsubmitted Done Reply Inline Actions Another possibility is as below: Since we only care about allocas inserted at the beginning of the block, and since we need to split the block before any call (since called functions is where the initialized pointers are accessed), and since it is safe to assume that all entry point allocas do appear before any call, we can forward traverse until first call, and split the block at that point. hsmhsm: Another possibility is as below: Since we only care about allocas inserted at the beginning of…
		arsenmUnsubmitted Done Reply Inline Actions Can we fix the insertion of these allocas to not intersperse addrspacecasts? arsenm: Can we fix the insertion of these allocas to not intersperse addrspacecasts?
		hsmhsmAuthorUnsubmitted Done Reply Inline Actions Let me check if we can handle it. hsmhsm: Let me check if we can handle it.
		rampitecUnsubmitted Done Reply Inline Actions I though the idea was to fix the alloca insertion, not to move them. Even if we want to cluster them it does not seem to be a right place to do it here as it is a more general thing. Really I'd prefer to fix FE and produce a canonical IR. rampitec: I though the idea was to fix the alloca insertion, not to move them. Even if we want to cluster…
		hsmhsmAuthorUnsubmitted Done Reply Inline Actions Let me clarify and re-iterate few points w.r.t alloca (which you probably know), and to clarify the requirements of this patch w.r.t alloca, and then come to a common conclusion based on it. ALLOCA: LLVM has a notion of static and dynamic allocas. Static allocas are those which appear in the entry basic block and their size is known at compile time. All other are dynamic allocas, including those which appear in the entry basic block but their size is not known. Both front-end and llvm opt passes usually (almost always) make sure that static allocas are inserted at the beginning or near beginning of the entry block, but their contiguity (continuous sequence of static allocas) is not guaranteed. For example, FE may insert in-between addressspace casts, there could be gap between the FE inserted static allocas and inlined static allocas. Nevertheless, they do appear before any call in the entry block, at the beginning or near beginning of the entry block. Above is the general expected static alloca placement for better LLVM optimization, and all LLVM passes should respect this static alloca placement and they do. ALLOCA HANDLING IN CURRENT PATCH: Naturally this patch should respect the placement of static allocas in the entry block and should split the entry block after all static allocas in the entry block. Note that this patch does not care about dynamic allocas. Now, the question is how to find the last static alloca in the entry block so that entry block split happens after it. There are two approaches. Forward iteration of entry block Backward iteration of entry block Since the contiguity of static allocas are not guaranteed even though they are placed at the beginning or near beginning of the entry block, there is no neat way of forward iteration, however, backward iteration is kind of neat. By noticing above, I had implemented the backward iteration first. But, Matt has asked me - why not forward iteration. Hence I ended-up with the forward iterated static alloca clustering business as implemented in this patch. WHAT IS NEXT?: Go with backward iteration of entry block until we find very first static alloca in the backward direction. Go with forward iteration. If we want to do this, we need a clustering work-around as implemented here. However, this work-around implementation should go as a separate work-around patch which we can revert later (as me and Matt discussed offline) when it is ensured that all static allocas are contiguous. PS: Fixing the contiguity of static allocas in the entry block by changing FE or elsewhere OR handling AMDGPU backend issues (correctness issues) due to dynamic allocas is outside the scope of this patch, that we should take them as separate tasks. And, I do not have a luxury of keep on holding this patch until these issues are being investigated and fixed. hsmhsm: Let me clarify and re-iterate few points w.r.t alloca (which you probably know), and to clarify…
		hsmhsmAuthorUnsubmitted Done Reply Inline Actions With the above clarification, I will again stick to backward iteration of entry block and get rid of entering into unnecessary business of alloca clustering (from this patch perspective). hsmhsm: With the above clarification, I will again stick to backward iteration of entry block and get…
		JonChesterfieldUnsubmitted Done Reply Inline Actions Both front-end and llvm opt passes usually (almost always) make sure that static allocas are inserted at the beginning or near beginning of the entry block, but their contiguity (continuous sequence of static allocas) is not guaranteed. For example, FE may insert in-between addressspace casts, there could be gap between the FE inserted static allocas and inlined static allocas. Nevertheless, they do appear before any call in the entry block, at the beginning or near beginning of the entry block. Above is the general expected static alloca placement for better LLVM optimization, and all LLVM passes should respect this static alloca placement and they do. If this is an invariant that passes maintain, which essentially means the IR verifier rejects other forms and preferably it is documented, then we can rely on it here. If however it is merely a feature of the test cases you've looked at then we cannot. If unsure, please verify (by walking over the entry basic block and looking) that your required precondition is met before applying the transform. Given this is not a lowering pass, it's a transform that trades off cycles for memory, it is not justified to risk breaking applications in exchange for not dynamically checking the assumptions. JonChesterfield: > -- Both front-end and llvm opt passes usually (almost always) make sure that static allocas…
		hsmhsmAuthorUnsubmitted Done Reply Inline Actions If this is an invariant that passes maintain, which essentially means the IR verifier rejects other forms and preferably it is documented, then we can rely on it here. If however it is merely a feature of the test cases you've looked at then we cannot. This is not something which is required from correctness perspective, which is something required from better optimization perspective. So, IR verifier cannot reject other forms as you mention above. And it is documented here https://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas after related discussion happened here https://lists.llvm.org/pipermail/llvm-dev/2015-September/090168.html If unsure, please verify (by walking over the entry basic block and looking) that your required precondition is met before applying the transform. Given this is not a lowering pass, it's a transform that trades off cycles for memory, it is not justified to risk breaking applications in exchange for not dynamically checking the assumptions. As I clarified in my previous lengthy comment, all I need to do here is just to respect static alloca placement (from better optimization perspective). I do not understand why I need to do what you tell above. If application breaks because of dynamic allocas or something else, then it is altogether a different issue and not related to breaking of functionality due to this patch. Let's root cause and fix them when we encounter them. Why should I overcomplicate this patch by adding unrelated stuff into this patch? hsmhsm: >> If this is an invariant that passes maintain, which essentially means the IR verifier…
		hsmhsmAuthorUnsubmitted Done Reply Inline Actions Also [if you have not yet read it], please go through the reply's to the related alloca question that I had recently posted to llvm-dev. Search for the message - Alloca, Inlining and Entry Basic Block. Though, I have not yet replied to it. hsmhsm: Also [if you have not yet read it], please go through the reply's to the related alloca…
		JonChesterfieldUnsubmitted Done Reply Inline Actions You are assuming that all entry block alloca with fixed size occur before all calls in the entry block, which you agree is not an invariant on the IR. So you know your pass is unsound, but you want to ship it anyway? JonChesterfield: You are assuming that all entry block alloca with fixed size occur before all calls in the…
		hsmhsmAuthorUnsubmitted Done Reply Inline Actions As I already clarified, it is LLVM expectation that "all static allocas should be placed before any call" (from better optimization perspective). And, as I understand it, community expect that all passes (and whatever implementation which touches it including openmp) should adhere to this assumption, even though it is not a strictly required invariant property from correctness perspective, and even expect to fix it if in case any pass breaks it. Within this pass, I am piggy-backing on above LLVM expectation and assumption. If this assumption is broken by some other pass P then this pass is also broken until P fixes it. That said, if we want to be extra cautious about the above assumption within this pass, we can actually check whether entry blocks of kernels satisfy above assumption or not, and skip running the pass even if single kernel violates it. In-fact, I even tried it in the patch https://reviews.llvm.org/D109594 (line 433). But, it was rejected. hsmhsm: As I already clarified, it is LLVM expectation that "all static allocas should be placed before…
Value *Cond = Builder.CreateICmpEQ(Mbcnt, Builder.getInt32(0));		Value *Cond = Builder.CreateICmpEQ(Mbcnt, Builder.getInt32(0));
Instruction *WB = cast<Instruction>(		Instruction *WB = cast<Instruction>(
Builder.CreateIntrinsic(Intrinsic::amdgcn_wave_barrier, {}, {}));		Builder.CreateIntrinsic(Intrinsic::amdgcn_wave_barrier, {}, {}));

BasicBlock *NBB = SplitBlockAndInsertIfThen(Cond, WB, false)->getParent();		BasicBlock *NBB = SplitBlockAndInsertIfThen(Cond, WB, false)->getParent();

// Mark that the entry basic block of kernel K is split.		// Mark that the entry basic block of kernel K is split.
KernelToInitBB[K] = NBB;		KernelToInitBB[K] = NBB;
▲ Show 20 Lines • Show All 252 Lines • Show Last 20 Lines

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

	Show First 20 Lines • Show All 287 Lines • ▼ Show 20 Lines

	static cl::opt<bool> EnableStructurizerWorkarounds(			static cl::opt<bool> EnableStructurizerWorkarounds(
	"amdgpu-enable-structurizer-workarounds",			"amdgpu-enable-structurizer-workarounds",
	cl::desc("Enable workarounds for the StructurizeCFG pass"), cl::init(true),			cl::desc("Enable workarounds for the StructurizeCFG pass"), cl::init(true),
	cl::Hidden);			cl::Hidden);

	static cl::opt<bool> EnableLDSReplaceWithPointer(			static cl::opt<bool> EnableLDSReplaceWithPointer(
	"amdgpu-enable-lds-replace-with-pointer",			"amdgpu-enable-lds-replace-with-pointer",
	cl::desc("Enable LDS replace with pointer pass"), cl::init(false),			cl::desc("Enable LDS replace with pointer pass"), cl::init(true),
	cl::Hidden);			cl::Hidden);

	static cl::opt<bool, true> EnableLowerModuleLDS(			static cl::opt<bool, true> EnableLowerModuleLDS(
	"amdgpu-enable-lower-module-lds", cl::desc("Enable lower module lds pass"),			"amdgpu-enable-lower-module-lds", cl::desc("Enable lower module lds pass"),
	cl::location(AMDGPUTargetMachine::EnableLowerModuleLDS), cl::init(true),			cl::location(AMDGPUTargetMachine::EnableLowerModuleLDS), cl::init(true),
	cl::Hidden);			cl::Hidden);

	static cl::opt<bool> EnablePreRAOptimizations(			static cl::opt<bool> EnablePreRAOptimizations(
	▲ Show 20 Lines • Show All 1,193 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/llc-pipeline.ll

	Show All 36 Lines
	; GCN-O0-NEXT: Early propagate attributes from kernels to functions			; GCN-O0-NEXT: Early propagate attributes from kernels to functions
	; GCN-O0-NEXT: AMDGPU Lower Intrinsics			; GCN-O0-NEXT: AMDGPU Lower Intrinsics
	; GCN-O0-NEXT: AMDGPU Inline All Functions			; GCN-O0-NEXT: AMDGPU Inline All Functions
	; GCN-O0-NEXT: CallGraph Construction			; GCN-O0-NEXT: CallGraph Construction
	; GCN-O0-NEXT: Call Graph SCC Pass Manager			; GCN-O0-NEXT: Call Graph SCC Pass Manager
	; GCN-O0-NEXT: Inliner for always_inline functions			; GCN-O0-NEXT: Inliner for always_inline functions
	; GCN-O0-NEXT: A No-Op Barrier Pass			; GCN-O0-NEXT: A No-Op Barrier Pass
	; GCN-O0-NEXT: Lower OpenCL enqueued blocks			; GCN-O0-NEXT: Lower OpenCL enqueued blocks
				; GCN-O0-NEXT: Replace within non-kernel function use of LDS with pointer
	; GCN-O0-NEXT: Lower uses of LDS variables from non-kernel functions			; GCN-O0-NEXT: Lower uses of LDS variables from non-kernel functions
	; GCN-O0-NEXT: FunctionPass Manager			; GCN-O0-NEXT: FunctionPass Manager
	; GCN-O0-NEXT: Expand Atomic instructions			; GCN-O0-NEXT: Expand Atomic instructions
	; GCN-O0-NEXT: Lower constant intrinsics			; GCN-O0-NEXT: Lower constant intrinsics
	; GCN-O0-NEXT: Remove unreachable blocks from the CFG			; GCN-O0-NEXT: Remove unreachable blocks from the CFG
	; GCN-O0-NEXT: Expand vector predication intrinsics			; GCN-O0-NEXT: Expand vector predication intrinsics
	; GCN-O0-NEXT: Scalarize Masked Memory Intrinsics			; GCN-O0-NEXT: Scalarize Masked Memory Intrinsics
	; GCN-O0-NEXT: Expand reduction intrinsics			; GCN-O0-NEXT: Expand reduction intrinsics
	▲ Show 20 Lines • Show All 120 Lines • ▼ Show 20 Lines
	; GCN-O1-NEXT: Early propagate attributes from kernels to functions			; GCN-O1-NEXT: Early propagate attributes from kernels to functions
	; GCN-O1-NEXT: AMDGPU Lower Intrinsics			; GCN-O1-NEXT: AMDGPU Lower Intrinsics
	; GCN-O1-NEXT: AMDGPU Inline All Functions			; GCN-O1-NEXT: AMDGPU Inline All Functions
	; GCN-O1-NEXT: CallGraph Construction			; GCN-O1-NEXT: CallGraph Construction
	; GCN-O1-NEXT: Call Graph SCC Pass Manager			; GCN-O1-NEXT: Call Graph SCC Pass Manager
	; GCN-O1-NEXT: Inliner for always_inline functions			; GCN-O1-NEXT: Inliner for always_inline functions
	; GCN-O1-NEXT: A No-Op Barrier Pass			; GCN-O1-NEXT: A No-Op Barrier Pass
	; GCN-O1-NEXT: Lower OpenCL enqueued blocks			; GCN-O1-NEXT: Lower OpenCL enqueued blocks
				; GCN-O1-NEXT: Replace within non-kernel function use of LDS with pointer
	; GCN-O1-NEXT: Lower uses of LDS variables from non-kernel functions			; GCN-O1-NEXT: Lower uses of LDS variables from non-kernel functions
	; GCN-O1-NEXT: FunctionPass Manager			; GCN-O1-NEXT: FunctionPass Manager
	; GCN-O1-NEXT: Infer address spaces			; GCN-O1-NEXT: Infer address spaces
	; GCN-O1-NEXT: Expand Atomic instructions			; GCN-O1-NEXT: Expand Atomic instructions
	; GCN-O1-NEXT: AMDGPU Promote Alloca			; GCN-O1-NEXT: AMDGPU Promote Alloca
	; GCN-O1-NEXT: Dominator Tree Construction			; GCN-O1-NEXT: Dominator Tree Construction
	; GCN-O1-NEXT: SROA			; GCN-O1-NEXT: SROA
	; GCN-O1-NEXT: Post-Dominator Tree Construction			; GCN-O1-NEXT: Post-Dominator Tree Construction
	▲ Show 20 Lines • Show All 235 Lines • ▼ Show 20 Lines
	; GCN-O1-OPTS-NEXT: Early propagate attributes from kernels to functions			; GCN-O1-OPTS-NEXT: Early propagate attributes from kernels to functions
	; GCN-O1-OPTS-NEXT: AMDGPU Lower Intrinsics			; GCN-O1-OPTS-NEXT: AMDGPU Lower Intrinsics
	; GCN-O1-OPTS-NEXT: AMDGPU Inline All Functions			; GCN-O1-OPTS-NEXT: AMDGPU Inline All Functions
	; GCN-O1-OPTS-NEXT: CallGraph Construction			; GCN-O1-OPTS-NEXT: CallGraph Construction
	; GCN-O1-OPTS-NEXT: Call Graph SCC Pass Manager			; GCN-O1-OPTS-NEXT: Call Graph SCC Pass Manager
	; GCN-O1-OPTS-NEXT: Inliner for always_inline functions			; GCN-O1-OPTS-NEXT: Inliner for always_inline functions
	; GCN-O1-OPTS-NEXT: A No-Op Barrier Pass			; GCN-O1-OPTS-NEXT: A No-Op Barrier Pass
	; GCN-O1-OPTS-NEXT: Lower OpenCL enqueued blocks			; GCN-O1-OPTS-NEXT: Lower OpenCL enqueued blocks
				; GCN-O1-OPTS-NEXT: Replace within non-kernel function use of LDS with pointer
	; GCN-O1-OPTS-NEXT: Lower uses of LDS variables from non-kernel functions			; GCN-O1-OPTS-NEXT: Lower uses of LDS variables from non-kernel functions
	; GCN-O1-OPTS-NEXT: FunctionPass Manager			; GCN-O1-OPTS-NEXT: FunctionPass Manager
	; GCN-O1-OPTS-NEXT: Infer address spaces			; GCN-O1-OPTS-NEXT: Infer address spaces
	; GCN-O1-OPTS-NEXT: Expand Atomic instructions			; GCN-O1-OPTS-NEXT: Expand Atomic instructions
	; GCN-O1-OPTS-NEXT: AMDGPU Promote Alloca			; GCN-O1-OPTS-NEXT: AMDGPU Promote Alloca
	; GCN-O1-OPTS-NEXT: Dominator Tree Construction			; GCN-O1-OPTS-NEXT: Dominator Tree Construction
	; GCN-O1-OPTS-NEXT: SROA			; GCN-O1-OPTS-NEXT: SROA
	; GCN-O1-OPTS-NEXT: Basic Alias Analysis (stateless AA impl)			; GCN-O1-OPTS-NEXT: Basic Alias Analysis (stateless AA impl)
	▲ Show 20 Lines • Show All 268 Lines • ▼ Show 20 Lines
	; GCN-O2-NEXT: Early propagate attributes from kernels to functions			; GCN-O2-NEXT: Early propagate attributes from kernels to functions
	; GCN-O2-NEXT: AMDGPU Lower Intrinsics			; GCN-O2-NEXT: AMDGPU Lower Intrinsics
	; GCN-O2-NEXT: AMDGPU Inline All Functions			; GCN-O2-NEXT: AMDGPU Inline All Functions
	; GCN-O2-NEXT: CallGraph Construction			; GCN-O2-NEXT: CallGraph Construction
	; GCN-O2-NEXT: Call Graph SCC Pass Manager			; GCN-O2-NEXT: Call Graph SCC Pass Manager
	; GCN-O2-NEXT: Inliner for always_inline functions			; GCN-O2-NEXT: Inliner for always_inline functions
	; GCN-O2-NEXT: A No-Op Barrier Pass			; GCN-O2-NEXT: A No-Op Barrier Pass
	; GCN-O2-NEXT: Lower OpenCL enqueued blocks			; GCN-O2-NEXT: Lower OpenCL enqueued blocks
				; GCN-O2-NEXT: Replace within non-kernel function use of LDS with pointer
	; GCN-O2-NEXT: Lower uses of LDS variables from non-kernel functions			; GCN-O2-NEXT: Lower uses of LDS variables from non-kernel functions
	; GCN-O2-NEXT: FunctionPass Manager			; GCN-O2-NEXT: FunctionPass Manager
	; GCN-O2-NEXT: Infer address spaces			; GCN-O2-NEXT: Infer address spaces
	; GCN-O2-NEXT: Expand Atomic instructions			; GCN-O2-NEXT: Expand Atomic instructions
	; GCN-O2-NEXT: AMDGPU Promote Alloca			; GCN-O2-NEXT: AMDGPU Promote Alloca
	; GCN-O2-NEXT: Dominator Tree Construction			; GCN-O2-NEXT: Dominator Tree Construction
	; GCN-O2-NEXT: SROA			; GCN-O2-NEXT: SROA
	; GCN-O2-NEXT: Basic Alias Analysis (stateless AA impl)			; GCN-O2-NEXT: Basic Alias Analysis (stateless AA impl)
	▲ Show 20 Lines • Show All 270 Lines • ▼ Show 20 Lines
	; GCN-O3-NEXT: Early propagate attributes from kernels to functions			; GCN-O3-NEXT: Early propagate attributes from kernels to functions
	; GCN-O3-NEXT: AMDGPU Lower Intrinsics			; GCN-O3-NEXT: AMDGPU Lower Intrinsics
	; GCN-O3-NEXT: AMDGPU Inline All Functions			; GCN-O3-NEXT: AMDGPU Inline All Functions
	; GCN-O3-NEXT: CallGraph Construction			; GCN-O3-NEXT: CallGraph Construction
	; GCN-O3-NEXT: Call Graph SCC Pass Manager			; GCN-O3-NEXT: Call Graph SCC Pass Manager
	; GCN-O3-NEXT: Inliner for always_inline functions			; GCN-O3-NEXT: Inliner for always_inline functions
	; GCN-O3-NEXT: A No-Op Barrier Pass			; GCN-O3-NEXT: A No-Op Barrier Pass
	; GCN-O3-NEXT: Lower OpenCL enqueued blocks			; GCN-O3-NEXT: Lower OpenCL enqueued blocks
				; GCN-O3-NEXT: Replace within non-kernel function use of LDS with pointer
	; GCN-O3-NEXT: Lower uses of LDS variables from non-kernel functions			; GCN-O3-NEXT: Lower uses of LDS variables from non-kernel functions
	; GCN-O3-NEXT: FunctionPass Manager			; GCN-O3-NEXT: FunctionPass Manager
	; GCN-O3-NEXT: Infer address spaces			; GCN-O3-NEXT: Infer address spaces
	; GCN-O3-NEXT: Expand Atomic instructions			; GCN-O3-NEXT: Expand Atomic instructions
	; GCN-O3-NEXT: AMDGPU Promote Alloca			; GCN-O3-NEXT: AMDGPU Promote Alloca
	; GCN-O3-NEXT: Dominator Tree Construction			; GCN-O3-NEXT: Dominator Tree Construction
	; GCN-O3-NEXT: SROA			; GCN-O3-NEXT: SROA
	; GCN-O3-NEXT: Basic Alias Analysis (stateless AA impl)			; GCN-O3-NEXT: Basic Alias Analysis (stateless AA impl)
	▲ Show 20 Lines • Show All 264 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION:			; DESCRIPTION:
	;			;
	; The lds global @lds_used_within_func is used within non-kernel function @func_uses_lds			; The lds global @lds_used_within_func is used within non-kernel function @func_uses_lds
	; which is recheable from kernel @kernel_reaches_lds, hence pointer replacement takes place			; which is recheable from kernel @kernel_reaches_lds, hence pointer replacement takes place
	; for @lds_used_within_func.			; for @lds_used_within_func.
	;			;

	▲ Show 20 Lines • Show All 79 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION:			; DESCRIPTION:
	;			;
	; There are three lds globals defined here, and these three lds are used respectively within			; There are three lds globals defined here, and these three lds are used respectively within
	; three non-kernel functions. There are three kernels, which call two of the non-kernel functions.			; three non-kernel functions. There are three kernels, which call two of the non-kernel functions.
	; Hence pointer replacement should take place for all three lds, and pointer initialization within			; Hence pointer replacement should take place for all three lds, and pointer initialization within
	; kernel should selectively happen depending on which lds is reachable from the kernel.			; kernel should selectively happen depending on which lds is reachable from the kernel.
	;			;
	▲ Show 20 Lines • Show All 121 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-to-declare-only-func.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION:			; DESCRIPTION:
	;			;
	; The kernel 'kern' makes a call to declared only function `foo`, hence `foo`			; The kernel 'kern' makes a call to declared only function `foo`, hence `foo`
	; is not considered as reachable callee, and is ignored. The function `goo`			; is not considered as reachable callee, and is ignored. The function `goo`
	; which uses LDS is not called from kernel 'kern', hence it is also ignored.			; which uses LDS is not called from kernel 'kern', hence it is also ignored.
	;			;

	Show All 29 Lines

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-global-scope-use.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION:			; DESCRIPTION:
	;			;
	; None of lds are pointer-replaced since they are all used in global scope in one or the other way.			; None of lds are pointer-replaced since they are all used in global scope in one or the other way.
	;			;

	; CHECK: @lds = internal addrspace(3) global [4 x i32] undef, align 4			; CHECK: @lds = internal addrspace(3) global [4 x i32] undef, align 4
	; CHECK: @lds.1 = addrspace(3) global i16 undef, align 2			; CHECK: @lds.1 = addrspace(3) global i16 undef, align 2
	▲ Show 20 Lines • Show All 44 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-inline-asm-call.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION:			; DESCRIPTION:
	;			;
	; We do not know what to do with inline asm call, we ignore it, hence pointer replacement for			; We do not know what to do with inline asm call, we ignore it, hence pointer replacement for
	; @used_only_within_func does not take place.			; @used_only_within_func does not take place.
	;			;

	; CHECK: @used_only_within_func = addrspace(3) global [4 x i32] undef, align 4			; CHECK: @used_only_within_func = addrspace(3) global [4 x i32] undef, align 4
	Show All 21 Lines

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-kernel-only-used-lds.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION ;			; DESCRIPTION ;
	;			;
	; LDS global @used_only_within_kern is used only within kernel @k0, hence pointer replacement			; LDS global @used_only_within_kern is used only within kernel @k0, hence pointer replacement
	; does not take place for @used_only_within_kern.			; does not take place for @used_only_within_kern.
	;			;

	; CHECK: @used_only_within_kern = addrspace(3) global [4 x i32] undef, align 4			; CHECK: @used_only_within_kern = addrspace(3) global [4 x i32] undef, align 4
	Show All 16 Lines

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-not-reachable-lds.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION ;			; DESCRIPTION ;
	;			;
	; LDS global @not-reachable-lds is used within non-kernel function @f0, but @f0 is not			; LDS global @not-reachable-lds is used within non-kernel function @f0, but @f0 is not
	; reachable from kernel @k, hence pointer replacement does not take place.			; reachable from kernel @k, hence pointer replacement does not take place.
	;			;

	; CHECK: @not-reachable-lds = internal addrspace(3) global [4 x i32] undef, align 4			; CHECK: @not-reachable-lds = internal addrspace(3) global [4 x i32] undef, align 4
	Show All 19 Lines

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-small-lds.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION ;			; DESCRIPTION ;
	;			;
	; LDS global @small_lds is used within non-kernel function @f0, and @f0 is reachable			; LDS global @small_lds is used within non-kernel function @f0, and @f0 is reachable
	; from kernel @k0, but since @small_lds too small for pointer replacement, pointer			; from kernel @k0, but since @small_lds too small for pointer replacement, pointer
	; replacement does not take place.			; replacement does not take place.
	;			;

	Show All 22 Lines

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-indirect-call-diamond-shape.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION:			; DESCRIPTION:
	;			;
	; The lds global @lds_used_within_func is used within non-kernel function @func_uses_lds			; The lds global @lds_used_within_func is used within non-kernel function @func_uses_lds
	; which is indirectly recheable from kernel @kernel_reaches_lds, hence pointer replacement			; which is indirectly recheable from kernel @kernel_reaches_lds, hence pointer replacement
	; takes place for @lds_used_within_func.			; takes place for @lds_used_within_func.

	; Original LDS should exit.			; Original LDS should exit.
	▲ Show 20 Lines • Show All 86 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-indirect-call-selected_functions.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION:			; DESCRIPTION:
	;			;
	; There are three lds globals defined here, and these three lds are used respectively within			; There are three lds globals defined here, and these three lds are used respectively within
	; three non-kernel functions. There are three kernels, which indirectly call two of the			; three non-kernel functions. There are three kernels, which indirectly call two of the
	; non-kernel functions. Hence pointer replacement should take place for all three lds, and			; non-kernel functions. Hence pointer replacement should take place for all three lds, and
	; pointer initialization within kernel should selectively happen depending on which lds is			; pointer initialization within kernel should selectively happen depending on which lds is
	; reachable from the kernel.			; reachable from the kernel.
	▲ Show 20 Lines • Show All 142 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-indirect-call-signature-match.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION:			; DESCRIPTION:
	;			;
	; There are three lds globals defined here, and these three lds are used respectively within			; There are three lds globals defined here, and these three lds are used respectively within
	; three non-kernel functions. There is one kernel which indirectly calls one of the non-kernel			; three non-kernel functions. There is one kernel which indirectly calls one of the non-kernel
	; functions. But since all the three non-kernel functions have same signature, all three			; functions. But since all the three non-kernel functions have same signature, all three
	; non-kernel functions are resolved as potential callees for indirect call-site. Hence we land-up			; non-kernel functions are resolved as potential callees for indirect call-site. Hence we land-up
	; pointer replacement for three lds globals.			; pointer replacement for three lds globals.
	▲ Show 20 Lines • Show All 85 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-split-entry-bb-after-top-static-alloca-cluster.ll

This file was copied from llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-within-const-expr1.ll.

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION:			; DESCRIPTION:
	;			;
	; There is one lds global defined here, and this lds is used within a single non-kernel			; There is one lds global defined here, and this lds is used within a single non-kernel
	; function, as an operand of nested constant expression, and this non-kernel function is			; function, as an operand of nested constant expression, and this non-kernel function is
	; reachable from kernel. Hence nested constant expression should to be converted into a			; reachable from kernel. Hence nested constant expression should to be converted into a
	; series of instructons and pointer replacement should take place.			; series of instructons and pointer replacement should take place.
	;			;
				; Further the entry basic block of the kernel @k0 contains alloca instruction. Hence the
				; entry basic splitting for pointer initialization should happen after alloca.
				;

	; Original LDS should exist.			; Original LDS should exist.
	; CHECK: @used_only_within_func = addrspace(3) global [4 x i32] undef, align 4			; CHECK: @used_only_within_func = addrspace(3) global [4 x i32] undef, align 4
	@used_only_within_func = addrspace(3) global [4 x i32] undef, align 4			@used_only_within_func = addrspace(3) global [4 x i32] undef, align 4

	; Pointers should be created.			; Pointers should be created.
	; CHECK: @used_only_within_func.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2			; CHECK: @used_only_within_func.ptr = internal unnamed_addr addrspace(3) global i16 undef, align 2

	Show All 13 Lines
	entry:			entry:
	store i32 %x, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_func to i32 addrspace(3)) to i32) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_func to i32 addrspace(3)) to i32) to i64)) to i32*), align 4			store i32 %x, i32* inttoptr (i64 add (i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_func to i32 addrspace(3)) to i32) to i64), i64 ptrtoint (i32* addrspacecast (i32 addrspace(3)* bitcast ([4 x i32] addrspace(3)* @used_only_within_func to i32 addrspace(3)) to i32) to i64)) to i32*), align 4
	ret void			ret void
	}			}

	; Pointer initialization code shoud be added			; Pointer initialization code shoud be added
	define amdgpu_kernel void @k0() {			define amdgpu_kernel void @k0() {
	; CHECK-LABEL: entry:			; CHECK-LABEL: entry:
	; CHECK: %0 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)			; CHECK: %0 = alloca i64, align 8, addrspace(5)
	; CHECK: %1 = icmp eq i32 %0, 0			; CHECK: %1 = alloca i64, align 8, addrspace(5)
	; CHECK: br i1 %1, label %2, label %3			; CHECK: %2 = alloca i64, align 8, addrspace(5)
				; CHECK: %3 = call i32 @llvm.amdgcn.mbcnt.lo(i32 -1, i32 0)
				; CHECK: %4 = icmp eq i32 %3, 0
				; CHECK: br i1 %4, label %5, label %6
	;			;
	; CHECK-LABEL: 2:			; CHECK-LABEL: 5:
	; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @used_only_within_func to i16), i16 addrspace(3)* @used_only_within_func.ptr, align 2			; CHECK: store i16 ptrtoint ([4 x i32] addrspace(3)* @used_only_within_func to i16), i16 addrspace(3)* @used_only_within_func.ptr, align 2
	; CHECK: br label %3			; CHECK: br label %6
	;
	; CHECK-LABEL: 3:			; CHECK-LABEL: 6:
	; CHECK: call void @llvm.amdgcn.wave.barrier()			; CHECK: call void @llvm.amdgcn.wave.barrier()
				; CHECK: %7 = addrspacecast i64 addrspace(5)* %0 to i64*
				; CHECK: %8 = addrspacecast i64 addrspace(5)* %1 to i64*
				; CHECK: %9 = addrspacecast i64 addrspace(5)* %2 to i64*
	; CHECK: call void @f0(i32 0)			; CHECK: call void @f0(i32 0)
	; CHECK: ret void			; CHECK: ret void
	entry:			entry:
				%0 = alloca i64, align 8, addrspace(5)
				%1 = alloca i64, align 8, addrspace(5)
				%2 = alloca i64, align 8, addrspace(5)
				%3 = addrspacecast i64 addrspace(5)* %0 to i64*
				%4 = addrspacecast i64 addrspace(5)* %1 to i64*
				%5 = addrspacecast i64 addrspace(5)* %2 to i64*
	call void @f0(i32 0)			call void @f0(i32 0)
	ret void			ret void
	}			}

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-multiple-lds.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION:			; DESCRIPTION:
	;			;
	; There are three lds globals defined here, and these three lds are used within a single			; There are three lds globals defined here, and these three lds are used within a single
	; non-kernel function, and this non-kernel function is reachable from kernel. Hence pointer			; non-kernel function, and this non-kernel function is reachable from kernel. Hence pointer
	; replacement is required for all three lds globals.			; replacement is required for all three lds globals.
	;			;

	▲ Show 20 Lines • Show All 57 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-same-lds.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION:			; DESCRIPTION:
	;			;
	; There is one lds global defined here, and this lds is used within a single non-kernel			; There is one lds global defined here, and this lds is used within a single non-kernel
	; function multiple times, and this non-kernel function is reachable from kernel. Hence			; function multiple times, and this non-kernel function is reachable from kernel. Hence
	; pointer takes place. But important note is - store-to/load-from pointer should happen			; pointer takes place. But important note is - store-to/load-from pointer should happen
	; only once irrespective of number of uses.			; only once irrespective of number of uses.
	;			;
	▲ Show 20 Lines • Show All 44 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-within-const-expr1.ll

This file was copied to llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-split-entry-bb-after-top-static-alloca-cluster.ll.

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION:			; DESCRIPTION:
	;			;
	; There is one lds global defined here, and this lds is used within a single non-kernel			; There is one lds global defined here, and this lds is used within a single non-kernel
	; function, as an operand of nested constant expression, and this non-kernel function is			; function, as an operand of nested constant expression, and this non-kernel function is
	; reachable from kernel. Hence nested constant expression should to be converted into a			; reachable from kernel. Hence nested constant expression should to be converted into a
	; series of instructons and pointer replacement should take place.			; series of instructons and pointer replacement should take place.
	;			;
	▲ Show 20 Lines • Show All 45 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-within-const-expr2.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION:			; DESCRIPTION:
	; There is one lds global defined here, and this lds is used within a single non-kernel			; There is one lds global defined here, and this lds is used within a single non-kernel
	; function, as an operand of nested constant expression, and this non-kernel function is			; function, as an operand of nested constant expression, and this non-kernel function is
	; reachable from kernel. Hence nested constant expression should to be converted into a			; reachable from kernel. Hence nested constant expression should to be converted into a
	; series of instructons and pointer replacement should take place. But, important note			; series of instructons and pointer replacement should take place. But, important note
	; is - only constant expression operands which uses lds should be converted into			; is - only constant expression operands which uses lds should be converted into
	; instructions, other constant expression operands which do not use lds should be left			; instructions, other constant expression operands which do not use lds should be left
	▲ Show 20 Lines • Show All 49 Lines • Show Last 20 Lines

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-within-phi-inst.ll

	; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer -amdgpu-enable-lds-replace-with-pointer=true < %s \| FileCheck %s			; RUN: opt -S -mtriple=amdgcn-- -amdgpu-replace-lds-use-with-pointer < %s \| FileCheck %s

	; DESCRIPTION:			; DESCRIPTION:
	;			;
	; Replace lds globals used within phi instruction.			; Replace lds globals used within phi instruction.
	;			;

	; Original LDS should exist.			; Original LDS should exist.
	; CHECK: @lds.1 = addrspace(3) global i32 undef, align 4			; CHECK: @lds.1 = addrspace(3) global i32 undef, align 4
	▲ Show 20 Lines • Show All 84 Lines • Show Last 20 Lines

This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Enable the pass "amdgpu-replace-lds-use-with-pointer"AbandonedPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 378877

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp

llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp

llvm/test/CodeGen/AMDGPU/llc-pipeline.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-diamond-shape.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-selected_functions.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-call-to-declare-only-func.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-global-scope-use.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-inline-asm-call.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-kernel-only-used-lds.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-not-reachable-lds.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-ignore-small-lds.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-indirect-call-diamond-shape.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-indirect-call-selected_functions.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-indirect-call-signature-match.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-split-entry-bb-after-top-static-alloca-cluster.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-multiple-lds.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-same-lds.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-within-const-expr1.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-within-const-expr2.ll

llvm/test/CodeGen/AMDGPU/replace-lds-by-ptr-use-within-phi-inst.ll

[AMDGPU] Enable the pass "amdgpu-replace-lds-use-with-pointer"
AbandonedPublic