This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Enable the pass "amdgpu-replace-lds-use-with-pointer"
AbandonedPublic

Authored by hsmhsm on Sep 15 2021, 10:34 PM.

Details

Summary

Reland below reverted commits with some changes:

As an optimization, split the entry block of kernel K just after top most static
alloca cluster.

From the better code transformation/optimization perspective, it is expected
that all static alloca appear as a single contiguous cluster at the start of the
entry block. If this canonical form is *not* maintained, then few static alloca
may become dynamic after the entry block split.

Diff Detail

Event Timeline

hsmhsm created this revision.Sep 15 2021, 10:34 PM
hsmhsm requested review of this revision.Sep 15 2021, 10:34 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 15 2021, 10:34 PM
arsenm added inline comments.Sep 16 2021, 7:56 PM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230

Why is this using a reverse iterator? Why not just forward iterate since that's where the allocas are?

hsmhsm added inline comments.Sep 16 2021, 8:05 PM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230

(At least in the failed openmp test case), what I noticed is that alloca at the beginning are not contiguous one after other, but there are in between addressspace casts like below (I am not sure, if there is a possibility of other in-between instructions too other than addressspace). So I went with the reverse iteration. But, actually if we are able to successfully forward iterate, that would very neat.

alloca
addresss-space-cast
alloca
addresss-space-cast
....

hsmhsm marked an inline comment as not done.Sep 16 2021, 9:00 PM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230

Another possibility is as below:

Since we only care about allocas inserted at the beginning of the block, and since we need to split the block before any call (since called functions is where the initialized pointers are accessed), and since it is safe to assume that all entry point allocas do appear before any call, we can forward traverse until first call, and split the block at that point.

arsenm added inline comments.Sep 17 2021, 6:08 AM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230

Can we fix the insertion of these allocas to not intersperse addrspacecasts?

hsmhsm marked an inline comment as not done.Sep 17 2021, 7:11 AM
hsmhsm added inline comments.Sep 17 2021, 7:11 AM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230

Let me check if we can handle it.

hsmhsm updated this revision to Diff 373400.Sep 18 2021, 2:29 AM

Cluster all allocas within the entry block of the kernel at the beginning of the block
and split the block just after the cluster.

hsmhsm updated this revision to Diff 373468.Sep 19 2021, 9:29 AM

Consider only static allocas, we don't care about dynamic allocas.

hsmhsm edited the summary of this revision. (Show Details)Sep 19 2021, 8:16 PM
rampitec added inline comments.Sep 20 2021, 4:04 PM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230

I though the idea was to fix the alloca insertion, not to move them. Even if we want to cluster them it does not seem to be a right place to do it here as it is a more general thing. Really I'd prefer to fix FE and produce a canonical IR.

hsmhsm added inline comments.Sep 20 2021, 9:40 PM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230

Let me clarify and re-iterate few points w.r.t alloca (which you probably know), and to clarify the requirements of this patch w.r.t alloca, and then come to a common conclusion based on it.

  1. ALLOCA:
  • LLVM has a notion of static and dynamic allocas. Static allocas are those which appear in the *entry* basic block and their size is known at compile time. All other are dynamic allocas, including those which appear in the *entry* basic block but their size is not known.
  • Both front-end and llvm opt passes usually (almost always) make sure that static allocas are inserted at the beginning or near beginning of the entry block, but their contiguity (continuous sequence of static allocas) is not guaranteed. For example, FE may insert in-between addressspace casts, there could be gap between the FE inserted static allocas and inlined static allocas. Nevertheless, they do appear before any call in the entry block, at the beginning or near beginning of the entry block.
  • Above is the general expected static alloca placement for better LLVM optimization, and all LLVM passes should respect this static alloca placement and they do.
  1. ALLOCA HANDLING IN CURRENT PATCH:
  • Naturally this patch should respect the placement of static allocas in the entry block and should split the entry block after all static allocas in the entry block. Note that this patch does not care about dynamic allocas.
  • Now, the question is how to find the last static alloca in the entry block so that entry block split happens after it. There are two approaches.
    • Forward iteration of entry block
    • Backward iteration of entry block
  • Since the contiguity of static allocas are not guaranteed even though they are placed at the beginning or near beginning of the entry block, there is no neat way of forward iteration, however, backward iteration is kind of neat.
  • By noticing above, I had implemented the backward iteration first. But, Matt has asked me - why not forward iteration.
  • Hence I ended-up with the forward iterated static alloca clustering business as implemented in this patch.
  1. WHAT IS NEXT?:
  • Go with backward iteration of entry block until we find very first static alloca in the backward direction.
  • Go with forward iteration. If we want to do this, we need a clustering work-around as implemented here. However, this work-around implementation should go as a separate work-around patch which we can revert later (as me and Matt discussed offline) when it is ensured that all static allocas are contiguous.

PS: Fixing the contiguity of static allocas in the entry block by changing FE or elsewhere OR handling AMDGPU backend issues (correctness issues) due to dynamic allocas is outside the scope of this patch, that we should take them as separate tasks. And, I do not have a luxury of keep on holding this patch until these issues are being investigated and fixed.

hsmhsm marked 3 inline comments as done.Sep 20 2021, 9:58 PM
hsmhsm added inline comments.Sep 20 2021, 10:01 PM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230

With the above clarification, I will again stick to backward iteration of entry block and get rid of entering into unnecessary business of alloca clustering (from this patch perspective).

hsmhsm updated this revision to Diff 373783.Sep 20 2021, 11:02 PM

Find entry block split point (just after last static alloca) via backward iteration
of the block and get rid of entering into unnecessary workaround business of alloca
clustering (from this patch perspective) when we do forward iteration.

hsmhsm edited the summary of this revision. (Show Details)Sep 20 2021, 11:03 PM
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230
  • Both front-end and llvm opt passes usually (almost always) make sure that static allocas are inserted at the beginning or near beginning of the entry block, but their contiguity (continuous sequence of static allocas) is not guaranteed. For example, FE may insert in-between addressspace casts, there could be gap between the FE inserted static allocas and inlined static allocas. Nevertheless, they do appear before any call in the entry block, at the beginning or near beginning of the entry block.
  • Above is the general expected static alloca placement for better LLVM optimization, and all LLVM passes should respect this static alloca placement and they do.

If this is an invariant that passes maintain, which essentially means the IR verifier rejects other forms and preferably it is documented, then we can rely on it here. If however it is merely a feature of the test cases you've looked at then we cannot.

If unsure, please verify (by walking over the entry basic block and looking) that your required precondition is met before applying the transform. Given this is not a lowering pass, it's a transform that trades off cycles for memory, it is not justified to risk breaking applications in exchange for not dynamically checking the assumptions.

hsmhsm marked an inline comment as done.Sep 21 2021, 12:14 AM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230

If this is an invariant that passes maintain, which essentially means the IR verifier rejects other forms and preferably it is documented, then we can rely on it here. If however it is merely a feature of the test cases you've looked at then we cannot.

This is *not* something which is required from correctness perspective, which is something required from *better optimization* perspective. So, IR verifier cannot reject other forms as you mention above. And it is documented here https://llvm.org/docs/Frontend/PerformanceTips.html#use-of-allocas after related discussion happened here https://lists.llvm.org/pipermail/llvm-dev/2015-September/090168.html

If unsure, please verify (by walking over the entry basic block and looking) that your required precondition is met before applying the transform. Given this is not a lowering pass, it's a transform that trades off cycles for memory, it is not justified to risk breaking applications in exchange for not dynamically checking the assumptions.

As I clarified in my previous lengthy comment, all I need to do here is just to respect static alloca placement (from better optimization perspective). I do not understand why I need to do what you tell above. If application breaks because of dynamic allocas or something else, then it is altogether a different issue and not related to breaking of functionality due to this patch. Let's root cause and fix them when we encounter them. Why should I overcomplicate this patch by adding unrelated stuff into this patch?

hsmhsm marked an inline comment as done.Sep 21 2021, 12:24 AM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230

Also [if you have not yet read it], please go through the reply's to the related alloca question that I had recently posted to llvm-dev. Search for the message - Alloca, Inlining and Entry Basic Block.

Though, I have not yet replied to it.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230

You are assuming that all entry block alloca with fixed size occur before all calls in the entry block, which you agree is not an invariant on the IR. So you know your pass is unsound, but you want to ship it anyway?

hsmhsm marked an inline comment as done.Sep 21 2021, 2:19 AM
hsmhsm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
230

As I already clarified, it is LLVM expectation that "all static allocas should be placed before any call" (from better optimization perspective). And, as I understand it, community expect that all passes (and whatever implementation which touches it including openmp) should adhere to this assumption, even though it is not a strictly required invariant property from correctness perspective, and even expect to fix it if in case any pass breaks it.

Within this pass, I am piggy-backing on above LLVM expectation and assumption. If this assumption is broken by some other pass P then this pass is also broken until P fixes it.

That said, if we want to be extra cautious about the above assumption within this pass, we can actually check whether entry blocks of kernels satisfy above assumption or not, and skip running the pass even if single kernel violates it. In-fact, I even tried it in the patch https://reviews.llvm.org/D109594 (line 433). But, it was rejected.

JonChesterfield requested changes to this revision.Sep 21 2021, 3:25 AM

Transforming legal IR to broken IR cannot be justified by "the input probably doesn't look like that". That you disagree with that axiom is wasting a lot of development time.

This revision now requires changes to proceed.Sep 21 2021, 3:25 AM
hsmhsm marked an inline comment as done.Sep 21 2021, 3:33 AM

Transforming legal IR to broken IR cannot be justified by "the input probably doesn't look like that". That you disagree with that axiom is wasting a lot of development time.

Show me an input legal example which is broken after this pass.

JonChesterfield added a comment.EditedSep 21 2021, 3:50 AM

Interleave alloca with calls to functions that use LDS in the entry block. You'll either fail to initialise the LDS variable before use or convert static alloca to dynamic, depending on the iteration order you choose. Iterating forwards will work for LDS and pessimise alloca. Iterating backwards will fail to initialise LDS variables before use.

Edit: and at time of comment you're iterating backwards

hsmhsm added a comment.EditedSep 21 2021, 4:02 AM

Interleave alloca with calls to functions that use LDS in the entry block. You'll either fail to initialise the LDS variable before use or convert static alloca to dynamic, depending on the iteration order you choose. Iterating forwards will work for LDS and pessimise alloca. Iterating backwards will fail to initialise LDS variables before use.

Edit: and at time of comment you're iterating backwards

As I clarified in my previous comments, If there exist Interleave alloca with calls to functions that use LDS in the entry block in the input ir, then input ir is already broken before coming to this pass, and this pass is also broken because it works on broken input ir. So, the pass which first broke the input should be fixed (if openmp is breaking it then openmp should fix it). You cannot ask and expect all the broken things to be fixed as part of this patch.

Nothing is clear here. I have zero comprehension of what you are trying to do. This pass will take IR that describes a meaningful computation and translate it to one that uses uninitialised memory. You'd rather argue that this is fine than write an 'if' statement to skip a function containing unexpected IR.

I suggest you escalate this to whatever authority you consider appropriate.

Nothing is clear here. I have zero comprehension of what you are trying to do. This pass will take IR that describes a meaningful computation and translate it to one that uses uninitialised memory. You'd rather argue that this is fine than write an 'if' statement to skip a function containing unexpected IR.

I suggest you escalate this to whatever authority you consider appropriate.

No, actually it is not clear to me your line of argument - Please checkout with community if an llvm ir with "function call with interleaving static allocas" is a meaningful ir from llvm transformation and optimization perspective. As I understand it from replies to my email, community thinks that it is broken ir, it should be fixed, even though there is nothing wrong in the ir from correctness perspective, and community expect to place all static allocas before any call.

And, as I also clarified, if we think that in-spite of above expectation, if we want to be cautious about it in this pass, then we can add IF check as I implemented at https://reviews.llvm.org/D109594 and skip running the pass if there is "function call with interleaving static allocas". Note that you cannot just skip kernels which have this pattern, and continue with other kernels. We should skip the pass, even if one kernel has this pattern. The reason is, consider this situation - Kernel K1 and K2 calls function FOO, FOO uses LDS, K1 has this pattern, but K2 does not. In this case, if we skip K1, and go with K2, then transformation itself will be broken.

Reviewers need to agree on above.

The point I am making here is - I cannot fix whole universe of issues as part of this patch within in a short frame of time. We need to go with a solution which works for us now, and come back it once the pass settles down.

Yes, we need to have internal meeting for better clarification, we cannot go endlessly like this without common conclusion.

What's the subject of the email to llvm-dev you are referring to here?

What's the subject of the email to llvm-dev you are referring to here?

"Alloca, Inlining and Entry Basic Block"

Here is the link - https://groups.google.com/g/llvm-dev/c/-E6cKEQXTIw

Hi Roman, Nicolai,

This is an IR transform which assumes alloca are contiguous within the entry block. Mahesha is presenting your replies to the linked thread as community approval for miscompiling code where that assumption does not hold. Somewhat indirectly, this pass will replace load/store of a addrspace(3) global variable with load/store from an uninitialised address if there happens to be a call to a function using said variable in the middle of the alloca in the entry block.

I won't speak for either of you here. I will say that I consider using your replies to a tangentially related question as authority to fast track a known broken IR pass to be inconceivable behaviour.

jdoerfert requested changes to this revision.Sep 21 2021, 7:14 AM
jdoerfert added a subscriber: jdoerfert.

This is an IR transform which assumes alloca are contiguous within the entry block. Mahesha is presenting your replies to the linked thread as community approval for miscompiling code where that assumption does not hold.

FWIW: LLVM-IR, in general and especially when coming from Clang, has all but VLA allocas in the entry block. That said, there is no rule in the IR that it needs to have them in the entry block. The IR is not broken.
If you encounter IR with allocas not in the entry block it is worth to look at that and to do canonicalization. However, from the tests and patch it looks like there is a second assumption here that does not hold at all:
Allocas do not need to be clustered in any way in the entry block.

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
188

There is no reason to assume allocas are clustered in any way in the entry block. That is certainly never the case.

Hi Roman, Nicolai,

This is an IR transform which assumes alloca are contiguous within the entry block. Mahesha is presenting your replies to the linked thread as community approval for miscompiling code where that assumption does not hold. Somewhat indirectly, this pass will replace load/store of a addrspace(3) global variable with load/store from an uninitialised address if there happens to be a call to a function using said variable in the middle of the alloca in the entry block.

I won't speak for either of you here. I will say that I consider using your replies to a tangentially related question as authority to fast track a known broken IR pass to be inconceivable behaviour.

This pass is an optimization. We do not need anything here for correctness and can do what's simplest. The fact that out of entry block allocas may be broken is totally unrelated to this patch, and all discussion related to correctness is a distraction here

llvm/lib/Target/AMDGPU/AMDGPUReplaceLDSUseWithPointer.cpp
188

It's not required, but it's good enough for optimization purposes. Nothing should be trying to produce IR with allocas in any way

hsmhsm added a comment.EditedSep 21 2021, 7:25 AM

Hi Roman, Nicolai,

This is an IR transform which assumes alloca are contiguous within the entry block. Mahesha is presenting your replies to the linked thread as community approval for miscompiling code where that assumption does not hold. Somewhat indirectly, this pass will replace load/store of a addrspace(3) global variable with load/store from an uninitialised address if there happens to be a call to a function using said variable in the middle of the alloca in the entry block.

I won't speak for either of you here. I will say that I consider using your replies to a tangentially related question as authority to fast track a known broken IR pass to be inconceivable behaviour.

Hi Jon,

Small correction: I am not claiming that "alloca are contiguous within the entry block". please re-read my replies.

Hi Roman, Nicolai,

Here is my understanding and claim about static allocas (not dynamic allocas) - All static allocas should appear in the entry basic block before any function call for better optimization opportunities. If there are interleaved static allocas with function call in-between, such an ir is considered broken, even though the ir is valid from correctness perspective. And if any pass is not adhering to the requirement that all static allocas should be placed in the entry block before any function call, then such a pass is considered broken since it may leads to surprising result in general.

Let me know if my above understanding is correct or not.

PS: I also sent an email to llvm-dev with the subject line - "Placement of static allocas" for wider visibility, and probably we can take this discussion to llvm-dev.

hsmhsm added a comment.EditedSep 21 2021, 7:49 AM

This is an IR transform which assumes alloca are contiguous within the entry block. Mahesha is presenting your replies to the linked thread as community approval for miscompiling code where that assumption does not hold.

FWIW: LLVM-IR, in general and especially when coming from Clang, has all but VLA allocas in the entry block. That said, there is no rule in the IR that it needs to have them in the entry block. The IR is not broken.
If you encounter IR with allocas not in the entry block it is worth to look at that and to do canonicalization. However, from the tests and patch it looks like there is a second assumption here that does not hold at all:
Allocas do not need to be clustered in any way in the entry block.

Please note that I am talking *only* about static allocas here, and not dynamic allocas. Also note that I am talking only about requirement of placing all static allocas within the entry block before any function call, not about their contiguity or cluster.

hsmhsm updated this revision to Diff 374091.EditedSep 21 2021, 7:16 PM

At the moment, it is not *very clear* if LLVM IR with static alloca after call
is legal or not.

In this pass, since we need to split the entry block before any call, any static
alloca after the call will be converted into dynamic alloca which we want to avoid.

So, for now, we skip running this pass, if there a kernel which has static
alloca after call, and we will revisit this code later once we have clear clarity
on the placement of static alloca.

At the moment, it is not *very clear* if LLVM IR with static alloca after call
is legal or not.

Calls have absolutely nothing to do with allocas. Allocas may legally appear anywhere, this isn't in question. Allocas are strongly recommended to be placed in the entry block to enable optimizations. Further, clustering them at the top of the entry block is nicer IR. This is in no way a requirement, but an optimization pass that doesn't strictly need to touch every alloca doesn't need to find every alloca

In this pass, since we need to split the entry block before any call, any static
alloca after the call will be converted into dynamic alloca which we want to avoid.

So, for now, we skip running this pass, if there a kernel which has static
alloca after call, and we will revisit this code later once we have clear clarity
on the placement of static alloca.

This pass has no reason to inspect calls. What you care about is splitting the entry block to insert some initialization code. In the process of doing this, you want to avoid moving allocas out of the entry block. This is trivial to do if all the allocas are clustered at the top of the block. If you ignore allocas beyond the top cluster, there is nothing wrong with the IR. If you happen to sink some out of the entry block, it's still correct, just suboptimal.

At the moment, it is not *very clear* if LLVM IR with static alloca after call
is legal or not.

Calls have absolutely nothing to do with allocas. Allocas may legally appear anywhere, this isn't in question. Allocas are strongly recommended to be placed in the entry block to enable optimizations. Further, clustering them at the top of the entry block is nicer IR. This is in no way a requirement, but an optimization pass that doesn't strictly need to touch every alloca doesn't need to find every alloca

In this pass, since we need to split the entry block before any call, any static
alloca after the call will be converted into dynamic alloca which we want to avoid.

So, for now, we skip running this pass, if there a kernel which has static
alloca after call, and we will revisit this code later once we have clear clarity
on the placement of static alloca.

This pass has no reason to inspect calls. What you care about is splitting the entry block to insert some initialization code. In the process of doing this, you want to avoid moving allocas out of the entry block. This is trivial to do if all the allocas are clustered at the top of the block. If you ignore allocas beyond the top cluster, there is nothing wrong with the IR. If you happen to sink some out of the entry block, it's still correct, just suboptimal.

Agreed cent percent.

Further, I have pushed CFE patch at https://reviews.llvm.org/D110257 which fixes the problem where the contiguity of static allocas at the very top of the entry block was broken. Once this patch makes it's way to trunk, all I need to do here in this patch is to just scan the very top static alloca cluster (as an optimization), and split the block just after the cluster.

If any other static alloca which is not part of the top cluster (for some in-efficient reason) happens to move out of entry block, and which exposes hidden bug in AMDGPU back-end, then it is the *not* the responsibility of this patch. It is a different issue to be fixed, and blaming this patch for it is being exposing the hidden bug is not at all acceptable anymore.

hsmhsm updated this revision to Diff 375812.Sep 29 2021, 2:15 AM

As an optimization, split the entry block of kernel K just after top most
static alloca cluster.

From the better code transformation/optimization perspective, it is expected
that all static alloca appear as a single contiguous cluster at the start of
the entry block. If this canonical form is *not* maintained, then few static
alloca may become dynamic after the entry block split here.

hsmhsm edited the summary of this revision. (Show Details)Sep 29 2021, 2:24 AM
hsmhsm updated this revision to Diff 376097.Sep 29 2021, 9:34 PM

Rebase and update while loop for better readability.

hsmhsm updated this revision to Diff 378436.Oct 9 2021, 5:00 AM

Rebase.

hsmhsm marked 2 inline comments as done.Oct 9 2021, 11:02 PM
hsmhsm updated this revision to Diff 386048.Nov 9 2021, 7:26 PM

Rebase.

hsmhsm added a comment.Nov 9 2021, 7:34 PM

Hi @arsenm, The pre-requisite front-end patch (https://reviews.llvm.org/D110257) has landed to llvm trunk. Now, this patch needs an action to be taken. Please have a look at this patch during your free time.

Hi @JonChesterfield and @jdoerfert, you both had rejected this patch initially asking for the changes. Your review and acceptance of this patch is must, otherwise this patch will not land even if some other reviewer accept it. So, kind request to take a look at this patch.

hsmhsm updated this revision to Diff 386049.Nov 9 2021, 7:49 PM

Minor changes to comment.

hsmhsm edited the summary of this revision. (Show Details)Nov 15 2021, 7:57 PM
JonChesterfield requested changes to this revision.EditedNov 15 2021, 11:56 PM
JonChesterfield added a subscriber: JonChesterfield.

Hi @JonChesterfield and @jdoerfert, you both had rejected this patch initially asking for the changes. Your review and acceptance of this patch is must, otherwise this patch will not land even if some other reviewer accept it. So, kind request to take a look at this patch.

I haven't seen someone remove reviewers from their patch when said reviewers marked it 'requested changes' before. Can you indicate why you did so? Reinstating my requested changes flag in the mean time.

This revision now requires changes to proceed.Nov 15 2021, 11:56 PM
rampitec resigned from this revision.Nov 16 2021, 12:12 AM

Resigning from the review as it should get approval from people blocking it. This is THE process.

Hi @JonChesterfield and @jdoerfert, you both had rejected this patch initially asking for the changes. Your review and acceptance of this patch is must, otherwise this patch will not land even if some other reviewer accept it. So, kind request to take a look at this patch.

I haven't seen someone remove reviewers from their patch when said reviewers marked it 'requested changes' before. Can you indicate why you did so? Reinstating my requested changes flag in the mean time.

I removed you from the reviewer because, I personally think that you just want to avoid this patch from getting approved, though I do not know the real reason. And, I think we have discussed enough w.r.t this patch. There is nothing that I can make any changes to this patch as you insist here. I am going to abandon this patch.

hsmhsm abandoned this revision.Nov 16 2021, 12:24 AM

I do not think this patch makes any progress. Enough is enough, I am abandoning it.

I personally think that you just want to avoid this patch from getting approved, though I do not know the real reason

I'm mistrusting of this patch because it has been landed repeatedly and broke amdgpu openmp every time. That's the 'real reason', every time you land this is breaks openmp and wastes a bunch of my time picking up the pieces. I think you required D110257 to land before it might work this time around and that only went in a few hours ago.

What I'd have liked to see as part of this patch is runtime tests that show that it works now, given at all points in time we ran into tests crashing with 'error: memory access' or similar.

Abandoning this optimisation also works from my perspective, but leaving the transform as dead code in trunk indefinitely isn't OK. We need to either make the pass work (for which IR-only tests have repeatedly proven insufficient) or delete it from the tree.

I can't find anything on https://llvm.org/docs/CodeReview.html that explicitly rules out your strategy of removing reviewers when they request changes. Maybe it hasn't come up before, I guess I'll ping llvm-dev for clarification.

Patch was abandoned while writing the above reply - do you intend to delete the dead pass from llvm trunk shortly?

I personally think that you just want to avoid this patch from getting approved, though I do not know the real reason

I'm mistrusting of this patch because it has been landed repeatedly and broke amdgpu openmp every time. That's the 'real reason', every time you land this is breaks openmp and wastes a bunch of my time picking up the pieces.

This is not really true, openmp test fails because this patch exposes some other bug - either in openmp or in amdgpu backend. We should fix that bug instead of blaming this patch.

I think you required D110257 to land before it might work this time around and that only went in a few hours ago.

D110257 landed on Nov 10th, not few hours ago, Please cross-check.

What I'd have liked to see as part of this patch is runtime tests that show that it works now, given at all points in time we ran into tests crashing with 'error: memory access' or similar.

If you really wanted this, we could have collaborated internally in a meaningful manner. By the way, if you just wanted thorough openmp testing before approving this patch, why did you again mark it as "needs revision"? After my ping here (which was last week on 10th), you could have approached me to help performing thorough openmp testing.

Abandoning this optimisation also works from my perspective,
but leaving the transform as dead code in trunk indefinitely isn't OK. We need to either make the pass work (for which IR-only tests have repeatedly proven insufficient) or delete it from the tree.

Inform it to AMD technical team, and take further action accordingly.

I can't find anything on https://llvm.org/docs/CodeReview.html that explicitly rules out your strategy of removing reviewers when they request changes. Maybe it hasn't come up before, I guess I'll ping llvm-dev for clarification.

No comment.