at the start of the entry block, which in turn would aid better code transformation/optimization.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
I do think it's cleaner/more canonical IR to cluster these at the top of the block, but I don't understand this comment:
otherwise, inliner's attempt to move static allocas from callee to caller will fail,
The inliner successfully moves allocas to the caller's entry block, even with addrspacecasts interspersed.
clang/lib/CodeGen/CGExpr.cpp | ||
---|---|---|
102–106 | Where are the addrspacecasts inserted? Could you just adjust where those are inserted instead? |
The logic at https://github.com/llvm/llvm-project/blob/main/llvm/lib/Transforms/Utils/InlineFunction.cpp#L2093 assumes that static allocas (within callee) are contiguous.
clang/lib/CodeGen/CGExpr.cpp | ||
---|---|---|
102–106 | The addressspace casts are inserted immediately after all static allocas (top static alloca cluster). An example: Before this patch: entry: %N.addr = alloca i64, align 8, addrspace(5) %N.addr.ascast = addrspacecast i64 addrspace(5)* %N.addr to i64* %vla.addr = alloca i64, align 8, addrspace(5) %vla.addr.ascast = addrspacecast i64 addrspace(5)* %vla.addr to i64* %a.addr = alloca i32*, align 8, addrspace(5) %a.addr.ascast = addrspacecast i32* addrspace(5)* %a.addr to i32** %vla.addr2 = alloca i64, align 8, addrspace(5) %vla.addr2.ascast = addrspacecast i64 addrspace(5)* %vla.addr2 to i64* %b.addr = alloca i32*, align 8, addrspace(5) %b.addr.ascast = addrspacecast i32* addrspace(5)* %b.addr to i32** %N.casted = alloca i64, align 8, addrspace(5) %N.casted.ascast = addrspacecast i64 addrspace(5)* %N.casted to i64* %.zero.addr = alloca i32, align 4, addrspace(5) %.zero.addr.ascast = addrspacecast i32 addrspace(5)* %.zero.addr to i32* %.threadid_temp. = alloca i32, align 4, addrspace(5) %.threadid_temp..ascast = addrspacecast i32 addrspace(5)* %.threadid_temp. to i32* store i64 %N, i64* %N.addr.ascast, align 8 With this patch: entry: %N.addr = alloca i64, align 8, addrspace(5) %vla.addr = alloca i64, align 8, addrspace(5) %a.addr = alloca i32*, align 8, addrspace(5) %vla.addr2 = alloca i64, align 8, addrspace(5) %b.addr = alloca i32*, align 8, addrspace(5) %N.casted = alloca i64, align 8, addrspace(5) %.zero.addr = alloca i32, align 4, addrspace(5) %.threadid_temp. = alloca i32, align 4, addrspace(5) %.threadid_temp..ascast = addrspacecast i32 addrspace(5)* %.threadid_temp. to i32* %.zero.addr.ascast = addrspacecast i32 addrspace(5)* %.zero.addr to i32* %N.casted.ascast = addrspacecast i64 addrspace(5)* %N.casted to i64* %b.addr.ascast = addrspacecast i32* addrspace(5)* %b.addr to i32** %vla.addr2.ascast = addrspacecast i64 addrspace(5)* %vla.addr2 to i64* %a.addr.ascast = addrspacecast i32* addrspace(5)* %a.addr to i32** %vla.addr.ascast = addrspacecast i64 addrspace(5)* %vla.addr to i64* %N.addr.ascast = addrspacecast i64 addrspace(5)* %N.addr to i64* store i64 %N, i64* %N.addr.ascast, align 8 |
True. Even worse. It will bail if a static alloca is used as an inalloca argument.
So, if you now interleave allocas that may be used in inalloca you also break the "canonical form".
I assume this hits mostly windows but still.
clang/lib/CodeGen/CGExpr.cpp | ||
---|---|---|
102–106 | I meant where in the clang code are these emitted, and how is that I set point found? |
No, it doesn't make that assumption. That code is an attempt to optimize the linked list splicing, so that batches of contiguous allocas can be spliced over together. I believe the code would still be correct if you removed this inner loop and left behind the outer loop, which iterates over all allocas in the entry block.
Thanks, understood it now. So, the actual logic (in the outer loop) is - scan the (contiguous ) batches of static allocas within the callees entry block, and move those batches to caller as one chunk at a time.
That said, it is still good idea (and actually an explicitly not mandated requirement) to maintain the contiguity of the static allocas at the top of the basic block as one cluster, and it should start from FE itself. So, this patch is still relevant.
clang/lib/CodeGen/CGExpr.cpp | ||
---|---|---|
102–106 | I understand your question as - where exactly is code emitted? It is Clang Codegen part, the Clang Codegen maintains a builder (function specific?) which builds and emits code, and builder always maintains default current insertion position, and it can also be asked to insert at some other place by calling the api SetInsertPoint(). In this particular case, the addrespace casts are emitted at (by calling the builder) https://github.com/llvm-mirror/clang/blob/master/lib/CodeGen/TargetInfo.cpp#445 All I am doing here is - direct the builder to insert addrespace casts just after all static allocas by appropriately setting the insert position via SetInsertPoint(). |
Pls make sure the patch passes internal CI. Thanks.
clang/test/OpenMP/distribute_parallel_for_simd_if_codegen.cpp | ||
---|---|---|
265 | Is the test updated by a script? If the original test does not check !llvm.access.group, the updated test should not check it either. This makes the test less stable. |
clang/test/OpenMP/distribute_parallel_for_simd_if_codegen.cpp | ||
---|---|---|
265 | Yes, most of the tests here are updated by script only. Probably I might have missed few command line options to the script. I have not passed any option to the script while updating it. Let me check. |
clang/test/OpenMP/distribute_parallel_for_simd_if_codegen.cpp | ||
---|---|---|
265 | I have used exact command line as in the test file plus extra option "--force-update". But it's presence or absence is not making any difference here. But the !llvm.access.group metadata is newly added. And I am not finding anyway of disabling it. Do you have any idea? or anyone else for that matter? |
The patch mostly affects GPU tests, so as long as the GPU folks (AMD/nvptx) are happy with the regenerated test cases, this seems fine to me.
clang/lib/CodeGen/CGExpr.cpp | ||
---|---|---|
103 | Why is there a special AllocaInsertPt iterator in the first place? Can you avoid any iteration logic by just always inserting at the block start? |
clang/lib/CodeGen/CGExpr.cpp | ||
---|---|---|
103 | Right. The alloca insertion point is sometimes changed to a non-entry block, and we should keep that ability. |
clang/lib/CodeGen/CGExpr.cpp | ||
---|---|---|
103 | I really do not understand this comment fully. This block of code here inserts an "addressspace cast" of recently inserted alloca, not the alloca itself. Alloca is already inserted. Please look at the start of this function. The old logic (in the left) inserts addressspace cast of recently inserted alloca immediately after recent alloca using AllocaInsertPt. As a side effect, it also makes AllocaInsertPt now point to this newly inserted addressspace cast. Hence, next alloca is inserted after this new addressspace cast, because now AllocaInsertPt is made to point this newly inserted addressspace cast. The new logic (here) fixes that by moving insertion point just beyond current AllocaInsertPt without updating AllocaInsertPt. How can I insert "addressspace cast" of an alloca, at the beginning of the block even before alloca? As I understand it, AllocaInsertPt is maintained to insert static allocas at the start of entry block, otherwise there is no any special reason to maintain such a special insertion point. Please look at https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CodeGenFunction.h#L378. That said, I am not really sure, if I have completely misunderstood the comment above. If that is the case, then, I need better clarification here about what really is expected. |
clang/lib/CodeGen/CGExpr.cpp | ||
---|---|---|
103 | Well, inserting at the top of the entry block would reverse the order of the allocas. Currently they appear in source/IRGen order, which is nice. Maintaining the order requires appending, which requires a cursor of some kind. |
clang/lib/CodeGen/CGExpr.cpp | ||
---|---|---|
103 | Yes, correct. And I am waiting for further inputs from @yaxunl / @arsenm / @jdoerfert |
@yaxunl / @jdoerfert / @tra
Can I expect your further comment/decision on this patch?
While I understand people are eager to receive feedback on their patches, it is not helpful to ping/remind the reviewers constantly.
This does generate noise for them and can consequently also reduce their interest in a patch. The recommendation for time without
review before a "ping" is send is still one week.
clang/lib/CodeGen/CGExpr.cpp | ||
---|---|---|
151 | I'm not even sure this is necessarily correct. How do we know the new store is not inside a loop and might write the value more than once, potentially overwriting a value set later? Aside from that (important) question, you need to update the documentation of the function. It doesn't correspond to the new semantics anymore. | |
clang/test/CodeGenCUDA/builtins-amdgcn.cu | ||
2 | Please prepare a pre-commit that adds auto-generated check lines. Adding them as part of a commit makes it impossible to see the difference. |
Agreed.
clang/lib/CodeGen/CGExpr.cpp | ||
---|---|---|
151 | I am not convinced with above comments -Because, this code neither changes the address nor the initializer (https://github.com/llvm/llvm-project/blob/main/clang/lib/CodeGen/CodeGenFunction.h#L2548), but only a place where the initialization happens. Further, as the documentation says, the initializer must be a constant or function argument (and surprisingly, I do not see any assertion about it). Hence, even if the initialization happens within the loop, loop invariant code motion pass should detect it. That said, if we think, it is better to keep the initialization within entry block, we can do it at the end of the block. | |
clang/test/CodeGenCUDA/builtins-amdgcn.cu | ||
2 | Will do |
clang/lib/CodeGen/CGExpr.cpp | ||
---|---|---|
151 | What the initial value is is irrelevant. Arguing LICM should take care of it is also not helpful. Changing the location of the initialization is in itself a problem: Before: a = alloca a = 0; // init is in the entry block for (...) { use(a); a = a + 1; } After, potentially: a = alloca for (...) { a = 0; // init is now at the builder insertion point use(a); a = a + 1; } |
clang/lib/CodeGen/CGExpr.cpp | ||
---|---|---|
151 | Clarification - I am not trying to argue anything here with anybody, I am trying to defend myself, where I feel I am right as per my knowledge. If I realize that I am at false later, then I correct myself, but I am not arguing, and will never ever do. Ok, I will try to fix it from the practical point of view. But, I still think as follows - From theoretical/good programming/good front-end code generation perspective, initialization of something will never happen within construct like loop, otherwise initialization has no meaning at all. |
clang/lib/CodeGen/CGExpr.cpp | ||
---|---|---|
151 |
Why would we not create a new temporary allocas inside of a loop? There is nothing in any of the APIs or descriptions that would indicate you could not create a temporary alloca and initialize it while you are generating code in a loop. I cannot see why initialization would me meaningless given that it was inserted in the entry block prior to this change. |
This patch seems to be confused. You're making two changes. In one of them, you're trying to prevent addrspacecasts from being interleaved with the sequence of allocas in the function prologue. In the other, you're moving the store emitted by InitTempAlloca so that it becomes interleaved with the sequence of allocas, but only when there's an addrspacecast.
Now, I should say that InitTempAlloca is generally a problematic function. Its purpose is to put an initialization in the prologue of the function so that it always happens prior to some other code executing. This is rarely useful, though, because the memory is usually tied to some specific feature in the code, and as Johannes says, that place in the code may be reachable multiple times, and the memory typically needs to be freshly initialized each time. Using InitTempAlloca is therefore frequently wrong, and I'm honestly not sure there's any good reason to use it. Looking at the calls, some of them know that they're in the prologue, and so it should be fine to simply emit a store normally. Other calls seem quite suspect, like the one in CGObjCGNU.cpp. And even if it's semantically okay, it's potentially doing unnecessary work up-front when it only really needs to happen if that path is taken.
So I don't really care about aesthetic differences in the code created by InitTempAlloca, because we should just remove it completely.
If we really care about not interleaving things with the alloca sequence — and right now I am not convinced that we do, because contra the comments and description of this patch, this is not an LLVM requirement of any sort — I think we should lazily create a second InsertPt instruction after the AllocaInsertPt and insert all the secondary instruction prior to that so that they appear in source order.
Not really.
In the absence of this patch, addrspacecast are interleaved with the sequence of allocas, and AllocaInsertPt always point to the end of this *interleaved sequence*. Within InitTempAlloca(), any init to alloca (or to its addrspacecast in case of an addrspacecast) happens just after AllocaInsertPt which is fine.
Now, in the presence of this patch, AllocaInsertPt points to the end of contiguous allocas but *BEFORE* any addrspacecast. This forces the changes to InitTempAlloca(). Otherwise, init of addrspacecast happens before addrspacecast itself.
Now, I should say that InitTempAlloca is generally a problematic function. Its purpose is to put an initialization in the prologue of the function so that it always happens prior to some other code executing. This is rarely useful, though, because the memory is usually tied to some specific feature in the code, and as Johannes says, that place in the code may be reachable multiple times, and the memory typically needs to be freshly initialized each time. Using InitTempAlloca is therefore frequently wrong, and I'm honestly not sure there's any good reason to use it. Looking at the calls, some of them know that they're in the prologue, and so it should be fine to simply emit a store normally. Other calls seem quite suspect, like the one in CGObjCGNU.cpp. And even if it's semantically okay, it's potentially doing unnecessary work up-front when it only really needs to happen if that path is taken.
So I don't really care about aesthetic differences in the code created by InitTempAlloca, because we should just remove it completely.
If we really care about not interleaving things with the alloca sequence — and right now I am not convinced that we do, because contra the comments and description of this patch, this is not an LLVM requirement of any sort — I think we should lazily create a second InsertPt instruction after the AllocaInsertPt and insert all the secondary instruction prior to that so that they appear in source order.
Agree. I will give it a try to make changes as you suggest, though it may take some time since it requires a bit of cleanup and handling of the changes to (possibly many) lit tests as a side effect.
Introduce a new instruction pointer which aid all the addressspace casts of static allocas
to appear in the source order immediately after all static allocas.
clang/lib/CodeGen/CodeGenFunction.h | ||
---|---|---|
392 ↗ | (On Diff #378875) | Please call this something like PostAllocaInsertPt. Instead of eagerly creating it, please create it lazily: add an accessor like getPostAllocaInsertPoint() which creates (and saves here) an instruction that immediately follows AllocaInsertPt if this is currently null. I think you should make your own instruction so that we don't mess up any code that might rely on temporarily creating and then removing dead instructions. Please use an llvm::AssertingVH<llvm::Instruction>. I think that can handle holding a null value. The comment here should be more general, like "a place in the prologue where code can be inserted that will be dominated by all the static allocas." You don't need to talk about addrspacecasts specifically; that's just one possible use case for this. Also, it's either "addrspacecast" (using the IR name of the operation) or "address space cast" (writing out the operation in normal English words); don't run together addressspace as if it were a keyword when it isn't. |
clang/lib/CodeGen/CGExpr.cpp | ||
---|---|---|
114 | Shouldn't this use cast instead? |
@rjmccall I assume, I have fixed all your review comments. In case, if I have missed something OR if you think, few more changes are required for the patch, please do let me know so that I will proceed as per the comments/suggestions. I would like to bring this patch to the closure.
This patch is waiting for an action for a long time. I am expecting at-least anyone of the reviewers to either say "yes" or "no" or "further required improvements" on this patch so that I can take further action on this patch. If you say "no" to this patch with a strong valid reason, then I will abandon this patch and move-on instead of uncertainly waiting.
LGTM. It seems all concerns have been addressed. Shall we move ahead and land this patch? Thanks.
lgtm
I believe you have addressed John's comments, and I think the IR changes mainly affect AMDGPU users, so I don't think this will be too disruptive.
Sorry about the delay, there's a bit of a bystander effect at play here with multiple reviewers.
Please change the commit message to say why this change is necessary / an improvement on what we have now.
My recollection is that the amdgpu backend crashes on some IR and this decreases the probability of that IR pattern occuring, which still sounds like fixing the wrong place to me. Was this the one where hoisting static size alloca into the entry block near the backend would the problem?
I think this patch is missing a documentation update adding the new constraint that allocas must be contiguous in IR. That would help to answer questions about which alloca must be contiguous and which can occur separated by instructions, as currently none of them need to be adjacent. Also, is this only intended to constrain the entry basic block?
Please update the documentation with this new constraint. It would be helpful to know exactly when we now require alloca instructions to be adjacent to one another. If you wish to avoid other passes breaking the invariant in future, I think it needs to be added to the IR verifier, and should have been as part of this patch.
The current commit message reflect semantics of the patch - there is nothing required to change here. The goal here is to make sure that FE keeps all the static allocas as one cluster at the start of the entry block, which is a good canonical form from the perspective of better code transformation/optimization.
This is not something specific to AMDGPU backend, but AMDGPU backend at present requires this canonical form.
Undocumented and not checked by the IR verifier. Canonical form seems to be overstating it until at least one of those is addressed.
We already discussed that this canonical form is not something that IR verifier can verify, but it is good enough for better code transformation/optimization. Please refer llvm-dev email discussion w.r.t it.
If the new invariant is that all alloca must be adjacent to one another, that's a trivial thing for the verifier to check. So I guess it's something else? Please write down what this new invariant is intended to be, preferably in the documentation, perhaps of the alloca instruction.
What llvm-dev discussion do you refer to?
Please check with llvm-dev.
What llvm-dev discussion do you refer to?
I do not remember, please search for keywords, like static allocas, and figure out.
So you won't articulate or document the new invariant and you think there's a llvm-dev discussion that says we can't verify the invariant which you won't reference, but means you won't add this to the verifier.
Request changes doesn't really work after you've applied the patch.
@rnk do you object to me reverting this? I don't think we can add an invariant to IR which is undocumented and unverified/unverifiable and the patch author seems opposed to fixing either omission.
The verifier does not check for whether or things are canonical or not. We don't really have formal definitions for what's considered canonical, it's just what people think make later optimizations easier. This is not a change in the IR rules.
Request changes doesn't really work after you've applied the patch.
@rnk do you object to me reverting this? I don't think we can add an invariant to IR which is undocumented and unverified/unverifiable and the patch author seems opposed to fixing either omission.
I object to reverting this as this has nothing which the verifier should be checking
If the amdgpu backend doesn't require this then it doesn't matter much if other passes undo it. If it's not an invariant, we don't need the verifier to alert people to passes that break it.
Git blame on the new code in clang will lead people here where they may be able to work out from the comments why this change was made.
Is this patch actually causing issues in practice? I think the decision to revert should be based on that.
I don't think this patch creates a new invariant that other passes have to respect, if that's what you're worried about. The way I see it, this patch just makes AMDGPU IR output look "nicer". Middle-end passes are free to insert casts between static allocas if they want.
Why is there a special AllocaInsertPt iterator in the first place? Can you avoid any iteration logic by just always inserting at the block start?