Previously, the OpenMP to LLVM IR conversion was setting the alloca insertion
point to the same position as the main compuation when converting OpenMP
parallel operations. This is problematic if, for example, the parallel
operation is placed inside a loop and would keep allocating on stack on each
iteration leading to stack overflow.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
Thanks @ftynse for this patch.
mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp | ||
---|---|---|
130 | I was planning to a few questions in this area. Will this getParent work correctly with nested parallelism and the if is in the nested parallel operation? llvm.func fn() { omp.parallel { omp.parallel if() { } } } | |
mlir/test/Target/LLVMIR/openmp-llvm.mlir | ||
155 | Nit: funciton -> function | |
158–159 | Is this enough to check that it is in the entry? Or should there be a check for a subsequent block name or number? |
mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp | ||
---|---|---|
130 | And if the above does work correctly, consider the following. llvm.func fn() { omp.parallel { omp.master { //<- The region of this operation will not be outlined. omp.parallel if() { } } } } |
mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp | ||
---|---|---|
130 | Not familiar with this field, but I was thinking about a case similar to (or not) above: what would happen if getParent() doesn't give a function? | |
140 | It seems to me that if bodyGenStatus was failed earlier than restoreIP, the old logic will just abort but this new version will still do the restore. Would this cause any issue, or was the old logic actually problematic? |
I skimmed this and if I get it right this won't work reliably.
The OMPIRBuilder provides an alloca insertion point as part of the body code generation callback, that is the one to use.
The user needs to keep a stack for all alloca insertion points and push/pop them as the body callbacks are entered/left.
The same has to happen for other outlined operators, I'm imagining fir.concurrent.do or something like that.
mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp | ||
---|---|---|
92 | ^^ allocaIP is not used but should be. |
Let me clarify, the alloca insertion point given to the body generation callback should be used for further calls _inside_ the body, right? The outermost, e.g. createParallel, still needs an alloca insertion point, which should normally be the function entry block.
The user needs to keep a stack for all alloca insertion points and push/pop them as the body callbacks are entered/left.
In MLIR, this will be the actual call stack of nested calls to convertOmpParallel, but yes, it needs to carry the insertion point as extra argument.
mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp | ||
---|---|---|
130 | @kiranchandramohan have we added the support for nested omp.parallel yet? Where should we place allocations in the nested case, in the AllocaIP of the body of the surrounding parallel operation? I can just condition this new behavior to only work for the outermost omp.parallel because that's the case I am hitting right now...
Why? It's definitely not FunctionLike (it isn't a symbol, there are no arguments and results, etc.). AutomaticAllocationScope is useless at this point, we don't have any special handling for such operations. @shabalin was trying to add more generic allocation scoping, but was blocked on discussions.
@kumasento, it's LLVM IR, the parent of a block is always a function. | |
140 | It was wrong before. bodyGenStatus can only be modified inside createParallel that actually calls bodyGenCB, but the code was checking it _before_ and ignoring it after the modification. | |
mlir/test/Target/LLVMIR/openmp-llvm.mlir | ||
158–159 | We are checking that these are placed before the icmp which is the first instruction in the function. Should be enough, and there is no block number of the entry block and no block separation after this. |
mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp | ||
---|---|---|
130 | Yes, nested parallel operations work. We have tests also. But the allocaIP issue is not solved. https://github.com/llvm/llvm-project/blob/3d974ac9fc489ac3fec194f324be55e42d1ea4fa/mlir/test/Target/LLVMIR/openmp-llvm.mlir#L230
This is OK with me for now.
In the OpenMP dialect (and possibly others like FIR) we have to distinguish operations with regions which will be outlined (like parallel, task) and which will not (like master, single) be outlined. The operations which will be outlined have some properties similar to functions, like having an entry-block for allocas and finally existing as a function in LLVM IR. Ideally, I want a function (getAllocaBlock) which if I call will traverse up through the nested operations and get me the entry-block of the operation which will be outlined or if there is no operation like that then the function entry-block. |
If I remember correctly there are two allocaIPs, outerAllocaIP which is passed to the createParallel function of the OpenMPIRBuilder and the innerAllocaIP which is the allocaIP of the outlined region which the OpenMPIRBuilder creates. The BodyGenCallback function is called with the innerAllocaIP as argument.
You are correct that for the outermost OpenMP Operation it is the function entry block that should be the allocaIP. But for any nested Operations it should be the allocaIP provided by the BodyGenCallback that should be the AllocaIP.
We alos need some changes so that there is only a single allocaBlock. I believe we need to either,
- Set the block corresponding to the entry-block of the omp.parallel operation as the innerAllocaIP provided by the OpenMPIRBuilder in the bodyGenCallBack. https://github.com/llvm/llvm-project/blob/ab5823867c4aee7f3e02ddfaa217905c87471bf9/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp#L40
- Reserve the entry-block of the omp.parallel operation in MLIR for allocas and pass it to the OpenMPIRBuilder in the createParallel Function. And change the OpenMPIRBuilder to work with this as the inner-alloca entry block of the outlined region.
Right. I was expecting lowering should keep a stack, each new function pushes the entry block, each time a operator gives you a new alloca-IP you push/pop it appropriately.
The user needs to keep a stack for all alloca insertion points and push/pop them as the body callbacks are entered/left.
In MLIR, this will be the actual call stack of nested calls to convertOmpParallel, but yes, it needs to carry the insertion point as extra argument.
I assume some extra state is preferable as there might be various places, called in arbitrary nesting, that need to push/pop a alloca-IP. If you prefer an
argument passed through all of lowering instead of a "global stack", that should work too. However, alloca-IP probably will influence, and be influenced, by
non-OpenMP operators too.
I can just condition this new behavior to only work for the outermost omp.parallel because that's the case I am hitting right now...
This is OK with me for now.
Just re-iterating that special casing this only for the outermost omp.parallel is fine with me.
Yeah, the trick is to do this in a sufficiently modular way for MLIR. That is, the main translation engine must not care about OpenMP specifically. And the rest of the dialects are not concerned with regions (FWIW, I would have required regions to be flattened and the OpenMP dialect to be translated to the LLVM dialect hadn't there been the argument of code reused with OpenMPIRBuilder) or alloca insertion points, all that should happen before translation. I defined a globally visible stack, it is currently only used by the OpenMP dialect, but if others need it, we can factor out the OpenMP stack frame class to be a common alloca insertion point stack frame class.
mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp | ||
---|---|---|
130 |
It's LLVM IR, getParent on a Block is always a function. | |
130 |
Now it is.
This sounds like you want an interface for outlinable OpenMP dialect ops. That is perfectly doable, it just isn't FunctionLike but a different interface.
This can be done with the interface above. Just go up the region tree until you hit an op that is either OutlineableOpenMPOp or an LLVMFuncOp, and take the first block of the first region. For translation purposes, this isn't necessary though. We can keep the stack as @jdoerfert suggested. |
LGTM. This looks great. More general than what I thought it would be.
mlir/include/mlir/Target/LLVMIR/ModuleTranslation.h | ||
---|---|---|
194 ↗ | (On Diff #343663) | Nit: spelling operations, ModuleTranslation. |
^^ allocaIP is not used but should be.