This is an archive of the discontinued LLVM Phabricator instance.

[mlir] OpenMP-to-LLVM: properly set outer alloca insertion point
ClosedPublic

Authored by ftynse on Apr 26 2021, 10:00 AM.

Details

Summary

Previously, the OpenMP to LLVM IR conversion was setting the alloca insertion
point to the same position as the main computation when converting OpenMP
parallel operations. This is problematic if, for example, the parallel
operation is placed inside a loop: it would keep allocating stack space on
each iteration, eventually leading to a stack overflow.
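
For illustration, a minimal sketch in plain LLVM IRBuilder terms of the pattern the fix follows (the helper name is made up; this is not the patch itself): allocas are emitted at the start of the function entry block, so a parallel region sitting inside a loop does not grow the stack on every iteration.

#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"

// Create an alloca at the start of the function's entry block instead of at
// the current (loop-body) insertion point, so the stack slot is allocated
// once per function invocation rather than once per loop iteration.
static llvm::AllocaInst *createEntryBlockAlloca(llvm::Function &func,
                                                llvm::Type *type,
                                                const llvm::Twine &name) {
  llvm::IRBuilder<> builder(&func.getEntryBlock(),
                            func.getEntryBlock().begin());
  return builder.CreateAlloca(type, /*ArraySize=*/nullptr, name);
}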

Diff Detail

Unit Tests: Failed

Event Timeline

ftynse created this revision. Apr 26 2021, 10:00 AM
ftynse requested review of this revision. Apr 26 2021, 10:00 AM

Thanks @ftynse for this patch.

mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
171

I was planning to ask a few questions in this area. Will this getParent work correctly with nested parallelism, where the if is in the nested parallel operation?
Should we be marking the omp.parallel operation with some of the function attributes like Function-Like, AutomaticAllocationScope?

llvm.func fn() {
  omp.parallel {
    omp.parallel if() {
    }
  }
}
mlir/test/Target/LLVMIR/openmp-llvm.mlir
155

Nit: funciton -> function

158–159

Is this enough to check that it is in the entry? Or should there be a check for a subsequent block name or number?

mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
171

And if the above does work correctly, consider the following.

llvm.func fn() {
  omp.parallel {
    omp.master { //<- The region of this operation will not be outlined.
      omp.parallel if() {
      }
    }
  }
}
kumasento added inline comments. Apr 27 2021, 2:45 PM
mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
171

Not familiar with this field, but I was thinking about a case similar to (or not) above: what would happen if getParent() doesn't give a function?

173

It seems to me that if bodyGenStatus failed earlier than the restoreIP, the old logic would just abort, but this new version will still do the restore. Would this cause any issue, or was the old logic actually problematic?

I skimmed this and if I get it right this won't work reliably.
The OMPIRBuilder provides an alloca insertion point as part of the body code generation callback, that is the one to use.
The user needs to keep a stack for all alloca insertion points and push/pop them as the body callbacks are entered/left.
The same has to happen for other outlined operators; I'm imagining fir.concurrent.do or something like that.
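
As a rough sketch of the push/pop discipline described above (the class and its names are hypothetical, not the actual ModuleTranslation or OMPIRBuilder interface), one way to keep such a stack is with RAII frames:

#include "llvm/IR/IRBuilder.h"
#include <cassert>
#include <vector>

// Stack of alloca insertion points; the innermost entry is the one new
// allocas should use.
class AllocaIPStack {
public:
  using InsertPoint = llvm::IRBuilderBase::InsertPoint;

  // RAII frame: pushed when entering a body callback, popped when leaving it.
  class Frame {
  public:
    Frame(AllocaIPStack &stack, InsertPoint ip) : stack(stack) {
      stack.stack.push_back(ip);
    }
    ~Frame() { stack.stack.pop_back(); }

  private:
    AllocaIPStack &stack;
  };

  // Innermost alloca insertion point currently in scope.
  InsertPoint current() const {
    assert(!stack.empty() && "no alloca insertion point pushed");
    return stack.back();
  }

private:
  std::vector<InsertPoint> stack;
};

A body callback would then construct a Frame with the alloca IP it receives, and any conversion code nested below it would query current().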

mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
92

^^ allocaIP is not used but should be.

ftynse marked 2 inline comments as done. Apr 28 2021, 5:11 AM
ftynse added a subscriber: shabalin.

The OMPIRBuilder provides an alloca insertion point as part of the body code generation callback, that is the one to use.

Let me clarify, the alloca insertion point given to the body generation callback should be used for further calls _inside_ the body, right? The outermost, e.g. createParallel, still needs an alloca insertion point, which should normally be the function entry block.

The user needs to keep a stack for all alloca insertion points and push/pop them as the body callbacks are entered/left.

In MLIR, this will be the actual call stack of nested calls to convertOmpParallel, but yes, it needs to carry the insertion point as extra argument.
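
To make that concrete, here is a simplified and purely hypothetical sketch (the real convertOmpParallel and createParallel signatures differ) of how the recursion would thread the insertion point through: the outermost call starts from the function entry, and each nested call reuses the inner alloca IP produced for the outlined body.

#include "llvm/IR/IRBuilder.h"
#include <vector>

using InsertPoint = llvm::IRBuilderBase::InsertPoint;

// Hypothetical stand-in for a nested omp.parallel construct.
struct ParallelNode {
  std::vector<ParallelNode> nested;
};

// Placeholder for the builder call that outlines the body; the real
// OpenMPIRBuilder reports the alloca IP of the outlined region through the
// body-generation callback. Here we just pass the IP through.
static InsertPoint emitParallel(const ParallelNode &, InsertPoint allocaIP) {
  return allocaIP;
}

// Nested conversions carry the alloca insertion point as an extra argument.
static void convertParallel(const ParallelNode &node, InsertPoint allocaIP) {
  InsertPoint innerAllocaIP = emitParallel(node, allocaIP);
  for (const ParallelNode &child : node.nested)
    convertParallel(child, innerAllocaIP);
}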

mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
171

@kiranchandramohan have we added the support for nested omp.parallel yet? Where should we place allocations in the nested case, in the AllocaIP of the body of the surrounding parallel operation? I can just condition this new behavior to only work for the outermost omp.parallel because that's the case I am hitting right now...

Should we be marking the omp.parallel operation with some of the function attributes like Function-Like, AutomaticAllocationScope?

Why? It's definitely not FunctionLike (it isn't a symbol, there are no arguments and results, etc.). AutomaticAllocationScope is useless at this point, we don't have any special handling for such operations. @shabalin was trying to add more generic allocation scoping, but was blocked on discussions.

Not familiar with this field, but I was thinking about a case similar to (or not) above: what would happen if getParent() doesn't give a function?

@kumasento, it's LLVM IR; the parent of a block is always a function.

173

It was wrong before. bodyGenStatus can only be modified inside createParallel, which actually calls bodyGenCB, but the code was checking it _before_ the call and ignoring it after the modification.
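
A schematic illustration of the ordering (names are made up; this is not the actual translation code): the failure flag can only change while the body callback runs, so it has to be checked after the call that invokes the callback.

#include <functional>

enum class Status { Success, Failure };

// Stand-in for the builder call that eventually runs the body callback.
static void createParallelLike(const std::function<void()> &bodyGenCB) {
  bodyGenCB();
}

static Status convertParallelLike() {
  Status bodyGenStatus = Status::Success;
  auto bodyGenCB = [&bodyGenStatus] {
    // Body conversion may fail and record it here.
    bodyGenStatus = Status::Failure;
  };
  createParallelLike(bodyGenCB);
  // Only meaningful now: checking before the call would always see Success.
  return bodyGenStatus;
}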

mlir/test/Target/LLVMIR/openmp-llvm.mlir
158–159

We are checking that these are placed before the icmp, which is the first instruction in the function. That should be enough: the entry block has no block number, and there is no block separation after this.

mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
171

Yes, nested parallel operations work. We have tests also. But the allocaIP issue is not solved. https://github.com/llvm/llvm-project/blob/3d974ac9fc489ac3fec194f324be55e42d1ea4fa/mlir/test/Target/LLVMIR/openmp-llvm.mlir#L230

I can just condition this new behavior to only work for the outermost omp.parallel because that's the case I am hitting right now...

This is OK with me for now.

Where should we place allocations in the nested case, in the AllocaIP of the body of the surrounding parallel operation?
Why? It's definitely not FunctionLike (it isn't a symbol, there are no arguments and results, etc.). AutomaticAllocationScope is useless at this point, we don't have any special handling for such operations. @shabalin was trying to add more generic allocation scoping, but was blocked on discussions.

In the OpenMP dialect (and possibly others like FIR) we have to distinguish operations with regions that will be outlined (like parallel, task) from those that will not (like master, single). The operations that will be outlined have some properties similar to functions, like having an entry block for allocas and eventually existing as a function in LLVM IR. Ideally, I want a function (getAllocaBlock) which, when called, traverses up through the nested operations and gives me the entry block of the operation that will be outlined, or the function entry block if there is no such operation.

The OMPIRBuilder provides an alloca insertion point as part of the body code generation callback, that is the one to use.

Let me clarify, the alloca insertion point given to the body generation callback should be used for further calls _inside_ the body, right? The outermost, e.g. createParallel, still needs an alloca insertion point, which should normally be the function entry block.

The user needs to keep a stack for all alloca insertion points and push/pop them as the body callbacks are entered/left.

In MLIR, this will be the actual call stack of nested calls to convertOmpParallel, but yes, it needs to carry the insertion point as extra argument.

If I remember correctly there are two allocaIPs: the outerAllocaIP, which is passed to the createParallel function of the OpenMPIRBuilder, and the innerAllocaIP, which is the allocaIP of the outlined region that the OpenMPIRBuilder creates. The BodyGenCallback function is called with the innerAllocaIP as argument.
You are correct that for the outermost OpenMP operation the function entry block should be the allocaIP. But for any nested operation, the allocaIP provided by the BodyGenCallback should be used.

We also need some changes so that there is only a single allocaBlock. I believe we need to do one of the following:

  1. Set the block corresponding to the entry-block of the omp.parallel operation as the innerAllocaIP provided by the OpenMPIRBuilder in the bodyGenCallBack. https://github.com/llvm/llvm-project/blob/ab5823867c4aee7f3e02ddfaa217905c87471bf9/mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp#L40
  2. Reserve the entry-block of the omp.parallel operation in MLIR for allocas and pass it to the OpenMPIRBuilder in the createParallel function, changing the OpenMPIRBuilder to work with this as the inner-alloca entry block of the outlined region.

The OMPIRBuilder provides an alloca insertion point as part of the body code generation callback, that is the one to use.

Let me clarify, the alloca insertion point given to the body generation callback should be used for further calls _inside_ the body, right? The outermost, e.g. createParallel, still needs an alloca insertion point, which should normally be the function entry block.

Right. I was expecting that the lowering would keep a stack: each new function pushes its entry block, and each time an operator gives you a new alloca-IP you push/pop it appropriately.

The user needs to keep a stack for all alloca insertion points and push/pop them as the body callbacks are entered/left.

In MLIR, this will be the actual call stack of nested calls to convertOmpParallel, but yes, it needs to carry the insertion point as extra argument.

I assume some extra state is preferable, as there might be various places, called in arbitrary nesting, that need to push/pop an alloca-IP. If you prefer an argument passed through all of lowering instead of a "global stack", that should work too. However, the alloca-IP probably will influence, and be influenced by, non-OpenMP operators too.

kiranchandramohan requested changes to this revision. May 5 2021, 6:12 AM

I can just condition this new behavior to only work for the outermost omp.parallel because that's the case I am hitting right now...

This is OK with me for now.

Just re-iterating that special casing this only for the outermost omp.parallel is fine with me.

This revision now requires changes to proceed. May 5 2021, 6:12 AM
ftynse updated this revision to Diff 343662. May 7 2021, 6:02 AM
ftynse marked 2 inline comments as done.

Handle nested scopes.

ftynse marked 4 inline comments as done. May 7 2021, 6:12 AM

The OMPIRBuilder provides an alloca insertion point as part of the body code generation callback, that is the one to use.

Let me clarify, the alloca insertion point given to the body generation callback should be used for further calls _inside_ the body, right? The outermost, e.g. createParallel, still needs an alloca insertion point, which should normally be the function entry block.

Right. I was expecting that the lowering would keep a stack: each new function pushes its entry block, and each time an operator gives you a new alloca-IP you push/pop it appropriately.

The user needs to keep a stack for all alloca insertion points and push/pop them as the body callbacks are entered/left.

In MLIR, this will be the actual call stack of nested calls to convertOmpParallel, but yes, it needs to carry the insertion point as extra argument.

I assume some extra state is preferable, as there might be various places, called in arbitrary nesting, that need to push/pop an alloca-IP. If you prefer an argument passed through all of lowering instead of a "global stack", that should work too. However, the alloca-IP probably will influence, and be influenced by, non-OpenMP operators too.

Yeah, the trick is to do this in a sufficiently modular way for MLIR. That is, the main translation engine must not care about OpenMP specifically, and the rest of the dialects are not concerned with regions or alloca insertion points; all of that should happen before translation. (FWIW, I would have required regions to be flattened and the OpenMP dialect to be translated to the LLVM dialect had it not been for the argument of code reuse with the OpenMPIRBuilder.) I defined a globally visible stack; it is currently only used by the OpenMP dialect, but if others need it, we can factor out the OpenMP stack frame class into a common alloca insertion point stack frame class.
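
A hedged sketch of the shape of such a "globally visible stack" (all names here are illustrative, not the actual ModuleTranslation interface): the translation state owns a stack of type-erased frames, the OpenMP translation pushes a frame carrying its alloca insertion point, and the engine itself never needs to know about OpenMP.

#include "llvm/IR/IRBuilder.h"
#include <memory>
#include <vector>

// Base class for dialect-specific translation state frames.
struct TranslationStackFrame {
  virtual ~TranslationStackFrame() = default;
};

// OpenMP-specific frame carrying the current alloca insertion point.
struct OpenMPAllocaFrame : TranslationStackFrame {
  explicit OpenMPAllocaFrame(llvm::IRBuilderBase::InsertPoint ip)
      : allocaInsertPoint(ip) {}
  llvm::IRBuilderBase::InsertPoint allocaInsertPoint;
};

// Dialect-agnostic stack owned by the translation state.
struct TranslationState {
  std::vector<std::unique_ptr<TranslationStackFrame>> frames;

  // Find the innermost frame of a given type, if any. dynamic_cast keeps the
  // sketch short; a real implementation would likely use LLVM-style RTTI.
  template <typename FrameT>
  FrameT *findInnermost() {
    for (auto it = frames.rbegin(); it != frames.rend(); ++it)
      if (auto *frame = dynamic_cast<FrameT *>(it->get()))
        return frame;
    return nullptr;
  }
};

Factoring the frame type out of the OpenMP translation would then amount to moving the alloca-IP frame somewhere common.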

mlir/lib/Target/LLVMIR/Dialect/OpenMP/OpenMPToLLVMIRTranslation.cpp
171

Not familiar with this field, but I was thinking about a case similar to (or not) above: what would happen if getParent() doesn't give a function?

It's LLVM IR; getParent on a Block is always a function.

171

Yes, nested parallel operations work. We have tests also. But the allocaIP issue is not solved.

Now it is.

In the OpenMP dialect (and possibly others like FIR) we have to distinguish operations with regions that will be outlined (like parallel, task) from those that will not (like master, single). The operations that will be outlined have some properties similar to functions, like having an entry block for allocas and eventually existing as a function in LLVM IR.

This sounds like you want an interface for outlinable OpenMP dialect ops. That is perfectly doable, it just isn't FunctionLike but a different interface.

Ideally, I want a function (getAllocaBlock) which, when called, traverses up through the nested operations and gives me the entry block of the operation that will be outlined, or the function entry block if there is no such operation.

This can be done with the interface above. Just go up the region tree until you hit an op that is either OutlineableOpenMPOp or an LLVMFuncOp, and take the first block of the first region. For translation purposes, this isn't necessary though. We can keep the stack as @jdoerfert suggested.
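
For completeness, a sketch of what that walk could look like, assuming a hypothetical isOutlineable(op) predicate for ops such as omp.parallel whose region gets outlined (this is not the actual upstream helper):

#include "mlir/Dialect/LLVMIR/LLVMDialect.h"
#include "mlir/IR/Block.h"
#include "mlir/IR/Operation.h"

// Hypothetical predicate: true for region-holding ops whose region is
// outlined into its own LLVM function (e.g. omp.parallel, omp.task).
bool isOutlineable(mlir::Operation *op);

static mlir::Block *getAllocaBlock(mlir::Operation *op) {
  // Walk up the region tree until we hit an outlineable op or the enclosing
  // LLVM function, and take the entry block of its first region (assumed
  // non-empty here).
  for (mlir::Operation *parent = op->getParentOp(); parent;
       parent = parent->getParentOp()) {
    if (isOutlineable(parent) || llvm::isa<mlir::LLVM::LLVMFuncOp>(parent))
      return &parent->getRegion(0).front();
  }
  return nullptr;
}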

ftynse marked 2 inline comments as done. May 7 2021, 6:12 AM

I can just condition this new behavior to only work for the outermost omp.parallel because that's the case I am hitting right now...

This is OK with me for now.

Just re-iterating that special casing this only for the outermost omp.parallel is fine with me.

Sorry, I was busy with other things.

ftynse updated this revision to Diff 343663. May 7 2021, 6:18 AM
ftynse marked an inline comment as done.

Nits.

kiranchandramohan accepted this revision. May 9 2021, 2:22 PM

LGTM. This looks great. More general than what I thought it would be.

mlir/include/mlir/Target/LLVMIR/ModuleTranslation.h
194

Nit: spelling operations, ModuleTranslation.

This revision is now accepted and ready to land. May 9 2021, 2:22 PM
ftynse updated this revision to Diff 343990. May 10 2021, 1:04 AM

Address review.

This revision was landed with ongoing or failed builds. May 10 2021, 1:05 AM
This revision was automatically updated to reflect the committed changes.