This is an archive of the discontinued LLVM Phabricator instance.

[MLIR][OpenMP][SCF] Mark parallel regions as allocation scopes
ClosedPublic

Authored by wsmoses on Feb 14 2022, 10:54 AM.

Details

Summary

MLIR has the notion of allocation scopes, which specify that stack allocations (e.g. memref.alloca, llvm.alloca) are freed at the end of the corresponding region, or equivalently, are no longer available past it.
Currently, neither OpenMP parallel nor SCF parallel regions have such a scope.
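Functions are already such scopes. As a minimal sketch (the function name and buffer shape are made up for illustration):

func.func @f() {
  // func.func is an allocation scope: %tmp's stack slot is freed when
  // @f returns, not before.
  %tmp = memref.alloca() : memref<4xf32>
  return
}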

This clearly makes sense for an OpenMP parallel, as it is implemented with a new function into which the region is outlined, and any allocations in that newly outlined function have a lifetime that, by definition, ends at the return of the function.

While scf.parallel does not come with a guaranteed runtime it is implemented with, this similarly makes sense for scf.parallel: otherwise, an allocation within an scf.parallel will needlessly continue to occupy stack memory that isn't cleaned up until the function (or other allocation-scope op) containing the scf.parallel returns. This means it is impossible to represent thread- or iteration-local memory without causing a stack blow-up. In the case that this stack blow-up behavior is intended, it can be equivalently represented with an allocation outside of the scf.parallel whose size equals the number of iterations.
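For instance, a sketch of the iteration-local pattern this enables (the function, bound %n, and buffer shape are hypothetical):

func.func @iteration_local(%n: index) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  scf.parallel (%i) = (%c0) to (%n) step (%c1) {
    // With scf.parallel as an allocation scope, each iteration's %buf is
    // freed when that iteration's region ends, so stack usage stays
    // constant instead of growing with the trip count.
    %buf = memref.alloca() : memref<16xi64>
  }
  return
}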

Diff Detail

Event Timeline

wsmoses created this revision. Feb 14 2022, 10:54 AM
wsmoses requested review of this revision. Feb 14 2022, 10:54 AM

This seems like something to handle uniformly across the SCF constructs, I think.
The question in general is: should a stack allocation be able to escape this construct? I would think that none of the SCF constructs would want this.

I definitely think there's justification for the other SCF ops to have this property, but I think this can be separate from the parallel ops question, where without this it is impossible to represent a truly thread-local allocation.

wsmoses updated this revision to Diff 408596. Feb 14 2022, 1:26 PM

Add Affine parallel and GPU launch parallel regions.

I don't quite get why it is different for scf.parallel than it is for scf.for. The argument you had was "scf.parallel will needlessly continue to allocate stack memory that isn't cleaned up until the function (or other allocation scope op) which contains the scf.parallel returns", but isn't the same true for all iterations of scf.for (or scf.while)?

Why is there any specificity to the threading aspect of it?

Oh, the logic as to why this is desirable as the semantic default still applies to scf.for, and so on.

The only use case for behavior other than this is one where the stack allocation escapes the broader op, and generic SCF ops could conceivably have users relying on that.

For most parallel operations, these semantics are not even possible (e.g. stack allocations in an omp.parallel cannot persist into the containing MLIR function, because the omp.parallel is implemented with an outlined function; similarly for a gpu.launch).

Since there's the possibility of some use in other areas, my thought was to begin with the places where escaping allocations make little sense, though I'm completely happy to add this elsewhere.

My concern with only doing the scf.parallel operation is that we do lower it to scf.for loops in some cases, and we would have to model this behavior there somehow. How would we do this? Do we have a suitable high-level abstraction that allows us to insert an allocation scope? Alternatively, we could mark the scf.for loop the same way. That seems a reasonable choice, but scf.for supports returning values, so we might actually have escaping stack allocations currently. Is this a model we want to support?
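To illustrate the concern about scf.for returning values, a hypothetical example (names and types made up) where a stack allocation escapes the loop through iter_args:

func.func @escape(%n: index, %init: memref<i64>) -> memref<i64> {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  // Each iteration yields its own alloca; the last one escapes the loop.
  %last = scf.for %i = %c0 to %n step %c1
      iter_args(%prev = %init) -> (memref<i64>) {
    %buf = memref.alloca() : memref<i64>
    scf.yield %buf : memref<i64>
  }
  // %last refers to a stack slot created in the loop's final iteration.
  return %last : memref<i64>
}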

So, to move this forward, we either have to extend scf.for in the same way or come up with a lowering strategy from scf.parallel to scf.for that captures this behavior.

There is a scoping op https://mlir.llvm.org/docs/Dialects/MemRef/#memrefalloca_scope-mlirmemrefallocascopeop if we need to use it in lowerings.
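For reference, a minimal sketch of that op in use (the body is hypothetical):

memref.alloca_scope {
  // Allocations made here are freed when the scope's region ends,
  // rather than at the enclosing function's return.
  %scratch = memref.alloca() : memref<16xi64>
}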

This being said, I think we can also construct an example that escapes an alloca'ed value from scf.parallel (either store it in memory, or have a sort of "assignment reduction").


An example of this would be:

%escaping = memref.alloca(%N) : memref<?xmemref<i64>>
scf.parallel (%i) = (%c0) to (%N) step (%c1) {
  %internal = memref.alloca() : memref<i64>
  memref.store %internal, %escaping[%i] : memref<?xmemref<i64>>
}
use(%escaping)

This, however, cannot be implemented with an omp lowering, and I would argue it is already semantically invalid.

Again, the reason being that the OpenMP lowering (below) cannot preserve a stack allocation beyond an omp.parallel:

%escaping = memref.alloca(%N) : memref<?xmemref<i64>>
omp.parallel {
  omp.for %i = 0 .. N {
    %internal = memref.alloca() : memref<i64>
    memref.store %internal, %escaping[%i] : memref<?xmemref<i64>>
  }
}
use(%escaping)

becomes

__kmpc_fork(outlined, escaping);
use(escaping);

void outlined(int64_t **escaping_cap) {
  kmpc_for(&start, &end);
  for (int i = start; i < end; i++) {
    // This allocation's lifetime ends at the end of the outlined call
    // (i.e. at the end of the omp.parallel).
    int64_t *internal = alloca(sizeof(int64_t));
    escaping_cap[i] = internal;
  }
}

However, @herhut, as you say, it could be implemented by an scf.for lowering. I concur with Alex that we could use the memref.alloca_scope op when lowering to an scf.for. [1]

[1] As a minor aside, we could (but shouldn't) still use the existing scf.parallel-to-scf.for lowering and over-allocate (i.e. continue to extend the lifetime of the allocations), even though we could produce a better-optimized lowering by restricting the lifetime. That is to say, extending the lifetime of a variable should always be legal (although bad for stack size).
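A sketch of what such a scope-preserving lowering to scf.for could look like (the bounds %c0, %n, %c1 and the body are hypothetical):

scf.for %i = %c0 to %n step %c1 {
  memref.alloca_scope {
    // The per-iteration allocation from the original scf.parallel body;
    // its slot is freed at the end of each iteration's scope, matching
    // the lifetime it had under scf.parallel.
    %scratch = memref.alloca() : memref<16xi64>
  }
}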

Thanks for your reference to the scoping operation @ftynse. I had the feeling we had something like it but could not find it.

Regarding extension of lifetimes: that is a good question. For example, would it be a valid canonicalization pattern to remove an alloca-scope operation that has no alloca operations inside? Or would we keep it around in case something gets lowered to an alloca later? Is an allocation scope a property in itself that we should keep up?

We could either lower scf.parallel to an scf.for with an allocation scope if the loop contains an alloca, or we could make the lowering illegal if the scf.parallel has an alloca operation. The latter at least preserves the status quo and avoids surprises.

And, @wsmoses, I agree that extending would be correct but it changes the stack requirement quite dramatically, especially in the context of loops. So if we widen the scope during lowering from one iteration to all iterations, that would be an issue (as you also mention). In the land of large tensors, any extension of lifetime can become problematic, even without loops.

> This, however, cannot be implemented with an omp lowering, and I would argue it is already semantically invalid.
> Again, the reason being that the OpenMP lowering (below) cannot preserve a stack allocation beyond an omp.parallel:

I'm not comfortable using the lowering to OpenMP to justify the semantics of a higher-level construct, though.
In particular, that wouldn't be a good reason to differentiate scf.for and scf.parallel here; I still see these as highly similar, and I believe the argument for making them "auto allocation scopes" applies equally well to both as far as I can tell.

I mean I'm fine having it apply to scf.for as well. Presumably we would also want this to apply to affine.for, etc. Anything else?

wsmoses updated this revision to Diff 409061. Feb 15 2022, 2:31 PM

Add to affine.for and scf.for

I'm fine with this as-is, but @bondhugula should probably confirm that AffineFor can also get this right now.

ftynse accepted this revision. Feb 18 2022, 12:42 AM

I think we always intended affine loops to be allocation scopes.

This revision is now accepted and ready to land. Feb 18 2022, 12:42 AM