Page MenuHomePhabricator

[MLIR] Improve calling convention for unranked memory descriptor results.
Needs ReviewPublic

Authored by frgossen on Sep 24 2021, 4:50 PM.

Details

Summary

[MLIR] Improve calling convention for unranked memory descriptor results.

For unranked memory descriptor results, their size is not statically known due
to the inner descriptor of dynamic rank. To return such descriptors from
functions generally requires dynamic memory allocation which involves calls to
malloc and which can be expensive. To circumvent this problem, we allocate
allocate buffers on the stack that are big enough to hold the inner descriptors
up to some supported rank (max-unranked-desc-buffer-rank). If the unranked
descriptor does not exceed this rank, we can always copy it to stack-allocated
memory and avoid heap allocation entirely. Otherwise, if the rank of the
returned buffer is too big for the pre-allocated buffer, we fall back to
dynamic memory allocation. This is an optimization similar to the implementation
of an llvm::SmallVector.

Diff Detail

Event Timeline

frgossen created this revision.Sep 24 2021, 4:50 PM
frgossen requested review of this revision.Sep 24 2021, 4:50 PM
bondhugula added inline comments.
mlir/include/mlir/Conversion/LLVMCommon/LoweringOptions.h
37

Doc comment here.

mlir/lib/Conversion/LLVMCommon/Pattern.cpp
250

unsigned

271–276

I'm trying to understand the impact here when such descriptors are being allocated on the stack inside of loops: wouldn't one typically run out of stack space? (Eg. call ops inside loops.)

That's a great optimization!
Something that wasn't clear to me from the description, was that this would not put a hard limit on the max rank size: it is just really the same behavior as SmallVector. Can you maybe mention this in the description?

Something I'm not sure about yet, is the impact on the calling-convention: what you're doing here seems to make this lowering option part of the calling convention, which I think is risky and fragile. This is a nice optimization for private functions maybe, but unless we lower all the call sites and the function together in the same unit we risk these getting out-of-sync.

What may be possible would be to pass the inline size as argument separately, that way the lowering of the called function becomes entirely independent from the lowering of the callers, and various settings for the max-unranked-desc-buffer-rank becomes possible.

Basically what I'm thinking here is changing the calling convention to be friendly/resilient to this small size optimization, instead of basically making the calling convention controlled by this compiler setting.

mlir/test/Conversion/StandardToLLVM/calling-convention.mlir
2

The previous test was testing two variants, with and without emit-c-wrappers, why did we lose the distinction?

Actually, could we leave this test pristine by adding max-unranked-desc-buffer-rank=-1 (and ensuring we disable the "small size optimization in this case)?
We could then add another file that just exercises the effect of the max-unranked-desc-buffer-rank on dedicated case.

In general I feel that that we should shard files like this one into smaller tests that exercises each particular feature of the calling convention. Reviewing a change like this diff is just not possible otherwise.
That said thanks for the nice documentation inline in the test! The way you made each sections explicit is really helpful.

mehdi_amini requested changes to this revision.Sep 24 2021, 6:12 PM

(Also we'll need to ensure up-to-date doc in https://mlir.llvm.org/docs/TargetLLVMIR )

This revision now requires changes to proceed.Sep 24 2021, 6:12 PM
frgossen updated this revision to Diff 376938.Oct 4 2021, 10:11 AM
frgossen marked 4 inline comments as done.

Address comments

Thanks for the useful comments :) I see how the description did not make this clear, which is updated now.

Passing the rank (or buffer size) dynamically is in principle possible but complicates the calling convention quite a bit.
Internally, the buffers are currently passed as preceding arguments where the rank/size could be added relatively easily. For the external C functions, the buffer is currently passed within the memory descriptor, which I think is easier to read in C, but would not compose well with passing the additional rank/size argument.
Also here, I would advocate to implement this if someone actually needs it. This optimization only requires to have lowerings in sync if the max-unranked-desc-buffer-rank is set to some non-default. I see what you mean with "risky and fragile" if people play with max-unranked-desc-buffer-rank in which case they should know what they're doing.

If you are happy with the above, I will update the documentation accordingly ofc. I wish I had seen https://mlir.llvm.org/docs/TargetLLVMIR/ before, which would have helped understand all this :D.

mlir/lib/Conversion/LLVMCommon/Pattern.cpp
271–276

Yes, good point. This is a known issue with unranked memory descriptors in general. The previous calling convention and any other local unranked memory descriptor suffer from this issue.

mlir/test/Conversion/StandardToLLVM/calling-convention.mlir
2

Running the tests with and w/o emit-c-wrappers=1 just switches between looking at the llvm.emit_c_interface attribute or assuming it everywhere. The new tests use the emit_c_interface attribute everywhere to generate C interfaces only where they are tested.

Adding support to disable the desc buffer passing (max-unranked-desc-buffer-rank=-1) to fall back to the old behaviour would complicate the calling convention quite a bit imo. In one case, buffers are passed, in the other they aren't, etc. Until someone needs that, I would rather avoid this complexity. How important do you think this is?

"shard files" - Done :)

I know this is a big CL to review but I don't see a way to break it down into multiple smaller ones. If you prefer, I could land the tests separately.

frgossen edited the summary of this revision. (Show Details)Oct 4 2021, 10:13 AM
herhut accepted this revision.Oct 5 2021, 1:33 AM

Regarding @mehdi_amini comment: I agree that this adds another dimension to the ABI with C but unranked results already have a fairly complex calling convention and I don't think this makes it worse. If there are strong concerns that this will get out of sync, we could hard-wire this to a fixed rank to start with, which would prevent that issue.

If we really want to go down the route of dynamic sized pre-allocated descriptors, I would suggest we redesign the unranked descriptor to also contain the size of the pointed to buffer. That is a refactoring in its own and I suggest we do it as a follow on.

mlir/lib/Conversion/LLVMCommon/TypeConverter.cpp
136

Nit: used to avoid

Regarding @mehdi_amini comment: I agree that this adds another dimension to the ABI with C but unranked results already have a fairly complex calling convention and I don't think this makes it worse. If there are strong concerns that this will get out of sync, we could hard-wire this to a fixed rank to start with, which would prevent that issue.

If we really want to go down the route of dynamic sized pre-allocated descriptors, I would suggest we redesign the unranked descriptor to also contain the size of the pointed to buffer. That is a refactoring in its own and I suggest we do it as a follow on.

+1 we discussed this a bunch in the past but we didn't get to it.

It should be reasonably mechanical (but lenght) to make an UnrankedMemRefDescriptor flat (i.e. just the content of MemRefDescriptor + and int64_t for size and modulo some changes at the place of allocation).
It is unclear to me whether these should also support type-erasure (i.e. do we also want to put an enum for the data type?)
Codegen wouldn't need it but libraries / python interop could make use of it.

frgossen updated this revision to Diff 377127.Oct 5 2021, 2:40 AM
frgossen marked an inline comment as done.

Address comments

mehdi_amini requested changes to this revision.Oct 5 2021, 10:58 AM

I don't think this makes it worse.

Well I gave a very objective criteria that shows how this makes the calling convention work. I'd still like to look more into making this more robust, I need to get back to think about what @frgossen wrote about it:

Passing the rank (or buffer size) dynamically is in principle possible but complicates the calling convention quite a bit.
Internally, the buffers are currently passed as preceding arguments where the rank/size could be added relatively easily. For the external C functions, the buffer is currently passed within the memory descriptor, which I think is easier to read in C, but would not compose well with passing the additional rank/size argument.

But I haven't had time to do so yet, because I likely need to write an example to wrap my head around it.

This revision now requires changes to proceed.Oct 5 2021, 10:58 AM
frgossen added inline comments.Oct 5 2021, 3:39 PM
mlir/test/Conversion/StandardToLLVM/calling-convention-external-c-function-caller.mlir
307

@mehdi_amini , this could be an example that you're looking for.

311–313

@mehdi_amini , this is how the buffer is currently passed through the C interface as part of the pre-allocated result. The alternative is to pass buffer and (size or rank) as separate arguments, which I'd find less intuitive.

frgossen updated this revision to Diff 377522.Oct 6 2021, 6:07 AM

Add documentation

Friendly ping

I had another look at this, and I am still convinced that we shouldn't make calling convention parametric this way: this just does not seem like a good thing to me from a system consistency point of view.

I dug into this, but I'm not to the bottom of it yet.
In particular, it seems that the API for the c interface isn't changed (on the surface), as in the C++ signature stays the same. But I just got lost right now in the contract with the C++ code about alloc/free for these descriptors (both for when we call C++ from MLIR and when we have MLIR generated code that will be invoked from C++).

As an example, when we generate llvm.func @_mlir_ciface_bar(%arg0: !llvm.ptr<struct<(struct<(i64, ptr<i8>)>, f32, struct<(i64, ptr<i8>)>)>>) from func @bar() -> (memref<*xf32>, f32, memref<*xf32>) attributes { llvm.emit_c_interface } ; we would always malloc a new descriptor and the C++ code has to free it. Also the C++ code never needs to pass a valid pointer in I think (that is, before this revision).

mlir/test/Conversion/StandardToLLVM/calling-convention-dbg.mlir
3

(drive by comment: please use the test dialect and remove the allow-unregistered-dialect option)

But actually I think you didn't even intend to have this file here?