This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
llvm/
-
lib/Transforms/Coroutines/
-
Transforms/
-
Coroutines/
-
CoroFrame.cpp
-
test/Transforms/Coroutines/
-
Transforms/
-
Coroutines/
-
coro-alloca-byval-param.ll

Differential D101980

[RFC] [Coroutines] Put the contents of byval argument to the frame
AbandonedPublic

Authored by ChuanqiXu on May 6 2021, 1:08 AM.

Download Raw Diff

Details

Reviewers

lxfind
rjmccall
junparser
ychen

Summary

This patch solves bug 48857.

From my perspective, the root cause for problem bug 48857 is that the arguments marked with byval wouldn't be put to the frame actually. Consider the following example:

void foo() {
     A a;
     // Some works with a
     coro_func(a);
     // Other works with a
}

Every C++ programmer would believe there would be an copy for a in the frame of coro_func. From 9.5.4 of N4878, we can found:

When a coroutine is invoked, after initializing its parameters (7.6.1.3), a copy is created for each coroutine
parameter. For a parameter of type cv T, the copy is a variable of type cv T with automatic storage duration
that is direct-initialized from an xvalue of type T referring to the parameter.

However, there would be only a pointer to type A in the frame. The problems comes from that LLVM would copy the byval value in the caller site instead of callee site. In other words, the IR for the example above would be:

void foo() {
     %a = Alloca A
     // Some works with %a
     %a.byval-temp = Alloca A
     memcpy(%a.byval-temp, %a)
     coro_func(%a.byval-temp);
     // Other works with %a
}

The actual type for the argument of coro_func would be A* marked with byval.

This patch tries to solve this by copying the byval arguments in the callee site if the callee is a coroutine. Now the semantic looks more consistent with the standard of C++. And I believe it should be good for Swift and MLIR. However, the down side for this approach is the extra copy. It is normal that this would make us worry about the performance. It shows that we lack a benchmark for coroutine again. But now, I could only argue for that it wouldn't hurt the performance much since the extra copy happens after the previous one and it should hit the cache.

BTW, I tried to modify the front end to make the coroutine special again - if the callee is a coroutine, we should copy the byval argument in the callee side instead of caller side.
However, the frontend looks complex and we would make coroutine violates the principle of LLVM again.

Test-Plan: check-llvm, check-clang

Diff Detail

Event Timeline

ChuanqiXu created this revision.May 6 2021, 1:08 AM

Herald added subscribers: rriddle, hiraditya. · View Herald TranscriptMay 6 2021, 1:08 AM

ChuanqiXu requested review of this revision.May 6 2021, 1:08 AM

Herald added a project: Restricted Project. · View Herald TranscriptMay 6 2021, 1:08 AM

Herald added subscribers: llvm-commits, stephenneuendorffer. · View Herald Transcript

Harbormaster completed remote builds in B102933: Diff 343317.May 6 2021, 2:05 AM

How is the situation with byval different from the situation with indirect arguments? It sounds like we generally need to perform this extra copy into the coroutine frame once it's allocated, and yes, that's something we need to emit in the frontend because it might be non-trivial.

Or is the problem that LLVM might be helpfully eliminating the copy when the source is a byval parameter? But that's also something it could be doing with indirect arguments, so again, treating byval specially seems suspect.

In D101980#2743665, @rjmccall wrote:

How is the situation with byval different from the situation with indirect arguments? It sounds like we generally need to perform this extra copy into the coroutine frame once it's allocated, and yes, that's something we need to emit in the frontend because it might be non-trivial.

Or is the problem that LLVM might be helpfully eliminating the copy when the source is a byval parameter? But that's also something it could be doing with indirect arguments, so again, treating byval specially seems suspect.

Sorry. May I ask what does indirect argument mean? I tried to copy all byval parameters since it looks like if the frontend needs to pass parameters by value for structs, it would emit byval attributes.

Fix format problems.

Harbormaster completed remote builds in B103287: Diff 343798.May 7 2021, 8:01 PM

In D101980#2743665, @rjmccall wrote:

How is the situation with byval different from the situation with indirect arguments? It sounds like we generally need to perform this extra copy into the coroutine frame once it's allocated, and yes, that's something we need to emit in the frontend because it might be non-trivial.

Or is the problem that LLVM might be helpfully eliminating the copy when the source is a byval parameter? But that's also something it could be doing with indirect arguments, so again, treating byval specially seems suspect.

I suspect inalloca might have the similar problem. In both cases, the arguments have the hidden copy on the stack which might have a shorter lifetime than coroutines.

In D101980#2747209, @ychen wrote:

In D101980#2743665, @rjmccall wrote:

How is the situation with byval different from the situation with indirect arguments? It sounds like we generally need to perform this extra copy into the coroutine frame once it's allocated, and yes, that's something we need to emit in the frontend because it might be non-trivial.

Or is the problem that LLVM might be helpfully eliminating the copy when the source is a byval parameter? But that's also something it could be doing with indirect arguments, so again, treating byval specially seems suspect.

I suspect inalloca might have the similar problem. In both cases, the arguments have the hidden copy on the stack which might have a shorter lifetime than coroutines.

From the document, it looks you are right. And so the preallocated arguments. I would check if the current implementation hold for inalloca and preallocated arguments.

ChuanqiXu added a reviewer: ychen.May 10 2021, 2:21 AM

In D101980#2743798, @ChuanqiXu wrote:

In D101980#2743665, @rjmccall wrote:

How is the situation with byval different from the situation with indirect arguments? It sounds like we generally need to perform this extra copy into the coroutine frame once it's allocated, and yes, that's something we need to emit in the frontend because it might be non-trivial.

Or is the problem that LLVM might be helpfully eliminating the copy when the source is a byval parameter? But that's also something it could be doing with indirect arguments, so again, treating byval specially seems suspect.

Sorry. May I ask what does indirect argument mean? I tried to copy all byval parameters since it looks like if the frontend needs to pass parameters by value for structs, it would emit byval attributes.

An indirect argument is one that's implicitly passed by reference. For example, most x86-64 ABIs will pass a struct like struct { char x[1024]; }; as a pointer to a temporary instead of on the stack. Normally, functions that take such arguments just bind the parameter variable directly to the pointer that's passed in; if that's a problem for coroutines, you're going to need to force a copy there as well.

Indirect arguments are also used (on pretty much every target except Windows) for non-trivially-copyable class types. That gets tricky because unfortunately you can't just emit a correct move-construction during coroutine lowering; we'll have to do it in the frontend, if we aren't already.

That brings up a related problem: there are LLVM optimizations that assume things about parameters during a function. For example, the pointers passed as indirect arguments are generally assumed not to alias anything, which means that we can generally eliminate redundant things like, say, copies out of them that are perfectly happy to just use the original memory. That's not a problem for copies emitted during coroutine lowering, since we'll immediately split the function and make the coroutine non-redundant from LLVM's perspective, but I am worried that we might trigger these optimizations on copies we emit earlier, and that could mess things up.

For example, consider code like this:

struct A {
  char buffer[32];
  A(const A &);
};

A::A(const A &other) {
  memcpy(buffer, other.buffer, sizeof(buffer));
}

my_coroutine foo(A a) {
  ...
}

If we emit the copy of a in the frontend — which I think we have no choice but to do — it'll generate IR something like this:

define ... @foo(%struct.A* %0) {
  %a = alloca %struct.A
  call void @llvm.memcpy(%a, %0, ...)
  ...
}

If the LLVM optimizer sees this, it very well might just remove the memcpy and use %0 everywhere instead of %a, and then we'll miscompile the coroutine. I'm not sure of the best way to fix this; maybe we just have to disable those optimizations on unlowered coroutines.

I'm asking these questions, not to insist that we fix them all at once, but to make sure that we're setting ourselves up to fix them all and not just doing something short-term that ends up needing to be completely replaced with a more comprehensive fix. The central design questions to me are (1) do we do this during coro splitting or in the frontend and (2) to the extent that we don't do it solely during coro splitting, how do we protect against miscompilation?

In D101980#2749774, @rjmccall wrote:
In D101980#2743798, @ChuanqiXu wrote:

In D101980#2743665, @rjmccall wrote:

How is the situation with byval different from the situation with indirect arguments? It sounds like we generally need to perform this extra copy into the coroutine frame once it's allocated, and yes, that's something we need to emit in the frontend because it might be non-trivial.

Or is the problem that LLVM might be helpfully eliminating the copy when the source is a byval parameter? But that's also something it could be doing with indirect arguments, so again, treating byval specially seems suspect.

Sorry. May I ask what does indirect argument mean? I tried to copy all byval parameters since it looks like if the frontend needs to pass parameters by value for structs, it would emit byval attributes.

An indirect argument is one that's implicitly passed by reference. For example, most x86-64 ABIs will pass a struct like struct { char x[1024]; }; as a pointer to a temporary instead of on the stack. Normally, functions that take such arguments just bind the parameter variable directly to the pointer that's passed in; if that's a problem for coroutines, you're going to need to force a copy there as well.

Indirect arguments are also used (on pretty much every target except Windows) for non-trivially-copyable class types. That gets tricky because unfortunately you can't just emit a correct move-construction during coroutine lowering; we'll have to do it in the frontend, if we aren't already.

That brings up a related problem: there are LLVM optimizations that assume things about parameters during a function. For example, the pointers passed as indirect arguments are generally assumed not to alias anything, which means that we can generally eliminate redundant things like, say, copies out of them that are perfectly happy to just use the original memory. That's not a problem for copies emitted during coroutine lowering, since we'll immediately split the function and make the coroutine non-redundant from LLVM's perspective, but I am worried that we might trigger these optimizations on copies we emit earlier, and that could mess things up.

For example, consider code like this:
struct A {
  char buffer[32];
  A(const A &);
};

A::A(const A &other) {
  memcpy(buffer, other.buffer, sizeof(buffer));
}

my_coroutine foo(A a) {
  ...
}
If we emit the copy of a in the frontend — which I think we have no choice but to do — it'll generate IR something like this:
define ... @foo(%struct.A* %0) {
  %a = alloca %struct.A
  call void @llvm.memcpy(%a, %0, ...)
  ...
}
If the LLVM optimizer sees this, it very well might just remove the memcpy and use %0 everywhere instead of %a, and then we'll miscompile the coroutine. I'm not sure of the best way to fix this; maybe we just have to disable those optimizations on unlowered coroutines.

I'm asking these questions, not to insist that we fix them all at once, but to make sure that we're setting ourselves up to fix them all and not just doing something short-term that ends up needing to be completely replaced with a more comprehensive fix. The central design questions to me are (1) do we do this during coro splitting or in the frontend and (2) to the extent that we don't do it solely during coro splitting, how do we protect against miscompilation?

Thanks for the detailed explanation! I still had some questions.

An indirect argument is one that's implicitly passed by reference

Do you mean by value here? If an argument is passed by reference, it makes no sense the coroutine need to copy the argument by value. Since the reference is pointer really. And I think it is responsibility for the programmers to maintain the lifetime for the pointer. And the programmers should agree on there is only pointer in the coroutine frame.

For example, most x86-64 ABIs will pass a struct like struct { char x[1024]; }; as a pointer to a temporary instead of on the stack. Normally, functions that take such arguments just bind the parameter variable directly to the pointer that's passed in; if that's a problem for coroutines, you're going to need to force a copy there as well.

From the statement, it looks like it is talking about passing argument by value. So I think it means there are some arguments who are passed by value in the language side but didn't be marked with byval. Did I get this? If it is true, I agree with that I need to find another method to handle it.

Indirect arguments are also used (on pretty much every target except Windows) for non-trivially-copyable class types. That gets tricky because unfortunately you can't just emit a correct move-construction during coroutine lowering; we'll have to do it in the frontend, if we aren't already.

For the case of move-construction, I think the current patch could handle it if the argument is marked with byval. For example:

coro_type foo(A); // A is move-constructible.
void bar() {
    A  a;
   // ... some operation
   foo(std::move(A));
}

And the corresponding IR should look like:

void bar() {
    %a = alloca A
   ; ... some operation
   %tmp = alloca A
   ; move-construct %tmp
   foo(%tmp)
}
coro_type foo(A* [byval] %a) {
   ; if a is marked with byval
   %tmp.a.byval = alloca A
   memcpy(%tmp.a.byval, %a)
}

It looks OK to me except the extra copy. The move-construction is already handled by the frontend.

That brings up a related problem: there are LLVM optimizations that assume things about parameters during a function. For example, the pointers passed as indirect arguments are generally assumed not to alias anything, which means that we can generally eliminate redundant things like, say, copies out of them that are perfectly happy to just use the original memory. That's not a problem for copies emitted during coroutine lowering, since we'll immediately split the function and make the coroutine non-redundant from LLVM's perspective, but I am worried that we might trigger these optimizations on copies we emit earlier, and that could mess things up.

I am not sure if I get your point right here. From my understanding, this paragraph and the following example tells about the problem we may met if we try to emit the copy for the argument passed by value in the callee site if the callee is a coroutine. Do I understand right? If I got it, I think the problems told here is a little bit hard to handle. I can't find solutions right now.

I'm asking these questions, not to insist that we fix them all at once, but to make sure that we're setting ourselves up to fix them all and not just doing something short-term that ends up needing to be completely replaced with a more comprehensive fix. The central design questions to me are (1) do we do this during coro splitting or in the frontend and (2) to the extent that we don't do it solely during coro splitting, how do we protect against miscompilation?

Thanks for the good questions! It really helps. And I can't answer it directly right now. I'm just sharing my thoughts:
(1) If there are arguments who are passed by value without being marked by byval, inalloca or anything else, we should mark them in the frontend.
(2) After that, we could copy the arguments passed by value in coroutine. The downside is the extra copy.
(3) I think we could a peephole optimization to emit some of the extra copy in the caller side in the future.

And I think the Swift and MLIR may have the same problem. Since LLVM always emits the copy in the caller side.

And my thoughts about emiting the copy in the callee side in the front are:
(1) I can't find solutions for the questions you asked. And I don't think we can get a clear solution in the near future.
(2) It violates the principle of LLVM. Although I don't know the reason why LLVM choose to copy the argument in the caller side, I believe there must be some fundamental reasons to do so.
(3) It is much more complex to make it in the frontend at least in clang, although I know it isn't a strong argument that the solution is hard : ).
(4) If we decided to do it in the frontend, then we also need to implement it in the swift frontend and MLIR. I know it isn't my responsibility to implement the swift part and MLIR part. But it really looks redundant. From perspective of LLVM, I believe the reason why LLVM decide to implement the coroutine in the middle end is that LLVM want different languages to re-use the same code as much as possible.

In D101980#2750311, @ChuanqiXu wrote:

Thanks for the detailed explanation! I still had some questions.

An indirect argument is one that's implicitly passed by reference

Do you mean by value here? If an argument is passed by reference, it makes no sense the coroutine need to copy the argument by value. Since the reference is pointer really. And I think it is responsibility for the programmers to maintain the lifetime for the pointer. And the programmers should agree on there is only pointer in the coroutine frame.

Implicitly by reference, i.e. the argument type is not a reference type but the ABI just passes a pointer behind the scenes.

For example, most x86-64 ABIs will pass a struct like struct { char x[1024]; }; as a pointer to a temporary instead of on the stack. Normally, functions that take such arguments just bind the parameter variable directly to the pointer that's passed in; if that's a problem for coroutines, you're going to need to force a copy there as well.

From the statement, it looks like it is talking about passing argument by value. So I think it means there are some arguments who are passed by value in the language side but didn't be marked with byval. Did I get this? If it is true, I agree with that I need to find another method to handle it.

byval in LLVM IR has a very specific meaning: it means the argument is passed in the stack argument area. On the caller side there's an implicit copy from the argument pointer into the argument area, and on the callee side the parameter pointer value is directly bound to the position in the argument area. A lot of targets never use it for anything, because it's not considered beneficial for small arguments (like integers and small structs), and those targets pass large aggregates indirectly (as discussed above).

Indirect arguments are also used (on pretty much every target except Windows) for non-trivially-copyable class types. That gets tricky because unfortunately you can't just emit a correct move-construction during coroutine lowering; we'll have to do it in the frontend, if we aren't already.

For the case of move-construction, I think the current patch could handle it if the argument is marked with byval. For example:
coro_type foo(A); // A is move-constructible.
void bar() {
    A  a;
   // ... some operation
   foo(std::move(A));
}
And the corresponding IR should look like:
void bar() {
    %a = alloca A
   ; ... some operation
   %tmp = alloca A
   ; move-construct %tmp
   foo(%tmp)
}
coro_type foo(A* [byval] %a) {
   ; if a is marked with byval
   %tmp.a.byval = alloca A
   memcpy(%tmp.a.byval, %a)
}
It looks OK to me except the extra copy. The move-construction is already handled by the frontend.

You can't assume that move construction is equivalent to a memcpy. That's just not semantically correct. Only the frontend has this information.

That brings up a related problem: there are LLVM optimizations that assume things about parameters during a function. For example, the pointers passed as indirect arguments are generally assumed not to alias anything, which means that we can generally eliminate redundant things like, say, copies out of them that are perfectly happy to just use the original memory. That's not a problem for copies emitted during coroutine lowering, since we'll immediately split the function and make the coroutine non-redundant from LLVM's perspective, but I am worried that we might trigger these optimizations on copies we emit earlier, and that could mess things up.

I am not sure if I get your point right here. From my understanding, this paragraph and the following example tells about the problem we may met if we try to emit the copy for the argument passed by value in the callee site if the callee is a coroutine. Do I understand right? If I got it, I think the problems told here is a little bit hard to handle. I can't find solutions right now.

I am saying that there is a risk of miscompilation if LLVM applies assumptions that are valid in normal functions but aren't valid in unlowered coroutines. Probably the right solution is just to teach LLVM that those assumptions aren't valid in unlowered coroutines. You should probably bring it up on llvm-dev.

In D101980#2751187, @rjmccall wrote:

Implicitly by reference, i.e. the argument type is not a reference type but the ABI just passes a pointer behind the scenes.

Does it mean there are arguments who should be passed by value but are passed by reference actually? If it is true, we are in trouble.

byval in LLVM IR has a very specific meaning: it means the argument is passed in the stack argument area. On the caller side there's an implicit copy from the argument pointer into the argument area, and on the callee side the parameter pointer value is directly bound to the position in the argument area. A lot of targets never use it for anything, because it's not considered beneficial for small arguments (like integers and small structs), and those targets pass large aggregates indirectly (as discussed above).

I see, since the usage of byval is very limited. So the problems we can solve by using byval is very limited.

You can't assume that move construction is equivalent to a memcpy. That's just not semantically correct. Only the frontend has this information.

Yeah, I need to investigate more. The problem here I found is the destruction process. If there isn't destruction, I think it is fun to use memcpy here. But with destruction, we may be in trouble:

struct. A {
    void *data_ptr;
    A(A&&);
    ~A() {
        delete data_ptr;
    }
};

If we memcpy A into the frame, then it is possible that the data_ptr is already deleted when we need to use the data_ptr in the resumed coroutine. The model is kinda complex in my head. I need to check the c++ standard and the LLVM implementation more to make a further decision.

I am saying that there is a risk of miscompilation if LLVM applies assumptions that are valid in normal functions but aren't valid in unlowered coroutines. Probably the right solution is just to teach LLVM that those assumptions aren't valid in unlowered coroutines. You should probably bring it up on llvm-dev.

We may need to insert many guard codes in other passes if we want to teach LLVM the rules about coroutine. From my minds, the complexity of the codes becomes very high if we did so. I am not sure if it is the only right way to solve it. I would do more investigation. But I would be happy to bring it up to the llvm-dev first.

By the way, do you @rjmccall think this patch is acceptable when it doesn't fix all the problems? If yes, I would try to handle the case about inalloca and preallocated. If no, I would mark this with plan changes. From my point of view, I think we can work with it now at least it solves part of the problems.

In D101980#2752763, @ChuanqiXu wrote:

In D101980#2751187, @rjmccall wrote:

Implicitly by reference, i.e. the argument type is not a reference type but the ABI just passes a pointer behind the scenes.

Does it mean there are arguments who should be passed by value but are passed by reference actually? If it is true, we are in trouble.

That are semantically passed by value but "physically" passed as a pointer to a temporary? Yes, absolutely, depending on the target. Example: https://godbolt.org/z/oK1Yadjvq

byval in LLVM IR has a very specific meaning: it means the argument is passed in the stack argument area. On the caller side there's an implicit copy from the argument pointer into the argument area, and on the callee side the parameter pointer value is directly bound to the position in the argument area. A lot of targets never use it for anything, because it's not considered beneficial for small arguments (like integers and small structs), and those targets pass large aggregates indirectly (as discussed above).

I see, since the usage of byval is very limited. So the problems we can solve by using byval is very limited.

Right.

You can't assume that move construction is equivalent to a memcpy. That's just not semantically correct. Only the frontend has this information.

Yeah, I need to investigate more. The problem here I found is the destruction process. If there isn't destruction, I think it is fun to use memcpy here.

(I assume you meant "fine".) That's true, no. It is *uncommon* for a type to have a copy/move operation that isn't semantically equivalent to memcpy without also having a non-trivial destructor, but it's absolutely possible, both formally and in practice. Practical examples include relative pointers or structs containing a field signed with address-diversified pointer authentication.

I am saying that there is a risk of miscompilation if LLVM applies assumptions that are valid in normal functions but aren't valid in unlowered coroutines. Probably the right solution is just to teach LLVM that those assumptions aren't valid in unlowered coroutines. You should probably bring it up on llvm-dev.

We may need to insert many guard codes in other passes if we want to teach LLVM the rules about coroutine. From my minds, the complexity of the codes becomes very high if we did so. I am not sure if it is the only right way to solve it. I would do more investigation. But I would be happy to bring it up to the llvm-dev first.

By the way, do you @rjmccall think this patch is acceptable when it doesn't fix all the problems? If yes, I would try to handle the case about inalloca and preallocated. If no, I would mark this with plan changes. From my point of view, I think we can work with it now at least it solves part of the problems.

Well, inalloca is a problem, because we actually only ever use inalloca when there's an argument type with a non-trivial copy/move/destroy operation. If you'd like to land a short-term fix just for byref, I think that might be okay while we figure out on llvm-dev what to do in general.

In D101980#2752775, @rjmccall wrote:

That are semantically passed by value but "physically" passed as a pointer to a temporary? Yes, absolutely, depending on the target. Example: https://godbolt.org/z/oK1Yadjvq

(I assume you meant "fine".) That's true, no. It is *uncommon* for a type to have a copy/move operation that isn't semantically equivalent to memcpy without also having a non-trivial destructor, but it's absolutely possible, both formally and in practice. Practical examples include relative pointers or structs containing a field signed with address-diversified pointer authentication.

Many thanks to the sharing! Really appreciate it.

Well, inalloca is a problem, because we actually only ever use inalloca when there's an argument type with a non-trivial copy/move/destroy operation. If you'd like to land a short-term fix just for byref, I think that might be okay while we figure out on llvm-dev what to do in general.

Does it mean byval there? I would like to land a short-term fix for byval since the other approaches would take a long time.

In D101980#2752963, @ChuanqiXu wrote:

In D101980#2752775, @rjmccall wrote:

That are semantically passed by value but "physically" passed as a pointer to a temporary? Yes, absolutely, depending on the target. Example: https://godbolt.org/z/oK1Yadjvq

(I assume you meant "fine".) That's true, no. It is *uncommon* for a type to have a copy/move operation that isn't semantically equivalent to memcpy without also having a non-trivial destructor, but it's absolutely possible, both formally and in practice. Practical examples include relative pointers or structs containing a field signed with address-diversified pointer authentication.

(I meant to type "That's *not* true", but I think you got the point.)

Many thanks to the sharing! Really appreciate it.

No problem.

Well, inalloca is a problem, because we actually only ever use inalloca when there's an argument type with a non-trivial copy/move/destroy operation. If you'd like to land a short-term fix just for byref, I think that might be okay while we figure out on llvm-dev what to do in general.

Does it mean byval there? I would like to land a short-term fix for byval since the other approaches would take a long time.

Okay. I can review this patch in the meantime.

In D101980#2749774, @rjmccall wrote:
In D101980#2743798, @ChuanqiXu wrote:

In D101980#2743665, @rjmccall wrote:

How is the situation with byval different from the situation with indirect arguments? It sounds like we generally need to perform this extra copy into the coroutine frame once it's allocated, and yes, that's something we need to emit in the frontend because it might be non-trivial.

Or is the problem that LLVM might be helpfully eliminating the copy when the source is a byval parameter? But that's also something it could be doing with indirect arguments, so again, treating byval specially seems suspect.

Sorry. May I ask what does indirect argument mean? I tried to copy all byval parameters since it looks like if the frontend needs to pass parameters by value for structs, it would emit byval attributes.

An indirect argument is one that's implicitly passed by reference. For example, most x86-64 ABIs will pass a struct like struct { char x[1024]; }; as a pointer to a temporary instead of on the stack. Normally, functions that take such arguments just bind the parameter variable directly to the pointer that's passed in; if that's a problem for coroutines, you're going to need to force a copy there as well.

Indirect arguments are also used (on pretty much every target except Windows) for non-trivially-copyable class types. That gets tricky because unfortunately you can't just emit a correct move-construction during coroutine lowering; we'll have to do it in the frontend, if we aren't already.

That brings up a related problem: there are LLVM optimizations that assume things about parameters during a function. For example, the pointers passed as indirect arguments are generally assumed not to alias anything, which means that we can generally eliminate redundant things like, say, copies out of them that are perfectly happy to just use the original memory. That's not a problem for copies emitted during coroutine lowering, since we'll immediately split the function and make the coroutine non-redundant from LLVM's perspective, but I am worried that we might trigger these optimizations on copies we emit earlier, and that could mess things up.

For example, consider code like this:
struct A {
  char buffer[32];
  A(const A &);
};

A::A(const A &other) {
  memcpy(buffer, other.buffer, sizeof(buffer));
}

my_coroutine foo(A a) {
  ...
}
If we emit the copy of a in the frontend — which I think we have no choice but to do — it'll generate IR something like this:
define ... @foo(%struct.A* %0) {
  %a = alloca %struct.A
  call void @llvm.memcpy(%a, %0, ...)
  ...
}
If the LLVM optimizer sees this, it very well might just remove the memcpy and use %0 everywhere instead of %a, and then we'll miscompile the coroutine. I'm not sure of the best way to fix this; maybe we just have to disable those optimizations on unlowered coroutines.

I was thinking if pursuing D100415 could solve this problem by marking the init function optnone until the frame is constructed.

In D101980#2753246, @rjmccall wrote:

(I meant to type "That's *not* true", but I think you got the point.)

Yeah, I get your point is that there are cases memcpy can't work for movable classes even if they doesn't have non-trivial destructor. And you gives an example. I haven't looked into the example yet.

I had sent a mail to llvm-dev. Here I made a mistake that I send this to cfe-dev and llvm-dev which may disturb some people. Let's see if there is any opinions.

In D101980#2753331, @ychen wrote:

I was thinking if pursuing D100415 could solve this problem by marking the init function optnone until the frame is constructed.

I think we may still met the problem even if we split the coroutine. The key problem here is the coroutine didn't copy the argument right.

Sorry for being late in reviewing this patch. I have been out of work for a few days.

The description of the problem is not accurate.
Even for byval parameters, their content will still be copied into the local stack for coroutines correctly. This is handled by the front-end.
So parameter copying is not the problem here.
The problem is exactly what @rjmccall said about the assumptions optimizations typically make about parameters.
The bug linked in this patch, is an example of such issue: for a byval parameter A, a memcpy from A to B, followed by another memcpy from B to C, will be optimized into a memcpy from A to C. And if the memcpy of B to C happens after a coroutine suspension, this leads to memory corruption because A would have died by that point.
D101510 is one attempt I tried to fix this, but it's too hacky.
D100415 is more long-term approach, but it's much harder than I thought and I don't know how long it will take.
So the idea for this patch is to see if we can find a way to fix this problem within the coroutine passes only.

In D101980#2754257, @lxfind wrote:

Sorry for being late in reviewing this patch. I have been out of work for a few days.

The description of the problem is not accurate.
Even for byval parameters, their content will still be copied into the local stack for coroutines correctly. This is handled by the front-end.
So parameter copying is not the problem here.
The problem is exactly what @rjmccall said about the assumptions optimizations typically make about parameters.
The bug linked in this patch, is an example of such issue: for a byval parameter A, a memcpy from A to B, followed by another memcpy from B to C, will be optimized into a memcpy from A to C. And if the memcpy of B to C happens after a coroutine suspension, this leads to memory corruption because A would have died by that point.
D101510 is one attempt I tried to fix this, but it's too hacky.
D100415 is more long-term approach, but it's much harder than I thought and I don't know how long it will take.
So the idea for this patch is to see if we can find a way to fix this problem within the coroutine passes only.

Thanks for the clarifying! It is sorry to make a mistake here to waste reviewers's time. I would take a closer look. It looks like the patch now is just a bad fix.

Here is an idea:
we could introduce an intrinsics that doesn't do anything but just server the purpose of telling the compiler to not optimize out a pointer (i.e. indicating maywrite to the pointer).
And in front end, for every copy we create from the parameter, we mark the local copy with that intrinsics. This will make sure that memcpyopt will never optimize it out.

In D101980#2757573, @lxfind wrote:

Here is an idea:
we could introduce an intrinsics that doesn't do anything but just server the purpose of telling the compiler to not optimize out a pointer (i.e. indicating maywrite to the pointer).
And in front end, for every copy we create from the parameter, we mark the local copy with that intrinsics. This will make sure that memcpyopt will never optimize it out.

It sounds like we still need to touch other passes other coroutine.

I spent some time to think about the solution we could do in the coroutine passes only. And I only get something really like workaround:

If we found the uses of byval argument in the spills
     we could emit an extra alloca and a copy to replace the byval argument.

The insight is the byval argument shouldn't come up in the coroutine frame. There should be an alloca to replace all the uses of the byval argument in the coroutine. If there isn't one, it means that the alloca must be optimized by some passes, so we need to emit it again.
But this solution looks really like workaround which tries to solve bug 48857. And I found this solution can't solve the the problem in all environment. Like https://godbolt.org/z/K63KP8eaz. Here should be the implicit argument mentioned by @rjmccall. There are cases where arguments are passed by value semantically but passed by reference actually. Then if we decide to go to this direction, the only solution I found is to emit another attribute, although this attribute has the same semantic with byval.

It sounds like we still need to touch other passes other coroutine.

No we don't need to. Just need to modify CGCoroutine

Something like this: D102465. Let me know what you think.

In D101980#2758780, @lxfind wrote:

Something like this: D102465. Let me know what you think.

Oh it is tricky. Since we add a new intrinsic use for the alloca and the intrinsic can't be analyzed by other passes. So the alloca we want wouldn't be optimized. The overall ideas looks good. For the details, I would like to emit the intrinsic for the arguments who are passed by value only. I think we could dive into it further.

In D101980#2758843, @ChuanqiXu wrote:

In D101980#2758780, @lxfind wrote:

Something like this: D102465. Let me know what you think.

Oh it is tricky. Since we add a new intrinsic use for the alloca and the intrinsic can't be analyzed by other passes. So the alloca we want wouldn't be optimized. The overall ideas looks good. For the details, I would like to emit the intrinsic for the arguments who are passed by value only. I think we could dive into it further.

Yeah and that's much easier to figure out in the front-end than in the IR.

In D101980#2758843, @ChuanqiXu wrote:

In D101980#2758780, @lxfind wrote:

Something like this: D102465. Let me know what you think.

Oh it is tricky. Since we add a new intrinsic use for the alloca and the intrinsic can't be analyzed by other passes. So the alloca we want wouldn't be optimized. The overall ideas looks good. For the details, I would like to emit the intrinsic for the arguments who are passed by value only. I think we could dive into it further.

by the way, I could also proceed with D102465 first while you figure out how to optimize it, if it's going to take a while. The sooner we fix the bug the better I think

In D101980#2767416, @lxfind wrote:

In D101980#2758843, @ChuanqiXu wrote:

In D101980#2758780, @lxfind wrote:

Something like this: D102465. Let me know what you think.

Oh it is tricky. Since we add a new intrinsic use for the alloca and the intrinsic can't be analyzed by other passes. So the alloca we want wouldn't be optimized. The overall ideas looks good. For the details, I would like to emit the intrinsic for the arguments who are passed by value only. I think we could dive into it further.

by the way, I could also proceed with D102465 first while you figure out how to optimize it, if it's going to take a while. The sooner we fix the bug the better I think

I think D102465 may be better than this patch to fix bug 48857. We could move on that one. Do you mean that it would take some time not to emit the intrinsic for who are not passed by value? Do you suggest that we should fix the bug first by mark all the arguments then we try to remove the unneeded marks in successive patch? If yes, I think we could try to look at it for about one week to find solutions. If we can't find, I am OK to mark all the arguments as a temporary fix. After all, correctness is most important.

adrianimboden added a subscriber: adrianimboden.Jun 3 2021, 11:33 AM

lxfind mentioned this in D104184: [Coroutine] Properly deal with byval and noalias parameters.Jun 13 2021, 8:54 AM

lxfind mentioned this in rG3522167efd80: [Coroutine] Properly deal with byval and noalias parameters.Jun 17 2021, 7:06 PM

ChuanqiXu abandoned this revision.Jul 13 2021, 4:20 AM

Revision Contents

Path

Size

llvm/

lib/

Transforms/

Coroutines/

CoroFrame.cpp

52 lines

test/

Transforms/

Coroutines/

coro-alloca-byval-param.ll

59 lines

Diff 343798

llvm/lib/Transforms/Coroutines/CoroFrame.cpp

Show First 20 Lines • Show All 1,535 Lines • ▼ Show 20 Lines	for (BasicBlock &BB : F)
if (auto *PN = dyn_cast<PHINode>(&BB.front()))		if (auto *PN = dyn_cast<PHINode>(&BB.front()))
if (PN->getNumIncomingValues() > 1)		if (PN->getNumIncomingValues() > 1)
WorkList.push_back(&BB);		WorkList.push_back(&BB);

for (BasicBlock *BB : WorkList)		for (BasicBlock *BB : WorkList)
rewritePHIs(*BB);		rewritePHIs(*BB);
}		}

		/// Generates allocas for byval arguments to make sure they would be put in the
		/// frame.
		///
		/// Consider the following C++ example:
		/// ```
		/// void foo() {
		/// A a;
		/// // Some works with a
		/// coro_func(a);
		/// // Other works with a
		/// }
		/// ```
		///
		/// From the perspective of a C++ programmer, there would be a storage for A in
		/// the frame of `coro_func` if necessary. However, the IR generated in foo
		/// would be:
		/// ```
		/// void foo() {
		/// %a = Alloca A
		/// // Some works with %a
		/// %a.byval-temp = Alloca A
		/// memcpy(%a.byval-temp, %a)
		/// coro_func(%a.byval-temp);
		/// // Other works with %a
		/// }
		/// ```
		/// The type of the argument in `coro_func` would be a pointer instead of `A`,
		/// which would be put in the frame. It may cause many problems.
		///
		/// This function tries to solve the above problem by copying the byval
		/// arguments.
		/// FIXME: It would cause an extra copy. Try to eliminate it.
		static void rewriteByValArguments(Function &F) {
		IRBuilder<> Builder(F.getEntryBlock().getFirstNonPHI());
		const DataLayout &DL = F.getParent()->getDataLayout();

		for (Argument &Arg : F.args()) {
		if (!Arg.hasByValAttr())
		continue;

		Type *ByValType = Arg.getParamByValType();
		auto *ByValAlloca =
		Builder.CreateAlloca(ByValType, 0, nullptr, Arg.getName() + ".byval");
		Arg.replaceAllUsesWith(ByValAlloca);
		Value *Size = Builder.getInt64(DL.getTypeStoreSize(ByValType));
		Builder.CreateMemCpy(ByValAlloca, Arg.getParamAlign(), &Arg,
		Arg.getParamAlign(), Size);
		}
		}

// Check for instructions that we can recreate on resume as opposed to spill		// Check for instructions that we can recreate on resume as opposed to spill
// the result into a coroutine frame.		// the result into a coroutine frame.
static bool materializable(Instruction &V) {		static bool materializable(Instruction &V) {
return isa<CastInst>(&V) \|\| isa<GetElementPtrInst>(&V) \|\|		return isa<CastInst>(&V) \|\| isa<GetElementPtrInst>(&V) \|\|
isa<BinaryOperator>(&V) \|\| isa<CmpInst>(&V) \|\| isa<SelectInst>(&V);		isa<BinaryOperator>(&V) \|\| isa<CmpInst>(&V) \|\| isa<SelectInst>(&V);
}		}

// Check for structural coroutine intrinsics that should not be spilled into		// Check for structural coroutine intrinsics that should not be spilled into
▲ Show 20 Lines • Show All 641 Lines • ▼ Show 20 Lines	if (auto *AsyncEnd = dyn_cast<CoroAsyncEndInst>(CE)) {
splitAround(Call, "MustTailCall.Before.CoroEnd");		splitAround(Call, "MustTailCall.Before.CoroEnd");
}		}
}		}

// Transforms multi-edge PHI Nodes, so that any value feeding into a PHI will		// Transforms multi-edge PHI Nodes, so that any value feeding into a PHI will
// never has its definition separated from the PHI by the suspend point.		// never has its definition separated from the PHI by the suspend point.
rewritePHIs(F);		rewritePHIs(F);

		rewriteByValArguments(F);

// Build suspend crossing info.		// Build suspend crossing info.
SuspendCrossingInfo Checker(F, Shape);		SuspendCrossingInfo Checker(F, Shape);

IRBuilder<> Builder(F.getContext());		IRBuilder<> Builder(F.getContext());
FrameDataInfo FrameData;		FrameDataInfo FrameData;
SmallVector<CoroAllocaAllocInst*, 4> LocalAllocas;		SmallVector<CoroAllocaAllocInst*, 4> LocalAllocas;
SmallVector<Instruction*, 4> DeadInstructions;		SmallVector<Instruction*, 4> DeadInstructions;

▲ Show 20 Lines • Show All 93 Lines • Show Last 20 Lines

llvm/test/Transforms/Coroutines/coro-alloca-byval-param.ll

This file was added.

				; Check that we can handle the case when both alloc function and
				; the user body consume the same argument.
				; RUN: opt < %s -coro-split -S \| FileCheck %s
				; RUN: opt < %s -passes=coro-split -S \| FileCheck %s

				; using this directly (as it would happen under -O2)
				define i8* @f_direct(i64* byval(i64) align 8 %a) "coroutine.presplit"="1" {
				entry:
				%id = call token @llvm.coro.id(i32 0, i8* null, i8* null, i8* null)
				%size = call i32 @llvm.coro.size.i32()
				%alloc = call i8* @myAlloc(i64* %a, i32 %size)
				%hdl = call i8* @llvm.coro.begin(token %id, i8* %alloc)
				%0 = call i8 @llvm.coro.suspend(token none, i1 false)
				switch i8 %0, label %suspend [i8 0, label %resume
				i8 1, label %cleanup]
				resume:
				call void @print2(i64* %a)
				br label %cleanup

				cleanup:
				%mem = call i8* @llvm.coro.free(token %id, i8* %hdl)
				call void @free(i8* %mem)
				br label %suspend
				suspend:
				call i1 @llvm.coro.end(i8* %hdl, i1 0)
				ret i8* %hdl
				}

				; See if the content of %a would be put to the frame
				; CHECK: %f_direct.Frame = type { void (%f_direct.Frame), void (%f_direct.Frame), i64, i1 }

				; See that %a.byval is generated and initialized
				; CHECK-LABEL: define i8* @f_direct(i64* byval(i64) align 8 %a)
				; CHECK: %a.byval = alloca i64, align 8
				; CHECK: %0 = bitcast i64* %a.byval to i8*
				; CHECK: %1 = bitcast i64* %a to i8*
				; CHECK: call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %0, i8* align 8 %1, i64 8, i1 false)

				; See that %a.byval was loaded from the frame
				; CHECK-LABEL: @f_direct.resume(
				; CHECK: %a.byval.reload.addr = getelementptr inbounds %f_direct.Frame, %f_direct.Frame* %FramePtr, i32 0, i32 2
				; CHECK: call void @print2(i64* %a.byval.reload.addr)
				; CHECK: ret void

				declare i8* @llvm.coro.free(token, i8*)
				declare i32 @llvm.coro.size.i32()
				declare i8 @llvm.coro.suspend(token, i1)
				declare void @llvm.coro.resume(i8*)
				declare void @llvm.coro.destroy(i8*)

				declare token @llvm.coro.id(i32, i8, i8, i8*)
				declare i1 @llvm.coro.alloc(token)
				declare i8* @llvm.coro.begin(token, i8*)
				declare i1 @llvm.coro.end(i8*, i1)

				declare noalias i8* @myAlloc(i64*, i32)
				declare double @print(double)
				declare void @print2(i64*)
				declare void @free(i8*)