This is an archive of the discontinued LLVM Phabricator instance.

[Coroutines] Handle overaligned frame allocation (2)
AbandonedPublic

Authored by ychen on Apr 18 2021, 10:39 PM.

Details

Summary

This is an alternative to D97915 which missed proper deallocation of the
over-allocated frame. This patch handles both allocations and deallocations.

Both user-defined promise/local variables and compiler synthesized local variables could cause the coroutine frame to be overaligned. There are some related descriptions in http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2014r0.pdf (issue #3).

Contrary to D97915, this patch implements the over-allocation in the backend instead of the frontend since

  • The alloca of the raw frame pointer (suppose we insert it in the frontend) would be included in the non-overaligned frame if we don't teach CoroFrame how to elide it.
  • Only insert extra code when it is known that the frame is overaligned.
  • Simpler implementation.
  • Clients could turn it on/off by using llvm.coro.size.aligned instead of llvm.coro.size. It indicates to LLVM that it should handle overaligned frames.

Overall approach:

  • Overalignment handling is only performed when llvm.coro.size.aligned is used.
  • llvm.coro.size.aligned returns (sizeof(coroutine frame) + max(0, alignof(coroutine frame) - STDCPP_DEFAULT_NEW_ALIGNMENT), which, when coroutine is not overaligned, equal to llvm.coro.size.
  • In CoroFrame, immediately after the alignment of frame is known to be overaligned, extra code is emitted to 1) create a new alloca to store the malloc returned memory address. 2) emit dynamic alignment adjust code to make sure frame ptr address aligns correctly. 3) remember the frame index for the newly created alloca, use it later for deallocation
  • In coro::replaceCoroFree, when it is decided that heap allocation could not be elided, instead of free(<frame ptr>) do Value *v = gep <frame ptr>, 0, frame_ptr_addr_index; free(load(v))
  • Let Clang switch to use llvm.coro.size.aligned. When in the future, Clang gains support to handle this by itself, switch it back to use llvm.coro.size.

https://reviews.llvm.org/P8260 is a IR diff (llvm.coro.size vs llvm.coro.size.aligned) for the test case in llvm/test/Transforms/Coroutines/coro-padding.ll.

Diff Detail

Event Timeline

ychen created this revision.Apr 18 2021, 10:39 PM
ychen requested review of this revision.Apr 18 2021, 10:39 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptApr 18 2021, 10:39 PM
ychen updated this revision to Diff 338430.Apr 18 2021, 10:42 PM

fix comment

ychen updated this revision to Diff 338431.Apr 18 2021, 10:46 PM

Fix comment. Ready for review.

I hadn't looked into the details. I would try to make it.
But from my understanding to this problem, the correct solution should remain the previous behavior if the end user doesn't specify alignas for promise_type. I mean, it shouldn't make so many test cases fail.

The alloca of the raw frame pointer (suppose we insert it in the frontend) would be included in the non-overaligned frame if we don't teach CoroFrame how to elide it.

It is confusing to me, maybe I need to look into the codes. First, the raw frame pointer isn't inserted in the frontend. Do you mean coro.begin? Then, what's the relationship between eliding and over-aligned? It also makes me confused.

ychen added a comment.Apr 19 2021, 9:21 PM

I hadn't looked into the details. I would try to make it.
But from my understanding to this problem, the correct solution should remain the previous behavior if the end user doesn't specify alignas for promise_type. I mean, it shouldn't make so many test cases fail.

The failures mostly are due to the llvm.coro.size() -> llvm.coro.size(i1 alloc) intrinsic signature change. The rest are due to the align arg of llvm.coro.id is illegal 0 in most of the tests. This patch would read that value causing asssertions.

The alloca of the raw frame pointer (suppose we insert it in the frontend) would be included in the non-overaligned frame if we don't teach CoroFrame how to elide it.

It is confusing to me, maybe I need to look into the codes. First, the raw frame pointer isn't inserted in the frontend. Do you mean coro.begin? Then, what's the relationship between eliding and over-aligned? It also makes me confused.

Since we don't know if the frame is overaligned or not until after the frame type is decided in CoroFrame, *suppose* we do this (adding raw frame ptr alloca in frontend regardless of the presence of overalignment or not) in the frontend, the code generated from CGCoroutine would be like this:

%1 = alloc i8*
%2 = malloc(x)
store %2, %1
..
..
%11 = load %1
free(%11)

The alloca %1 must be there even if the frame is not aligned. Then in the CoroFrame, we have to check if the frame is really overaligned and if not, reverse the above patterns. Otherwise, the alloca %1 would be in the frame needlessly and the following optimizations could not reliably remove it.

This patch defers adding the alloca until CoroFrame where we know for sure that the frame is aligned.

ychen edited the summary of this revision. (Show Details)Apr 19 2021, 9:24 PM

I hadn't looked into the details. I would try to make it.
But from my understanding to this problem, the correct solution should remain the previous behavior if the end user doesn't specify alignas for promise_type. I mean, it shouldn't make so many test cases fail.

The failures mostly are due to the llvm.coro.size() -> llvm.coro.size(i1 alloc) intrinsic signature change. The rest are due to the align arg of llvm.coro.id is illegal 0 in most of the tests. This patch would read that value causing asssertions.

The alloca of the raw frame pointer (suppose we insert it in the frontend) would be included in the non-overaligned frame if we don't teach CoroFrame how to elide it.

It is confusing to me, maybe I need to look into the codes. First, the raw frame pointer isn't inserted in the frontend. Do you mean coro.begin? Then, what's the relationship between eliding and over-aligned? It also makes me confused.

Since we don't know if the frame is overaligned or not until after the frame type is decided in CoroFrame, *suppose* we do this (adding raw frame ptr alloca in frontend regardless of the presence of overalignment or not) in the frontend, the code generated from CGCoroutine would be like this:

%1 = alloc i8*
%2 = malloc(x)
store %2, %1
..
..
%11 = load %1
free(%11)

The alloca %1 must be there even if the frame is not aligned. Then in the CoroFrame, we have to check if the frame is really overaligned and if not, reverse the above patterns. Otherwise, the alloca %1 would be in the frame needlessly and the following optimizations could not reliably remove it.

This patch defers adding the alloca until CoroFrame where we know for sure that the frame is aligned.

The semantics of alloca and free in the example look similar with coro.begin and coro.end.

lxfind added inline comments.Apr 20 2021, 5:36 PM
clang/docs/LanguageExtensions.rst
2689 ↗(On Diff #338431)

Do we need to change __builtin_coro_size? The argument will always be 1, right?
It only starts to change in LLVM intrinsics, if I read the impl correctly.

ychen added inline comments.Apr 20 2021, 5:52 PM
clang/docs/LanguageExtensions.rst
2689 ↗(On Diff #338431)

Yeah, It is always 1 for Clang until the spec is fixed (then we could revert it back to 0). Other clients using __builtin_coro_size may use 0 if the client doesn't care about overaligned frame or it could handle overaligned frame by itself.

There is something I am still confused about these two patch. Maybe I don't get the problem right.
The example problem shows if user uses alignas to specify the alignment for promise_type, the actual alignment for the promise isn't correctly due to compiler didn't call new method with alignment which isn't specified in the spec.
Then these two patches are trying to add paddings to the frame type to make the alignment of promise_type right. Here is a gap, in fact the problem comes from the alignment promise. But we want to solve it by over align the frame. It is odd for me.
I wonder if it is possible to simply adjust the alignment for the alloca of promise, like:

%promise = alloca %promise_type, align 64
clang/docs/LanguageExtensions.rst
2689 ↗(On Diff #338431)

BTW, is it OK to edit the builtins directly? Since builtin is different with intrinsic which is only visible in the internal of compiler, builtin could be used by any end users. Although I know there should be little users who would use __builtin_coro APIs, I worry if there is any guide principle for editing the builtins.

llvm/docs/Coroutines.rst
958

Maybe I was missing something. I think these two intrinsic should take one i8* argument to specify the coroutine handle. Otherwise, it may be confusing if there are some coroutines get inlined into other coroutine.

ychen added inline comments.Apr 20 2021, 8:37 PM
clang/docs/LanguageExtensions.rst
2689 ↗(On Diff #338431)

BTW, is it OK to edit the builtins directly? Since builtin is different with intrinsic which is only visible in the internal of compiler, builtin could be used by any end users. Although I know there should be little users who would use __builtin_coro APIs, I worry if there is any guide principle for editing the builtins.

I think it is ok to change these if it is justified like anything else.

builtins/intrinsics are interfaces on different levels. I'm trying to make __builtin_coro_size consistent with llvm.coro.size because I don't have a good reason for not doing that. (assume that we keep this opt-in overaligned frame handling in LLVM even after the spec is fixed since it helps solve a practical problem and the maintenance cost is low)

ChuanqiXu added inline comments.Apr 20 2021, 8:51 PM
clang/docs/LanguageExtensions.rst
2689 ↗(On Diff #338431)

It doesn't make sense to me that we need to change the signature for __builtin_coro_size in this patch. In other words, why do we need to change __builtin_coro_size ? What are problems that can't be solved if we don't change __builtin_coro_size? At least, if it is necessary to change __builtin_coro_size, we could make it in successive patches.

lxfind added inline comments.Apr 21 2021, 2:38 PM
clang/docs/LanguageExtensions.rst
2689 ↗(On Diff #338431)

Yeah I agree with ChuanqiXu, there is no need to make the builtin to be exactly the same as the llvm intrinsics just because they have the same name. Many of them are different even though they have the same name.

Thanks for working on this.
I am still having a bit hard time understanding the solution.
A few questions:

  1. I assume this patch is to solve the problem where the promise object is not aligned according to its alignof annotation, right? The title/wording is a bit misleading. Usually "handling XXX" means XXX is a situation/problem that wasn't handle properly before, and it's being handled here. I don't really understand what "handle overaligned frame allocation" means. Isn't frame allocation under-aligned being the problem?
  2. What is the purpose of coro.align intrinsic?
  3. Could you provide some examples of what the IR might look like after this patch? Either that or a more detailed explanation of how this works in the summary.
  4. Do you think it might be cleaner to introduce a new variant of coro.size instead of adding arguments to it? For example, coro.size.aligned(). This way, you can avoid changing any test file for non-switch-lowering test files, but focus on all switch-lowering tests.
  5. Typically, coro.free is used by a comparison with nullptr. This is to enable CoroElide. See: https://llvm.org/docs/Coroutines.html#llvm-coro-free-intrinsic. So I don't think you can load from it directly.
ychen edited the summary of this revision. (Show Details)Apr 22 2021, 11:11 PM
ychen edited the summary of this revision. (Show Details)
ychen edited the summary of this revision. (Show Details)Apr 22 2021, 11:48 PM

What is the purpose of the builtin? Where is it being used? Typically you *can't* change the signature of a builtin because the builtin is itself a language feature that's documented to have a particular signature. If you've made a builtin purely for use in generated AST, that's pretty unfortunate, and you should consider whether you actually have to do that instead of e.g. synthesizing a call to an allocation function the same way that we do in new expressions.

ychen updated this revision to Diff 339900.Apr 22 2021, 11:49 PM
ychen edited the summary of this revision. (Show Details)
  • Address feebacks.

Thanks for working on this.
I am still having a bit hard time understanding the solution.
A few questions:

  1. I assume this patch is to solve the problem where the promise object is not aligned according to its alignof annotation, right? The title/wording is a bit misleading. Usually "handling XXX" means XXX is a situation/problem that wasn't handle properly before, and it's being handled here. I don't really understand what "handle overaligned frame allocation" means. Isn't frame allocation under-aligned being the problem?

Sorry for the confusion. I think either overaligned or under-aligned could be used here to describe the problem: either "Handle overaligned frame" or "Fix under-aligned frame". Since c++ spec defines the former but not the later (https://en.cppreference.com/w/cpp/language/object#Alignment), my first intuition was to use the term "overalign". Under-aligned is the undesired outcome that should be fixed (probably too late to handle I assume). Also the overaligned is a static property whereas 'under-aligned" is a runtime property. From the compiler's perspective, I think overaligned should be preferred. With that said, I don't feel strongly about this. I could switch to use "overaligned" if that feels more intuitive.

  1. What is the purpose of coro.align intrinsic?

To communicate frame alignment to the frontend. It shouldn't be needed for this patch, so I've removed it.

  1. Could you provide some examples of what the IR might look like after this patch? Either that or a more detailed explanation of how this works in the summary.

Yep, please see the updated description. And a new test is added.

  1. Do you think it might be cleaner to introduce a new variant of coro.size instead of adding arguments to it? For example, coro.size.aligned(). This way, you can avoid changing any test file for non-switch-lowering test files, but focus on all switch-lowering tests.

Agree, I've thought about variadic intrinsic and this new intrinsic, I think using new intrinsic is more flexible and avoids test fixup.

  1. Typically, coro.free is used by a comparison with nullptr. This is to enable CoroElide. See: https://llvm.org/docs/Coroutines.html#llvm-coro-free-intrinsic. So I don't think you can load from it directly.

Agree, I've changed to do it in coro::replaceCoroFree.

@ChuanqiXu @lxfind Thanks a lot for the feedback. I've updated the description and addressed the existing comments. Please take a look.

What is the purpose of the builtin? Where is it being used? Typically you *can't* change the signature of a builtin because the builtin is itself a language feature that's documented to have a particular signature. If you've made a builtin purely for use in generated AST, that's pretty unfortunate, and you should consider whether you actually have to do that instead of e.g. synthesizing a call to an allocation function the same way that we do in new expressions.

Well, the intention was not *only* use it in AST, it could be used by clients to ask LLVM to handle overaligned frame. I'm not sure how many use cases that could have, so in the updated patch, I've removed it.

ychen updated this revision to Diff 339902.Apr 23 2021, 12:00 AM

Fix typo.

ychen updated this revision to Diff 339906.Apr 23 2021, 12:14 AM

fix typo.

ychen edited the summary of this revision. (Show Details)Apr 23 2021, 12:16 AM
ChuanqiXu added a comment.EditedApr 23 2021, 3:23 AM

This is an alternative to D97915 which missed proper deallocation of the over-allocated frame. This patch handles both allocations and deallocations.

If D97915 is not needed, we should abandon it.

For the example shows in D97915, it says:

#include <experimental/coroutine>
#include <iostream>
#include <stdexcept>
#include <thread>
#include <cassert>

struct task{
  struct alignas(64) promise_type {
    task get_return_object() { return {}; }
    std::experimental::suspend_never initial_suspend() { return {}; }
    std::experimental::suspend_never final_suspend() noexcept { return {}; }
    void return_void() {}
    void unhandled_exception() {}
  };
  using handle = std::experimental::coroutine_handle<promise_type>;
};

auto switch_to_new_thread() {
  struct awaitable {
    bool await_ready() { return false; }
    void await_suspend(task::handle h) {
      auto i = reinterpret_cast<std::uintptr_t>(&h.promise());
      std::cout << i << std::endl;
      assert(i % 64 == 0);
    }
    void await_resume() {}
  };
  return awaitable{};
}

task resuming_on_new_thread() {
  co_await switch_to_new_thread();
}

int main() {
  resuming_on_new_thread();
}

The assertion would fail. If this is the root problem, I think we could adjust the align for the promise alloca like:

%promise = alloca %promise_type, align 8

into

%promise = alloca %promise_type, align 128

In other words, if this the problem we need to solve, I think we could make it in a simpler way.

Then I looked into the document you give in the summary. The issue#3 says the frontend can't do some work in the process of template instantiation due to the frontend doesn't know about the align and size of the coroutine. But from the implementation, it looks like not the problem this patch wants to solve.

I am really confused about the problem. Could you please restate your problem more in more detail? For example, would it make the alignment incorrect like the example above? Or does we want the frontend to get alignment information? Then what would be affected? From the title, I can guess the size of frame would get bigger. But how big would it be? Who would control and determine the final size?

Sorry for the confusion. I think either overaligned or under-aligned could be used here to describe the problem: either "Handle overaligned frame" or "Fix under-aligned frame". Since c++ spec defines the former but not the later (https://en.cppreference.com/w/cpp/language/object#Alignment), my first intuition was to use the term "overalign". Under-aligned is the undesired outcome that should be fixed (probably too late to handle I assume). Also the overaligned is a static property whereas 'under-aligned" is a runtime property. From the compiler's perspective, I think overaligned should be preferred. With that said, I don't feel strongly about this. I could switch to use "overaligned" if that feels more intuitive.

"Handle" is probably not the right word to be used here. What follows "handle" is typically a legit situation that already occurred but not current handled properly. Here "overaligned frame" doesn't already occur. From what I understand, you really just want to support promise object alignment. So why not just say that directly?
To add on that, I do think you need to describe the problem in more detail in the description. It's indeed still confusing.

ychen added a comment.EditedApr 23 2021, 2:36 PM

This is an alternative to D97915 which missed proper deallocation of the over-allocated frame. This patch handles both allocations and deallocations.

If D97915 is not needed, we should abandon it.

For the example shows in D97915, it says:

#include <experimental/coroutine>
#include <iostream>
#include <stdexcept>
#include <thread>
#include <cassert>

struct task{
  struct alignas(64) promise_type {
    task get_return_object() { return {}; }
    std::experimental::suspend_never initial_suspend() { return {}; }
    std::experimental::suspend_never final_suspend() noexcept { return {}; }
    void return_void() {}
    void unhandled_exception() {}
  };
  using handle = std::experimental::coroutine_handle<promise_type>;
};

auto switch_to_new_thread() {
  struct awaitable {
    bool await_ready() { return false; }
    void await_suspend(task::handle h) {
      auto i = reinterpret_cast<std::uintptr_t>(&h.promise());
      std::cout << i << std::endl;
      assert(i % 64 == 0);
    }
    void await_resume() {}
  };
  return awaitable{};
}

task resuming_on_new_thread() {
  co_await switch_to_new_thread();
}

int main() {
  resuming_on_new_thread();
}

The assertion would fail. If this is the root problem, I think we could adjust the align for the promise alloca like:

The problem is that any member of the coroutine frame could be overaligned (thus make the frame overaligned) including promise, local variables, spills. The problem is *not* specific to promise.

%promise = alloca %promise_type, align 8

into

%promise = alloca %promise_type, align 128

In other words, if this the problem we need to solve, I think we could make it in a simpler way.

This may not fix the problem.

Then I looked into the document you give in the summary. The issue#3 says the frontend can't do some work in the process of template instantiation due to the frontend doesn't know about the align and size of the coroutine. But from the implementation, it looks like not the problem this patch wants to solve.

I meant to use that as a reference to help describe the problem (but not the solution). The document itself includes both problem statements (issue#3) and solutions (frontend-based) which are totally unrelated to this patch. It looks like it is not that useful in this case so please disregard that.

I am really confused about the problem. Could you please restate your problem more in more detail? For example, would it make the alignment incorrect like the example above? Or does we want the frontend to get alignment information? Then what would be affected? From the title, I can guess the size of frame would get bigger. But how big would it be? Who would control and determine the final size?

understood.

There are two kinds of alignments: the alignment of a type/object at compile-time (ABI specified or user-specified), and the alignment the object of that type actually gets during runtime. The compiler assumes that the alignment of a struct is the maximal alignment of all its members. However, that assumption may not be true at runtime where the memory allocator may return a memory block that has insufficient alignment which causes some members aligned incorrectly.

For C++ coroutine, right now the default memory allocator could only return 16 bytes aligned memory block. When any member of the coroutine frame (promise, local variables, spills etc.) has alignment > 16, the frame becomes overaligned. This could only be fixed dynamically at runtime: by over-allocating memory and then adjust the frame start address so that it aligns correctly.

For example, suppose malloc returns 16 bytes aligned address 16, how do we make it 64 bytes aligned? align 16 up to an address that is 64 bytes aligned which is 64, so the adjustment amount is 64-16=48

Another similar example, suppose malloc returns 16 bytes aligned address 32, how do we make it 64 bytes aligned? align 32 up to an address that is 64 bytes aligned which is 64, so the adjustment amount is 64-32=32

Another similar example, suppose malloc returns 16 bytes aligned address 48, how do we make it 64 bytes aligned? align 48 up to an address that is 64 bytes aligned which is 64, so the adjustment amount is 64-48=16

Another similar example, suppose malloc returns 16 bytes aligned address 64, how do we make it 64 bytes aligned? align 64 up to an address that is 64 bytes aligned which is 64, so the adjustment amount is 64-64=0

So the mamximal adjustment amount is 64-16=48. We don't know until runtime if the malloc returned address X is (X % 64 == 0) or (X % 64 == 16) or (X % 64 == 32) or (X % 64 == 48), so we must emit extra code to deal with all cases (by bitwise operations).

ychen added a comment.Apr 23 2021, 2:44 PM

Sorry for the confusion. I think either overaligned or under-aligned could be used here to describe the problem: either "Handle overaligned frame" or "Fix under-aligned frame". Since c++ spec defines the former but not the later (https://en.cppreference.com/w/cpp/language/object#Alignment), my first intuition was to use the term "overalign". Under-aligned is the undesired outcome that should be fixed (probably too late to handle I assume). Also the overaligned is a static property whereas 'under-aligned" is a runtime property. From the compiler's perspective, I think overaligned should be preferred. With that said, I don't feel strongly about this. I could switch to use "overaligned" if that feels more intuitive.

Here "overaligned frame" doesn't already occur.

It does occur. FrameAlign > Shape.getSwitchCoroId()->getAlignment()) this check reflects the definition of C++ spec's definition of overalign.

From what I understand, you really just want to support promise object alignment. So why not just say that directly?

This patch does not deal with promise in any specific way. It treats promise just like any other frame members.

To add on that, I do think you need to describe the problem in more detail in the description. It's indeed still confusing.

Yep, will do that.

ychen added inline comments.Apr 25 2021, 6:55 PM
llvm/lib/Transforms/Coroutines/CoroFrame.cpp
854

This seems to align with what this mem argument is designed for: https://reviews.llvm.org/D66230#1631860

This is an alternative to D97915 which missed proper deallocation of the over-allocated frame. This patch handles both allocations and deallocations.

If D97915 is not needed, we should abandon it.

For the example shows in D97915, it says:

#include <experimental/coroutine>
#include <iostream>
#include <stdexcept>
#include <thread>
#include <cassert>

struct task{
  struct alignas(64) promise_type {
    task get_return_object() { return {}; }
    std::experimental::suspend_never initial_suspend() { return {}; }
    std::experimental::suspend_never final_suspend() noexcept { return {}; }
    void return_void() {}
    void unhandled_exception() {}
  };
  using handle = std::experimental::coroutine_handle<promise_type>;
};

auto switch_to_new_thread() {
  struct awaitable {
    bool await_ready() { return false; }
    void await_suspend(task::handle h) {
      auto i = reinterpret_cast<std::uintptr_t>(&h.promise());
      std::cout << i << std::endl;
      assert(i % 64 == 0);
    }
    void await_resume() {}
  };
  return awaitable{};
}

task resuming_on_new_thread() {
  co_await switch_to_new_thread();
}

int main() {
  resuming_on_new_thread();
}

The assertion would fail. If this is the root problem, I think we could adjust the align for the promise alloca like:

The problem is that any member of the coroutine frame could be overaligned (thus make the frame overaligned) including promise, local variables, spills. The problem is *not* specific to promise.

%promise = alloca %promise_type, align 8

into

%promise = alloca %promise_type, align 128

In other words, if this the problem we need to solve, I think we could make it in a simpler way.

This may not fix the problem.

Then I looked into the document you give in the summary. The issue#3 says the frontend can't do some work in the process of template instantiation due to the frontend doesn't know about the align and size of the coroutine. But from the implementation, it looks like not the problem this patch wants to solve.

I meant to use that as a reference to help describe the problem (but not the solution). The document itself includes both problem statements (issue#3) and solutions (frontend-based) which are totally unrelated to this patch. It looks like it is not that useful in this case so please disregard that.

I am really confused about the problem. Could you please restate your problem more in more detail? For example, would it make the alignment incorrect like the example above? Or does we want the frontend to get alignment information? Then what would be affected? From the title, I can guess the size of frame would get bigger. But how big would it be? Who would control and determine the final size?

understood.

There are two kinds of alignments: the alignment of a type/object at compile-time (ABI specified or user-specified), and the alignment the object of that type actually gets during runtime. The compiler assumes that the alignment of a struct is the maximal alignment of all its members. However, that assumption may not be true at runtime where the memory allocator may return a memory block that has insufficient alignment which causes some members aligned incorrectly.

For C++ coroutine, right now the default memory allocator could only return 16 bytes aligned memory block. When any member of the coroutine frame (promise, local variables, spills etc.) has alignment > 16, the frame becomes overaligned. This could only be fixed dynamically at runtime: by over-allocating memory and then adjust the frame start address so that it aligns correctly.

For example, suppose malloc returns 16 bytes aligned address 16, how do we make it 64 bytes aligned? align 16 up to an address that is 64 bytes aligned which is 64, so the adjustment amount is 64-16=48

Another similar example, suppose malloc returns 16 bytes aligned address 32, how do we make it 64 bytes aligned? align 32 up to an address that is 64 bytes aligned which is 64, so the adjustment amount is 64-32=32

Another similar example, suppose malloc returns 16 bytes aligned address 48, how do we make it 64 bytes aligned? align 48 up to an address that is 64 bytes aligned which is 64, so the adjustment amount is 64-48=16

Another similar example, suppose malloc returns 16 bytes aligned address 64, how do we make it 64 bytes aligned? align 64 up to an address that is 64 bytes aligned which is 64, so the adjustment amount is 64-64=0

So the mamximal adjustment amount is 64-16=48. We don't know until runtime if the malloc returned address X is (X % 64 == 0) or (X % 64 == 16) or (X % 64 == 32) or (X % 64 == 48), so we must emit extra code to deal with all cases (by bitwise operations).

Thanks for the explanation. I think I got the problem now. And my understanding for the solution of this patch is, if the align of the original frame is 64, then we allocate (64+48) space to the new frame now. And the original frame becomes an alloca with align 64 in the new frame. So the actually frame gets the right alignment now. Do I get the problem and solution right?

I am wondering if there is a simpler solution. For example, after we construct the frame, we can get the alignment requirement for the frame. Then if the alignment is bigger than 16, we could lower the coro.begin to aligned new instead of default new. It looks like that the implementation would be much simpler.

Yes, if you can dynamically choose to use an aligned allocator, that's clearly just much better.

Yes, if you can dynamically choose to use an aligned allocator, that's clearly just much better.

Right now:

Intrinsic::coro_size_aligned : overaligned frame: over-allocate, adjust start address; non-overaligned frame: no-op
Intrinsic::coro_size : overaligned frame: no-op; non-overaligned frame: no-op

Do you mean to remove Intrinsic::coro_size_aligned and make
Intrinsic::coro_size : overaligned frame: over-allocate, adjust start address; non-overaligned frame: no-op

that makes sense to me. Just want to confirm first.

Yes, if you can dynamically choose to use an aligned allocator, that's clearly just much better.

Right now:

Intrinsic::coro_size_aligned : overaligned frame: over-allocate, adjust start address; non-overaligned frame: no-op
Intrinsic::coro_size : overaligned frame: no-op; non-overaligned frame: no-op

Do you mean to remove Intrinsic::coro_size_aligned and make
Intrinsic::coro_size : overaligned frame: over-allocate, adjust start address; non-overaligned frame: no-op

that makes sense to me. Just want to confirm first.

That's not really what I meant, no. I meant it would be better to find a way to use an allocator that promises to return a well-aligned value when possible. We've talked about this before; that will require the C++ committee to update the design.

I think the cleanest design for coro_size/align would be that they reflect the unadjusted requirements, and the frontend is expected to emit code which satisfies those requirements. In the absence of an aligned allocator, that means generating code like:

if (llvm.coro.alloc()) {
  size_t size = llvm.coro.size(), align = llvm.coro.align();
  if (align > NEW_ALIGN)
    size += align - NEW_ALIGN + sizeof(void*);
  frame = operator new(size);
  if (align > NEW_ALIGN) {
    auto rawFrame = frame;
    frame = (frame + align - 1) & ~(align - 1);
    *(void**) (frame + size) = rawFrame;
  }
}

and then on the deallocation side:

if (llvm.coro.alloc()) {
  size_t size = llvm.coro.size(), align = llvm.coro.align();
  if (align > NEW_ALIGN)
    frame = *(void**) (frame + size);
  operator delete(frame);
}

That's all quite annoying, but it does extend quite nicely to cover the presence of an aligned allocator when the committee gets around to ratifying that this is what should happen:

if (llvm.coro.alloc()) {
  size_t size = llvm.coro.size(), align = llvm.coro.align();
  if (align > NEW_ALIGN)
    frame = operator new(std::align_val_t(align), size);
  else
    frame = operator new(size);
}

and then on the deallocation side:

if (llvm.coro.alloc()) {
  size_t size = llvm.coro.size(), align = llvm.coro.align();
  if (align > NEW_ALIGN)
    operator delete(frame, std::align_val_t(align));
  else
    operator delete(frame);
}
ychen added a comment.Apr 26 2021, 3:56 PM

Yes, if you can dynamically choose to use an aligned allocator, that's clearly just much better.

Right now:

Intrinsic::coro_size_aligned : overaligned frame: over-allocate, adjust start address; non-overaligned frame: no-op
Intrinsic::coro_size : overaligned frame: no-op; non-overaligned frame: no-op

Do you mean to remove Intrinsic::coro_size_aligned and make
Intrinsic::coro_size : overaligned frame: over-allocate, adjust start address; non-overaligned frame: no-op

that makes sense to me. Just want to confirm first.

That's not really what I meant, no. I meant it would be better to find a way to use an allocator that promises to return a well-aligned value when possible. We've talked about this before; that will require the C++ committee to update the design.

I think the cleanest design for coro_size/align would be that they reflect the unadjusted requirements, and the frontend is expected to emit code which satisfies those requirements. In the absence of an aligned allocator, that means generating code like:

if (llvm.coro.alloc()) {
  size_t size = llvm.coro.size(), align = llvm.coro.align();
  if (align > NEW_ALIGN)
    size += align - NEW_ALIGN + sizeof(void*);
  frame = operator new(size);
  if (align > NEW_ALIGN) {
    auto rawFrame = frame;
    frame = (frame + align - 1) & ~(align - 1);
    *(void**) (frame + size) = rawFrame;
  }
}

and then on the deallocation side:

if (llvm.coro.alloc()) {
  size_t size = llvm.coro.size(), align = llvm.coro.align();
  if (align > NEW_ALIGN)
    frame = *(void**) (frame + size);
  operator delete(frame);
}

That's all quite annoying, but it does extend quite nicely to cover the presence of an aligned allocator when the committee gets around to ratifying that this is what should happen:

if (llvm.coro.alloc()) {
  size_t size = llvm.coro.size(), align = llvm.coro.align();
  if (align > NEW_ALIGN)
    frame = operator new(std::align_val_t(align), size);
  else
    frame = operator new(size);
}

and then on the deallocation side:

if (llvm.coro.alloc()) {
  size_t size = llvm.coro.size(), align = llvm.coro.align();
  if (align > NEW_ALIGN)
    operator delete(frame, std::align_val_t(align));
  else
    operator delete(frame);
}

*(void**) (frame + size) = rawFrame; this means we always need the extra space to store the raw frame ptr. If either doing what the patch is currently doing or add another intrinsic say "llvm.coro.raw.frame.ptr.index" to do *(void**) (frame + llvm.coro.raw.frame.ptr.index()) = rawFrame;, it is likely that the extra pointer could reuse some existing paddings in the frame. There is an example of this in https://reviews.llvm.org/P8260. What do you think?

ychen added a comment.Apr 26 2021, 7:13 PM

This is an alternative to D97915 which missed proper deallocation of the over-allocated frame. This patch handles both allocations and deallocations.

If D97915 is not needed, we should abandon it.

For the example shows in D97915, it says:

#include <experimental/coroutine>
#include <iostream>
#include <stdexcept>
#include <thread>
#include <cassert>

struct task{
  struct alignas(64) promise_type {
    task get_return_object() { return {}; }
    std::experimental::suspend_never initial_suspend() { return {}; }
    std::experimental::suspend_never final_suspend() noexcept { return {}; }
    void return_void() {}
    void unhandled_exception() {}
  };
  using handle = std::experimental::coroutine_handle<promise_type>;
};

auto switch_to_new_thread() {
  struct awaitable {
    bool await_ready() { return false; }
    void await_suspend(task::handle h) {
      auto i = reinterpret_cast<std::uintptr_t>(&h.promise());
      std::cout << i << std::endl;
      assert(i % 64 == 0);
    }
    void await_resume() {}
  };
  return awaitable{};
}

task resuming_on_new_thread() {
  co_await switch_to_new_thread();
}

int main() {
  resuming_on_new_thread();
}

The assertion would fail. If this is the root problem, I think we could adjust the align for the promise alloca like:

The problem is that any member of the coroutine frame could be overaligned (thus make the frame overaligned) including promise, local variables, spills. The problem is *not* specific to promise.

%promise = alloca %promise_type, align 8

into

%promise = alloca %promise_type, align 128

In other words, if this the problem we need to solve, I think we could make it in a simpler way.

This may not fix the problem.

Then I looked into the document you give in the summary. The issue#3 says the frontend can't do some work in the process of template instantiation due to the frontend doesn't know about the align and size of the coroutine. But from the implementation, it looks like not the problem this patch wants to solve.

I meant to use that as a reference to help describe the problem (but not the solution). The document itself includes both problem statements (issue#3) and solutions (frontend-based) which are totally unrelated to this patch. It looks like it is not that useful in this case so please disregard that.

I am really confused about the problem. Could you please restate your problem more in more detail? For example, would it make the alignment incorrect like the example above? Or does we want the frontend to get alignment information? Then what would be affected? From the title, I can guess the size of frame would get bigger. But how big would it be? Who would control and determine the final size?

understood.

There are two kinds of alignments: the alignment of a type/object at compile-time (ABI specified or user-specified), and the alignment the object of that type actually gets during runtime. The compiler assumes that the alignment of a struct is the maximal alignment of all its members. However, that assumption may not be true at runtime where the memory allocator may return a memory block that has insufficient alignment which causes some members aligned incorrectly.

For C++ coroutine, right now the default memory allocator could only return 16 bytes aligned memory block. When any member of the coroutine frame (promise, local variables, spills etc.) has alignment > 16, the frame becomes overaligned. This could only be fixed dynamically at runtime: by over-allocating memory and then adjust the frame start address so that it aligns correctly.

For example, suppose malloc returns 16 bytes aligned address 16, how do we make it 64 bytes aligned? align 16 up to an address that is 64 bytes aligned which is 64, so the adjustment amount is 64-16=48

Another similar example, suppose malloc returns 16 bytes aligned address 32, how do we make it 64 bytes aligned? align 32 up to an address that is 64 bytes aligned which is 64, so the adjustment amount is 64-32=32

Another similar example, suppose malloc returns 16 bytes aligned address 48, how do we make it 64 bytes aligned? align 48 up to an address that is 64 bytes aligned which is 64, so the adjustment amount is 64-48=16

Another similar example, suppose malloc returns 16 bytes aligned address 64, how do we make it 64 bytes aligned? align 64 up to an address that is 64 bytes aligned which is 64, so the adjustment amount is 64-64=0

So the mamximal adjustment amount is 64-16=48. We don't know until runtime if the malloc returned address X is (X % 64 == 0) or (X % 64 == 16) or (X % 64 == 32) or (X % 64 == 48), so we must emit extra code to deal with all cases (by bitwise operations).

Thanks for the explanation. I think I got the problem now. And my understanding for the solution of this patch is, if the align of the original frame is 64, then we allocate (64+48) space to the new frame now. And the original frame becomes an alloca with align 64 in the new frame. So the actually frame gets the right alignment now. Do I get the problem and solution right?

I am wondering if there is a simpler solution. For example, after we construct the frame, we can get the alignment requirement for the frame. Then if the alignment is bigger than 16, we could lower the coro.begin to aligned new instead of default new. It looks like that the implementation would be much simpler.

Oh, right now C++ coroutine standard is written in the way that the aligned new is not searched by the frontend. The limitation will be lifted in C++23 (hopefully).

That's not really what I meant, no. I meant it would be better to find a way to use an allocator that promises to return a well-aligned value when possible. We've talked about this before; that will require the C++ committee to update the design.

I had a question. Does this mean we can't use aligned allocator until the C++ committee update the wording? For example, I know Clang/LLVM implement coroutine before it becomes the standard. I mean is it possible to use aligned-allocator to solve the problem here?

Oh, right now C++ coroutine standard is written in the way that the aligned new is not searched by the frontend. The limitation will be lifted in C++23 (hopefully).

I see. I am curious about the relationship of compiler implementation and language standard now. For example, Clang/LLVM implement coroutine before it becomes standard. The point I curious about is that should Clang/LLVM implement based on proposals accepted only?

Here are the options I think the committee might take:

  1. Always select an unaligned allocator and force implementors to dynamically align. This seems unlikely to me.
  2. Allow an aligned allocator to be selected. The issue here is that we cannot know until coroutine splitting whether the frame has a new-extended alignment. So there are some sub-options:

2a. Always use an aligned allocator if available, even if the frame ends up not being aligned. I think this is unlikely.
2b. Use the correct allocator for the frame alignment, using the alignment inferred from the immediate coroutine body. This would force implementations to avoid doing anything prior to coroutine splitting that would increase the frame's alignment beyond the derived limit. This would be quite annoying for implementors, and it would have strange performance cliffs, but it's theoretically simple.
2c. Use the correct allocator for the frame alignment; only that allocator is formally ODR-used. This would force implementations to not ODR-use the right allocator until coroutine lowering, which is not really reasonable; we should be able to dissuade the committee from picking this option.
2d. Use the correct allocator for the frame alignment; both allocators are (allowed to be) ODR-used, but only one would be dynamically used. This is what would be necessary for the implementation I suggested above. In reality there won't be any dynamic overhead because we should always be able to fold the branch after allocation.

I think 2d is obvious enough that we could go ahead and implement it pending standardization. There's still a question of what to do if the promise class only provides an unaligned operator new, but the only reasonable behavior is to dynamically align: unlike with new expressions, there's no way the promise class could be expected to know/satisfy the appropriate alignment requirement in general, so assuming alignment is not acceptable. (Neither is rejecting the program if it doesn't provide both — this wouldn't be compatible with existing behavior.)

*(void) (frame + size) = rawFrame; this means we always need the extra space to store the raw frame ptr. If either doing what the patch is currently doing or add another intrinsic say "llvm.coro.raw.frame.ptr.index" to do *(void) (frame + llvm.coro.raw.frame.ptr.index()) = rawFrame;, it is likely that the extra pointer could reuse some existing paddings in the frame. There is an example of this in https://reviews.llvm.org/P8260. What do you think?

You're right that there might be space in the frame we could re-use. I was thinking that it would be a shame to always add storage to the frame for the raw frame pointer, but maybe the contract of llvm.coro.raw.frame.ptr.offset could be that it's only meaningful if the frame has extended alignment. Coroutine splitting would determine if any of the frame members was over-aligned and add a raw-pointer field if so. We'd be stuck allocating space in the frame even when allocation was elided, but stack space is cheap.

It does need to be an offset instead of a type index, though; the frontend will be emitting a GEP and will not know the frame type.

ychen added a comment.Apr 26 2021, 9:59 PM

Here are the options I think the committee might take:

  1. Always select an unaligned allocator and force implementors to dynamically align. This seems unlikely to me.
  2. Allow an aligned allocator to be selected. The issue here is that we cannot know until coroutine splitting whether the frame has a new-extended alignment. So there are some sub-options:

2a. Always use an aligned allocator if available, even if the frame ends up not being aligned. I think this is unlikely.
2b. Use the correct allocator for the frame alignment, using the alignment inferred from the immediate coroutine body. This would force implementations to avoid doing anything prior to coroutine splitting that would increase the frame's alignment beyond the derived limit. This would be quite annoying for implementors, and it would have strange performance cliffs, but it's theoretically simple.
2c. Use the correct allocator for the frame alignment; only that allocator is formally ODR-used. This would force implementations to not ODR-use the right allocator until coroutine lowering, which is not really reasonable; we should be able to dissuade the committee from picking this option.
2d. Use the correct allocator for the frame alignment; both allocators are (allowed to be) ODR-used, but only one would be dynamically used. This is what would be necessary for the implementation I suggested above. In reality there won't be any dynamic overhead because we should always be able to fold the branch after allocation.

I think 2d is obvious enough that we could go ahead and implement it pending standardization. There's still a question of what to do if the promise class only provides an unaligned operator new, but the only reasonable behavior is to dynamically align: unlike with new expressions, there's no way the promise class could be expected to know/satisfy the appropriate alignment requirement in general, so assuming alignment is not acceptable. (Neither is rejecting the program if it doesn't provide both — this wouldn't be compatible with existing behavior.)

*(void) (frame + size) = rawFrame; this means we always need the extra space to store the raw frame ptr. If either doing what the patch is currently doing or add another intrinsic say "llvm.coro.raw.frame.ptr.index" to do *(void) (frame + llvm.coro.raw.frame.ptr.index()) = rawFrame;, it is likely that the extra pointer could reuse some existing paddings in the frame. There is an example of this in https://reviews.llvm.org/P8260. What do you think?

You're right that there might be space in the frame we could re-use. I was thinking that it would be a shame to always add storage to the frame for the raw frame pointer, but maybe the contract of llvm.coro.raw.frame.ptr.offset could be that it's only meaningful if the frame has extended alignment. Coroutine splitting would determine if any of the frame members was over-aligned and add a raw-pointer field if so. We'd be stuck allocating space in the frame even when allocation was elided, but stack space is cheap.

It does need to be an offset instead of a type index, though; the frontend will be emitting a GEP and will not know the frame type.

Sounds good to me. Thanks. I'll go ahead with llvm.coro.raw.frame.ptr.offset.

ychen abandoned this revision.Apr 29 2021, 12:09 AM

Reopened D97915 to address the feedbacks. Close this one.

ychen added a comment.Apr 29 2021, 8:25 PM

Oh, right now C++ coroutine standard is written in the way that the aligned new is not searched by the frontend. The limitation will be lifted in C++23 (hopefully).

I see. I am curious about the relationship of compiler implementation and language standard now. For example, Clang/LLVM implement coroutine before it becomes standard. The point I curious about is that should Clang/LLVM implement based on proposals accepted only?

Not a C++ language expert myself. But I think a proposal does not have to be formally accepted to kick-start the implementation (as long as the overall design is decided and the proposal is very likely to be accepted).

Oh, right now C++ coroutine standard is written in the way that the aligned new is not searched by the frontend. The limitation will be lifted in C++23 (hopefully).

I see. I am curious about the relationship of compiler implementation and language standard now. For example, Clang/LLVM implement coroutine before it becomes standard. The point I curious about is that should Clang/LLVM implement based on proposals accepted only?

Not a C++ language expert myself. But I think a proposal does not have to be formally accepted to kick-start the implementation (as long as the overall design is decided and the proposal is very likely to be accepted).

I see. I am still prefer to use the aligned allocator to solve the problems, although you and other reviewers prefer to use the over aligned frame.

Oh, right now C++ coroutine standard is written in the way that the aligned new is not searched by the frontend. The limitation will be lifted in C++23 (hopefully).

I see. I am curious about the relationship of compiler implementation and language standard now. For example, Clang/LLVM implement coroutine before it becomes standard. The point I curious about is that should Clang/LLVM implement based on proposals accepted only?

Not a C++ language expert myself. But I think a proposal does not have to be formally accepted to kick-start the implementation (as long as the overall design is decided and the proposal is very likely to be accepted).

I see. I am still prefer to use the aligned allocator to solve the problems, although you and other reviewers prefer to use the over aligned frame.

It will be a hybrid of both. Since when a non-aligned version is picked by the frontend due to the aligned version not available, we still have to use overaligend frame. I'll send a separate patch for the aligned allocator searching.