[Coroutines] Handle overaligned frame allocation
Needs Review, Public

Authored by ychen on Mar 3 2021, 11:21 PM.

Details

Summary

Handle overaligned frame allocation by over-allocating and emitting alignTo code to adjust the frame start address.

Motivation: on many machines, malloc returns heap memory aligned only to alignof(std::max_align_t) (usually just 16), regardless of the coro frame's preferred alignment (usually specified using alignas() on the promise or on some local variables). Outside of coroutines, this is handled by calling an overloaded operator new that takes an alignment argument. For coroutines, the spec at https://eel.is/c++draft/dcl.fct.def.coroutine#9.1 suggests that the alignment argument is not considered during name lookup.
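A minimal sketch of the mismatch described above, assuming a C++17 compiler that defines __STDCPP_DEFAULT_NEW_ALIGNMENT__. OveralignedPromise, needs_overaligned_allocation and is_aligned are hypothetical names used for illustration only; they are not part of the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical promise type, over-aligned with alignas() as in the summary.
struct alignas(64) OveralignedPromise {
  char payload[64];
};

// True when T needs more alignment than plain operator new(size_t) guarantees.
template <typename T>
constexpr bool needs_overaligned_allocation() {
  return alignof(T) > __STDCPP_DEFAULT_NEW_ALIGNMENT__;
}

// True when p is a multiple of align. align must be a power of two.
inline bool is_aligned(const void *p, std::size_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  return (reinterpret_cast<std::uintptr_t>(p) & (align - 1)) == 0;
}
```

A frame holding such a promise would need needs_overaligned_allocation to be false, or some extra adjustment, before memory from the unaligned operator new can be used safely.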

Mathias Stearn and @lewissbaker suggested this is the proper workaround before the issue is addressed by the spec.

One example showing the issue: https://gcc.godbolt.org/z/rGzaco

Diff Detail

Event Timeline

ychen edited the summary of this revision.Mar 4 2021, 10:34 AM
  • Add docs for coro.align
  • clang-format
ychen added a comment.Mar 4 2021, 10:36 AM

Could you describe in more detail what problem this patch solves?

Yes, updated description.

rjmccall added inline comments.Mar 4 2021, 1:04 PM
clang/lib/CodeGen/CGBuiltin.cpp
4450

Okay, so you're implicitly increasing the coroutine size to allow you to round up to get an aligned frame. But how do you round back down to get the actual pointer that you need to delete? This just doesn't work.

You really ought to just be using the aligned operator new instead when the required alignment is too high. If that means checking the alignment "dynamically" before calling operator new / operator delete, so be it. In practice, it will not be dynamic because lowering will replace the coro.align call with a constant, at which point the branch will be foldable.

I don't know what to suggest if the aligned operator new isn't reliably available on the target OS. You could outline a function to pick the best allocator/deallocator, I suppose.

ychen added inline comments.Mar 4 2021, 1:26 PM
clang/lib/CodeGen/CGBuiltin.cpp
4450

Thanks for the review!

Okay, so you're implicitly increasing the coroutine size to allow you to round up to get an aligned frame. But how do you round back down to get the actual pointer that you need to delete? This just doesn't work.

Hmm, you're right that I missed the delete part, thanks. The adjusted amount is constant, I could just dial it down in coro.free, right?

You really ought to just be using the aligned operator new instead when the required alignment is too high. If that means checking the alignment "dynamically" before calling operator new / operator delete, so be it. In practice, it will not be dynamic because lowering will replace the coro.align call with a constant, at which point the branch will be foldable.

That was my intuition at first. But the spec is written in a way that suggests (IMHO) the aligned version should not be used. And if users specify their own allocator, which one should be used?

ychen added inline comments.Mar 4 2021, 1:31 PM
clang/lib/CodeGen/CGBuiltin.cpp
4450

Sorry, I meant the adjusted amount is coro.align - alignof(std::max_align_t), which I could subtract in coro.free. I think that should work?

rjmccall added inline comments.Mar 4 2021, 2:59 PM
clang/lib/CodeGen/CGBuiltin.cpp
4450

No, because the adjustment you have to do in coro.alloc isn't just an addition, it's an addition plus a mask, which isn't reversible. Suppose the frame needs to be 32-byte-aligned, but the allocator only promises 8-byte alignment. The problem is that when you go to free a frame pointer, and you see that it's 32-byte-aligned (which, again, it always will be), the pointer you got from the allocator might be the frame pointer minus any of 8, 16, or 24 — or it might be exactly the same. The only way to reverse that is to store some sort of cookie, either the amount to subtract or even just the original pointer.
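The non-reversibility can be seen in a small sketch. align_up is a hypothetical helper that performs the "addition plus a mask" described above; the addresses in the usage note are made-up values, not anything an allocator is known to return.

```cpp
#include <cassert>
#include <cstdint>

// Round addr up to the next multiple of align (a power of two):
// an addition followed by a mask.
inline std::uintptr_t align_up(std::uintptr_t addr, std::uintptr_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  return (addr + align - 1) & ~(align - 1);
}
```

With a 32-byte frame alignment and an allocator that only promises 8 bytes, the raw addresses 0x1008, 0x1010, 0x1018 and 0x1020 all round up to 0x1020, so the aligned frame pointer alone cannot tell which raw pointer to pass to operator delete.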

Now, if you could change the entire coroutine ABI, you could make the frame handle that you pass around be the unadjusted pointer and then just repeat the adjustment every time you enter the coroutine. But that doesn't work because the ABI relies on things like the promise being at a reliable offset from the frame handle.

I think the best solution would be to figure out a way to use an aligned allocator, which at worst does this in a more systematic way and at best can actually just satisfy your requirement directly without any overhead. If you can't do that, adding an offset to the frame would be best; if you can't do that, doing it as a cookie is okay.

That was my intuition at first. But the spec is written in a way that suggests (IMHO) the aligned version should not be used. And if users specify their own allocator, which one should be used?

It seems like a spec bug that this doesn't use aligned allocators even when they're available. If there's an aligned allocator available, I think this should essentially do dynamically what it would normally do statically, i.e.:

void *allocation = alignment > __STDCPP_DEFAULT_NEW_ALIGNMENT__ ? operator new(size, align_val_t(alignment)) : operator new(size);

This would ODR-use both allocation functions, of course.

Maybe it's right to do this cookie thing if we can't rely on an aligned allocator, like if the promise class provides only an operator new(size_t).
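A sketch of the dynamic selection suggested above, with the matching deallocation added. It assumes the global aligned operator new/delete from C++17 are available; allocate_frame and deallocate_frame are hypothetical names, and this is not what the patch emits.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Pick the aligned allocation function only when the required alignment
// exceeds what plain operator new(size_t) guarantees.
inline void *allocate_frame(std::size_t size, std::size_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return alignment > __STDCPP_DEFAULT_NEW_ALIGNMENT__
             ? ::operator new(size, std::align_val_t(alignment))
             : ::operator new(size);
}

// The deallocation must repeat the same test so that the matching
// operator delete is called.
inline void deallocate_frame(void *p, std::size_t alignment) {
  if (alignment > __STDCPP_DEFAULT_NEW_ALIGNMENT__)
    ::operator delete(p, std::align_val_t(alignment));
  else
    ::operator delete(p);
}
```

Once the alignment is a compile-time constant, as it would be after coro.align is lowered, both branches can be folded away.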

ychen added inline comments.Mar 4 2021, 5:22 PM
clang/lib/CodeGen/CGBuiltin.cpp
4450

No, because the adjustment you have to do in coro.alloc isn't just an addition, it's an addition plus a mask, which isn't reversible. Suppose the frame needs to be 32-byte-aligned, but the allocator only promises 8-byte alignment. The problem is that when you go to free a frame pointer, and you see that it's 32-byte-aligned (which, again, it always will be), the pointer you got from the allocator might be the frame pointer minus any of 8, 16, or 24 — or it might be exactly the same. The only way to reverse that is to store some sort of cookie, either the amount to subtract or even just the original pointer.

I got myself confused. This makes perfect sense.

Now, if you could change the entire coroutine ABI, you could make the frame handle that you pass around be the unadjusted pointer and then just repeat the adjustment every time you enter the coroutine. But that doesn't work because the ABI relies on things like the promise being at a reliable offset from the frame handle.

I think the best solution would be to figure out a way to use an aligned allocator, which at worst does this in a more systematic way and at best can actually just satisfy your requirement directly without any overhead. If you can't do that, adding an offset to the frame would be best; if you can't do that, doing it as a cookie is okay.

This is very helpful. I'll explore the option of adding an offset to the frame first. If that is not feasible, I'll use the cookie method. Thanks!

I am a little confused about the problem. The example in the link shows the alignment of the promise rather than of the frame. The addresses of the promise and the frame are not the same. It looks like you're trying to do:

+               +-----------------------------------+
|               |                                   |
+---------------+          frame                    |
| padding       |                                   |
+               +-----------------------------------+
                ^
                |
                |
                |
                |
                |
                +

              The address of frame matches the offset of promise.

However, what we should do is:

+               +-----------------------------------+
|               |       +--------------+            |
+---------------+frame  | promise      |            |
| padding       |       <--------------+            |
+               +-----------------------------------+
                ^       |
                |       |
                |       |
                |       |
                |       +
                |       This is what we really want
                +

              The address of frame matches the offset of promise.

If I understand the problem correctly, I think we can handle it in the middle end, provided the information about the promise is preserved.

clang/lib/CodeGen/CGBuiltin.cpp
16771–16772

Why do we remove the anonymous namespace here?

ychen added a comment.Mar 5 2021, 10:06 AM

I am a little confused about the problem. The example in the link shows the alignment of the promise rather than of the frame. The addresses of the promise and the frame are not the same. It looks like you're trying to do:

+               +-----------------------------------+
|               |                                   |
+---------------+          frame                    |
| padding       |                                   |
+               +-----------------------------------+
                ^
                |
                |
                |
                |
                |
                +

              The address of frame matches the offset of promise.

However, what we should do is:

+               +-----------------------------------+
|               |       +--------------+            |
+---------------+frame  | promise      |            |
| padding       |       <--------------+            |
+               +-----------------------------------+
                ^       |
                |       |
                |       |
                |       |
                |       +
                |       This is what we really want
                +

              The address of frame matches the offset of promise.

If I understand the problem correctly, I think we can handle it in the middle end, provided the information about the promise is preserved.

Not sure I follow. Inside the frame, the promise is in its desired position. It is not properly aligned because the frame start address is underaligned: malloc usually returns only 16-byte-aligned memory, whereas alignas can make the preferred alignment larger than that.

clang/lib/CodeGen/CGBuiltin.cpp
16771–16772

I added a common/helper function that takes BuiltinAlignArgs as an argument. Need to move it out of the anonymous namespace to forward declare it.

Let's try to avoid adding a new builtin for what we acknowledge is a workaround. Builtins become part of the language supported by the compiler, so we shouldn't add them casually.

ychen added a comment.Mar 5 2021, 12:28 PM

Let's try to avoid adding a new builtin for what we acknowledge is a workaround. Builtins become part of the language supported by the compiler, so we shouldn't add them casually.

If we're going to use the aligned new in the future, do we still need this builtin, or is something else preferred?

Let's try to avoid adding a new builtin for what we acknowledge is a workaround. Builtins become part of the language supported by the compiler, so we shouldn't add them casually.

If we're going to use the aligned new in the future, do we still need this builtin, or is something else preferred?

Oh, sorry, for some reason I got the impression from the patch that we were adding a new Clang-level builtin. Adding a new LLVM intrinsic seems reasonable to me.

In any case, I don't think we should expose BuiltinAlignArgs outside of CGBuiltin.cpp. Seems like at most we need to add a convenience function on CGBuilderTy to do a pointer round-up-to-alignment operation.

lewissbaker added inline comments.Mar 5 2021, 5:41 PM
clang/lib/CodeGen/CGBuiltin.cpp
4450

There was a proposal to extend the coroutine specification with support for the align_val_t overloads of operator new() when allocating coroutine frames.
See http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2020/p2014r0.pdf

Unfortunately this was not adopted at the time as it was proposed late in the C++20 cycle and there was not yet any implementation experience.

So for now, if the compiler determines that the frame needs to be aligned to a value greater than the default alignment of global operator new, it will need to overallocate, align the frame within that buffer, and store the applied offset somewhere so that it can reconstruct the pointer returned from operator new() and pass it to operator delete() on coroutine_handle::destroy().
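The over-allocate, align and remember steps can be sketched as follows. This is only an illustration under stated assumptions: overaligned_alloc and overaligned_free are hypothetical names, and the cookie (here the original pointer rather than an offset) is stored immediately before the aligned block, which is a choice made for this sketch and not necessarily where the compiler would keep it. The sketch also reserves the worst-case align - 1 bytes because it assumes nothing about the alignment operator new provides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <new>

// Return size bytes aligned to align (a power of two) using only the
// unaligned operator new(size_t).
inline void *overaligned_alloc(std::size_t size, std::size_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  // Room for the payload, the cookie, and the worst-case adjustment.
  const std::size_t total = size + sizeof(void *) + (align - 1);
  void *raw = ::operator new(total);
  // Leave space for the cookie, then round up to the requested alignment.
  const std::uintptr_t start =
      reinterpret_cast<std::uintptr_t>(raw) + sizeof(void *);
  const std::uintptr_t mask = static_cast<std::uintptr_t>(align) - 1;
  void *aligned = reinterpret_cast<void *>((start + mask) & ~mask);
  // Cookie: remember the pointer that operator new actually returned.
  std::memcpy(static_cast<char *>(aligned) - sizeof(void *), &raw,
              sizeof(void *));
  return aligned;
}

// Recover the original pointer from the cookie and release it.
inline void overaligned_free(void *aligned) {
  void *raw = nullptr;
  std::memcpy(&raw, static_cast<char *>(aligned) - sizeof(void *),
              sizeof(void *));
  ::operator delete(raw);
}
```

Storing the original pointer instead of the offset costs the same space on most targets and avoids a subtraction when freeing.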

Note that there was also a meeting to discuss ABI for coroutine frames with the intent that the major coroutines implementations would all (eventually) agree on a compatible coroutine ABI.

The results of this meeting were written up in the doc https://docs.google.com/document/d/1t53lAuQNd3rPqN-VByZabwL6PL2Oyl4zdJxm-gQlhfU/edit?usp=sharing

The end result was that we decided to place any padding needed to align the promise before the resume/destroy function pointers rather than place that padding in-between the function-pointers and the promise. The rationale here being that we can then calculate the address of the promise as a constant offset from the frame address (typically at an offset of two pointers into the frame) rather than the offset being variable depending on the promise type's alignment. This should help building of certain tooling / debuggers / walking async stack-traces etc. as we don't need to know the exact promise type to be able to determine the location of the promise.
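A rough model of that layout, to make the constant promise offset concrete. Everything here is an assumption for illustration: kHeaderSize, leading_padding and promise_address are hypothetical names, the header is taken to be exactly two function pointers, and the allocation start is assumed to be aligned to the promise's alignment already. It models the layout agreed in that discussion, not a confirmed description of any compiler's frame.

```cpp
#include <cassert>
#include <cstddef>

// Assumed header: the resume and destroy function pointers.
constexpr std::size_t kHeaderSize = 2 * sizeof(void (*)(void *));

// Padding placed BEFORE the header so that the promise, which directly
// follows the header, lands on a promise_align boundary.
inline std::size_t leading_padding(std::size_t promise_align) {
  assert(promise_align != 0 && (promise_align & (promise_align - 1)) == 0);
  return (promise_align - kHeaderSize % promise_align) % promise_align;
}

// With the padding in front, the promise sits at a fixed offset from the
// frame address, whatever the promise type's alignment is.
inline void *promise_address(void *frame) {
  return static_cast<char *>(frame) + kHeaderSize;
}
```

Under these assumptions a tool can find the promise from a frame address without knowing the promise type, which is the rationale given above.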

The compiler should know exactly how many bytes of padding were added at the start of the frame allocation to get to the frame address and so should be able to translate the coroutine frame address back to the allocation address before destruction. However, this may also interact with the support for overaligned frames (which may be required due to overaligned local variables/args and not only based on the promise type), so I'm mentioning it here.

rjmccall added inline comments.Mar 5 2021, 7:15 PM
clang/lib/CodeGen/CGBuiltin.cpp
4450

Note that there was also a meeting to discuss ABI for coroutine frames with the intent that the major coroutines implementations would all (eventually) agree on a compatible coroutine ABI.

Interesting! Did you consider reaching out to the Itanium C++ ABI group, which has prior expertise in the area of standardizing C++ ABIs?

The end result was that we decided to place any padding needed to align the promise before the resume/destroy function pointers rather than place that padding in-between the function-pointers and the promise.

That is an interesting choice. Is deriving the address of the promise within the frame without knowing what the promise type is actually something that clients need to do? It's not like coroutines carry any reflective information of the sort that exceptions do.

Anyway, okay. So the function pointers are supposed to be "right-justified" so that the promise comes immediately afterwards, and the address point is supposed to point at the first function pointer. That is not what Clang implements, or has ever implemented, but I don't foresee any serious problems in adjusting the LLVM coroutine frame layout code to honor that.

The compiler should know exactly how many bytes of padding were added at the start of the frame allocation to get to the frame address and so should be able to translate the coroutine frame address back to the allocation address before destruction. However, this may also interact with the support for overaligned frames (which may be required due to overaligned local variables/args and not only based on the promise type), so I'm mentioning it here.

Yes, this is basically only true of the adjustment you're talking about for the frame header with overaligned promises. Barring miracles, the only reasonable way to allocate one of these overaligned-promise frames is to round down to the next promise-alignment boundary and then allocate that, and that offset is indeed static given only the promise type's alignment. But the larger allocation can still exceed allocator alignment, whether from the promise type or just local coroutine state, and that extra offset down to the allocated pointer will be dynamic. Fortunately, I don't think going from the frame pointer to the allocated pointer needs to be an ABI-exposed operation, since the frame can only be destroyed by the coroutine itself. That means the details of allocation are essentially entirely implementation-private, and that includes how we adjust the frame pointer for deallocation. Am I missing something?

Unfortunately this was not adopted at the time as it was proposed late in the C++20 cycle and there was not yet any implementation experience.

Hmm. The role of implementation experience here would have been to point out that you hadn't considered over-alignment in your specification. It sounds more like you were running out of time to write the specification and just punted on an issue in the interests of getting the proposal into the standard. Regardless, it seems to me that the obviously correct design is that, if both an aligned and an unaligned allocation function are available, it's unspecified which one is called, as long as the matching deallocation function is then called later. Are you suggesting that we *must not* do that?

lewissbaker added inline comments.Mar 7 2021, 4:13 PM
clang/lib/CodeGen/CGBuiltin.cpp
4450

Interesting! Did you consider reaching out to the Itanium C++ ABI group, which has prior expertise in the area of standardizing C++ ABIs?

I don't recall if they were contacted, although I believe we did discuss doing so at the time.
Maybe @GorNishanov remembers?

There were LLVM devs (mainly from Google), GCC compiler devs and MS compiler team involved in the discussion.

Is deriving the address of the promise within the frame without knowing what the promise type is actually something that clients need to do?

This is something that I have found a desire to be able to do when implementing async stack trace walking.

At the moment I ended up having to store basically two pointers to the parent coroutine-frame - one a coroutine_handle<void> so I can resume the parent coroutine and another pointer to an AsyncStackFrame stored within the promise so that I can walk to the parent frame.

I would ideally only like to have to store the coroutine_handle and be able to determine from its address the address of the coroutine_handle pointing to the next coroutine-frame.
e.g. by assuming that the promise_type has the continuation as the first data-member.

At the moment I can't make this assumption because the offset from the coroutine_handle::address() to the promise might be variable depending on the concrete promise_type's alignment.

That means the details of allocation are essentially entirely implementation-private, and that includes how we adjust the frame pointer for deallocation. Am I missing something?

I agree with your analysis here.

The role of implementation experience here would have been to point out that you hadn't considered over-alignment in your specification.

Yes, this issue was identified, albeit fairly late, during the standardisation process as a result of implementation experience and raised and discussed.

It sounds more like you were running out of time to write the specification and just punted on an issue in the interests of getting the proposal into the standard

The proposal was already in the standard at the time the issue was identified.

The problem was more that there were a couple of options for how to specify it and we didn't have any implementation experience of either way to inform which way should be chosen. So, yes, we effectively punted the decision until later. My preference is the more complicated design of "Option 1" described in P2014.

it seems to me that the obviously correct design is that, if both an aligned and an unaligned allocation function are available, it's unspecified which one is called, as long as the matching deallocation function is then called later

Yes, I think this is pretty close to what "Option 1" describes, although I think it describes a slightly more involved overload resolution for deallocation functions than just "call the matching deallocation function".

Are you suggesting that we *must not* do that?

The current specification in C++20 only says that operator new(size_t) overloads are called. So a change that caused it to call operator new(size_t, align_val_t) overloads would be a non-conforming extension to C++20, although possibly one that users would be happy to have.

I am not sure how this would work, maybe I am missing something.
But this patch tries to round up the frame pointer by looking at the difference between the alignment of new and the alignment of the frame.
The alignment of new only gives you the guaranteed alignment for new, not necessarily the actual alignment of the returned pointer; e.g. if the alignment of new is 16, the returned pointer can still be a multiple of 32. And that difference matters.

Let's consider a frame that only has the two pointers and a promise with alignment requirement of 64. The alignment of new is 16.
Now you will calculate the difference to be 48, and create a padding of 48 before the frame:
But if the returned pointer from new is actually a multiple of 32 (but not 64), the frame will no longer be aligned to 64, since (32 + 48) % 64 = 16.
So from what I can tell, if we cannot pass alignment to new, we need to look at the address returned by new dynamically to decide the padding.

I am not sure how this would work, maybe I am missing something.
But this patch tries to round up the frame pointer by looking at the difference between the alignment of new and the alignment of the frame.
The alignment of new only gives you the guaranteed alignment for new, not necessarily the actual alignment of the returned pointer; e.g. if the alignment of new is 16, the returned pointer can still be a multiple of 32. And that difference matters.

Let's consider a frame that only has the two pointers and a promise with alignment requirement of 64. The alignment of new is 16.
Now you will calculate the difference to be 48, and create a padding of 48 before the frame:
But if the returned pointer from new is actually a multiple of 32 (but not 64), the frame will no longer be aligned to 64, since (32 + 48) % 64 = 16.

48 is the maximal possible adjustment needed. For this particular case, EmitBuiltinAlignTo would make the real adjustment 32 since (32 + 32) % 64 == 0.

So from what I can tell, if we cannot pass alignment to new, we need to look at the address returned by new dynamically to decide the padding.

Indeed, that's what EmitBuiltinAlignTo is for.
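The numbers in this exchange can be checked with a small sketch. max_adjustment mirrors the extra bytes requested at allocation time and dynamic_adjustment mirrors the round-up performed on the returned address; both names, and align_up, are hypothetical helpers written for this illustration rather than functions from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Round addr up to the next multiple of align (a power of two).
inline std::uintptr_t align_up(std::uintptr_t addr, std::uintptr_t align) {
  assert(align != 0 && (align & (align - 1)) == 0);
  return (addr + align - 1) & ~(align - 1);
}

// Worst-case number of extra bytes to request when the allocator only
// guarantees new_align and the frame needs frame_align.
inline std::uintptr_t max_adjustment(std::uintptr_t frame_align,
                                     std::uintptr_t new_align) {
  return frame_align > new_align ? frame_align - new_align : 0;
}

// Adjustment actually applied, computed from the address that came back.
inline std::uintptr_t dynamic_adjustment(std::uintptr_t addr,
                                         std::uintptr_t frame_align) {
  return align_up(addr, frame_align) - addr;
}
```

For an address such as 0x1020 (a multiple of 32 but not of 64), adding the static 48 leaves a remainder of 16 modulo 64, while the dynamic adjustment is 32 and stays within the 48 bytes that were reserved.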

ychen abandoned this revision.Fri, Apr 23, 3:12 PM

Pursue D100739 instead.

ychen reclaimed this revision.Thu, Apr 29, 12:03 AM
ychen updated this revision to Diff 341418.Thu, Apr 29, 12:04 AM
  • Handle deallocation.
  • Fix tests.

@rjmccall the patch is on the large side. I'll submit a separate patch for the Sema part about searching for two allocators.

ychen planned changes to this revision.Thu, Apr 29, 10:21 AM

Found a bug. Will fix.

ychen updated this revision to Diff 341591.Thu, Apr 29, 11:41 AM
  • fix a bug.

    ready for review.
ychen updated this revision to Diff 341592.Thu, Apr 29, 11:43 AM
  • fix typo

For coroutine f0 in test/CodeGenCoroutines/coro-alloc.cpp

The allocation looks like this:

; Function Attrs: noinline nounwind optnone mustprogress
define dso_local void @f0() #0 {
entry:
  %0 = alloca %struct.global_new_delete_tag, align 1
  %1 = alloca %struct.global_new_delete_tag, align 1
  %__promise = alloca %"struct.std::experimental::coroutine_traits<void, global_new_delete_tag>::promise_type", align 1
  %ref.tmp = alloca %struct.suspend_always, align 1
  %undef.agg.tmp = alloca %struct.suspend_always, align 1
  %agg.tmp = alloca %"struct.std::experimental::coroutine_handle", align 1
  %agg.tmp2 = alloca %"struct.std::experimental::coroutine_handle.0", align 1
  %undef.agg.tmp3 = alloca %"struct.std::experimental::coroutine_handle.0", align 1
  %ref.tmp4 = alloca %struct.suspend_always, align 1
  %undef.agg.tmp5 = alloca %struct.suspend_always, align 1
  %agg.tmp7 = alloca %"struct.std::experimental::coroutine_handle", align 1
  %agg.tmp8 = alloca %"struct.std::experimental::coroutine_handle.0", align 1
  %undef.agg.tmp9 = alloca %"struct.std::experimental::coroutine_handle.0", align 1
  %2 = bitcast %"struct.std::experimental::coroutine_traits<void, global_new_delete_tag>::promise_type"* %__promise to i8*
  %3 = call token @llvm.coro.id(i32 16, i8* %2, i8* null, i8* null)
  %4 = call i1 @llvm.coro.alloc(token %3)
  br i1 %4, label %coro.alloc, label %coro.init

coro.alloc:                                       ; preds = %entry
  %5 = call i64 @llvm.coro.size.i64()
  %6 = call i64 @llvm.coro.align.i64()
  %7 = sub nsw i64 %6, 16
  %8 = icmp sgt i64 %7, 0
  %9 = select i1 %8, i64 %7, i64 0
  %10 = add i64 %5, %9
  %call = call noalias nonnull i8* @_Znwm(i64 %10) #11
  br label %coro.check.align

coro.check.align:                                 ; preds = %coro.alloc
  %11 = call i64 @llvm.coro.align.i64()
  %12 = icmp ugt i64 %11, 16
  br i1 %12, label %coro.alloc.align, label %coro.init

coro.alloc.align:                                 ; preds = %coro.check.align
  %mask = sub i64 %11, 1
  %intptr = ptrtoint i8* %call to i64
  %over_boundary = add i64 %intptr, %mask
  %inverted_mask = xor i64 %mask, -1
  %aligned_intptr = and i64 %over_boundary, %inverted_mask
  %diff = sub i64 %aligned_intptr, %intptr
  %aligned_result = getelementptr inbounds i8, i8* %call, i64 %diff
  call void @llvm.assume(i1 true) [ "align"(i8* %aligned_result, i64 %11) ]
  %13 = call i32 @llvm.coro.raw.frame.ptr.offset.i32()
  %14 = getelementptr inbounds i8, i8* %aligned_result, i32 %13
  %15 = bitcast i8* %14 to i8**
  store i8* %call, i8** %15, align 8
  br label %coro.init

coro.init:                                        ; preds = %coro.alloc.align, %coro.check.align, %entry
  %16 = phi i8* [ null, %entry ], [ %call, %coro.check.align ], [ %aligned_result, %coro.alloc.align ]
  %17 = call i8* @llvm.coro.begin(token %3, i8* %16)
  call void @_ZNSt12experimental16coroutine_traitsIJv21global_new_delete_tagEE12promise_type17get_return_objectEv(%"struct.std::experimental::coroutine_traits<void, global_new_delete_tag>::promise_type"* nonnull dereferenceable(1) %__promise)
  call void @_ZNSt12experimental16coroutine_traitsIJv21global_new_delete_tagEE12promise_type15initial_suspendEv(%"struct.std::experimental::coroutine_traits<void, global_new_delete_tag>::promise_type"* nonnull dereferenceable(1) %__promise)
  %call1 = call zeroext i1 @_ZN14suspend_always11await_readyEv(%struct.suspend_always* nonnull dereferenceable(1) %ref.tmp) #2
  br i1 %call1, label %init.ready, label %init.suspend

The deallocation looks like this:

cleanup:                                          ; preds = %final.ready, %final.cleanup, %init.cleanup
  %cleanup.dest.slot.0 = phi i32 [ 0, %final.ready ], [ 2, %final.cleanup ], [ 2, %init.cleanup ]
  %22 = call i8* @llvm.coro.free(token %3, i8* %17)
  %23 = icmp ne i8* %22, null
  br i1 %23, label %coro.free, label %after.coro.free

coro.free:                                        ; preds = %cleanup
  %24 = call i64 @llvm.coro.align.i64()
  %25 = icmp ugt i64 %24, 16
  %26 = call i32 @llvm.coro.raw.frame.ptr.offset.i32()
  %27 = getelementptr inbounds i8, i8* %22, i32 %26
  %28 = bitcast i8* %27 to i8**
  %29 = load i8*, i8** %28, align 8
  %30 = select i1 %25, i8* %29, i8* %22
  call void @_ZdlPv(i8* %30) #2
  br label %after.coro.free

after.coro.free:                                  ; preds = %cleanup, %coro.free
  switch i32 %cleanup.dest.slot.0, label %unreachable [
    i32 0, label %cleanup.cont
    i32 2, label %coro.ret
  ]

cleanup.cont:                                     ; preds = %after.coro.free
  br label %coro.ret

coro.ret:                                         ; preds = %cleanup.cont, %after.coro.free, %final.suspend, %init.suspend
  %31 = call i1 @llvm.coro.end(i8* null, i1 false)
  ret void

unreachable:                                      ; preds = %after.coro.free
  unreachable
}
Harbormaster completed remote builds in B101696: Diff 341592.

May I ask a question that may be too simple? What if the user sets the alignment of the promise (or of any other local variable) to 128 or even 256? All the discussion so far seems to assume that the largest alignment requirement is 64.

ychen added a comment.Thu, Apr 29, 8:11 PM

May I ask a question that may be too simple? What if the user sets the alignment of the promise (or of any other local variable) to 128 or even 256? All the discussion so far seems to assume that the largest alignment requirement is 64.

64 is one example. Bitwise operations (coro.alloc.align block in the attached example) should handle all valid alignment numbers.

ychen updated this revision to Diff 341756.Thu, Apr 29, 8:13 PM
  • Add missed Shape.CoroRawFramePtrOffsets.clear();

May I ask a question that may be too simple? What if the user sets the alignment of the promise (or of any other local variable) to 128 or even 256? All the discussion so far seems to assume that the largest alignment requirement is 64.

64 is one example. Bitwise operations (coro.alloc.align block in the attached example) should handle all valid alignment numbers.

Thanks for the example. I recommend adding a comment to the corresponding code; the bit operations and the example confused me. I will look into this and the other parts later.

This code snippet confused me at first:

coro.alloc.align:                                 ; preds = %coro.check.align
  %mask = sub i64 %11, 1
  %intptr = ptrtoint i8* %call to i64
  %over_boundary = add i64 %intptr, %mask
  %inverted_mask = xor i64 %mask, -1
  %aligned_intptr = and i64 %over_boundary, %inverted_mask
  %diff = sub i64 %aligned_intptr, %intptr
  %aligned_result = getelementptr inbounds i8, i8* %call, i64 %diff

This code implies that %diff >= 0. Formally, given Align = 2^m with m > 4 and Address = 16n, we need to prove that:

(Address + Align - 16) & ~(Align - 1) >= Address

The & ~(Align - 1) clears the lowest m bits. Align - 16 equals 2^m - 16, which is 16*(2^(m-4) - 1), so Address + Align - 16 can be written as 16*(n + 2^(m-4) - 1).
Let X be the value of the lowest m bits of Address + Align - 16.
Because X has m bits, X <= 2^m - 1. X must also be 16-aligned, so its lowest 4 bits are zero.
Therefore,

X <= 2^m - 1 - 1 - 2 - 4 - 8 = 2^m - 16

So the inequality we need to prove becomes:

16*(n + 2^(m-4) - 1) - X >= 16n

It suffices to check the largest possible value of X:

16*(n + 2^(m-4) - 1) - 2^m + 16 >= 16n

which holds trivially, since the left-hand side simplifies to 16n.

The overall proof looks non-trivial to me; it took me some time to work out, and I suspect other readers will not see it immediately either. I strongly recommend adding a comment with the corresponding proof to this code.

ychen added a comment.Mon, May 3, 9:51 AM

This code snippet confused me at first:

coro.alloc.align:                                 ; preds = %coro.check.align
  %mask = sub i64 %11, 1
  %intptr = ptrtoint i8* %call to i64
  %over_boundary = add i64 %intptr, %mask
  %inverted_mask = xor i64 %mask, -1
  %aligned_intptr = and i64 %over_boundary, %inverted_mask
  %diff = sub i64 %aligned_intptr, %intptr
  %aligned_result = getelementptr inbounds i8, i8* %call, i64 %diff

This code implies that %diff >= 0. Formally, given Align = 2^m with m > 4 and Address = 16n, we need to prove that:

(Address + Align - 16) & ~(Align - 1) >= Address

The & ~(Align - 1) clears the lowest m bits. Align - 16 equals 2^m - 16, which is 16*(2^(m-4) - 1), so Address + Align - 16 can be written as 16*(n + 2^(m-4) - 1).
Let X be the value of the lowest m bits of Address + Align - 16.
Because X has m bits, X <= 2^m - 1. X must also be 16-aligned, so its lowest 4 bits are zero.
Therefore,

X <= 2^m - 1 - 1 - 2 - 4 - 8 = 2^m - 16

So the inequality we need to prove becomes:

16*(n + 2^(m-4) - 1) - X >= 16n

It suffices to check the largest possible value of X:

16*(n + 2^(m-4) - 1) - 2^m + 16 >= 16n

which holds trivially, since the left-hand side simplifies to 16n.

The overall proof looks non-trivial to me; it took me some time to work out, and I suspect other readers will not see it immediately either. I strongly recommend adding a comment with the corresponding proof to this code.

The code is equivalent to

(Address + Align -1)&(~(Align-1)) >= Address

which should be correct. It is implemented by CodeGenFunction::EmitBuiltinAlignTo.
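
As a rough illustration of why the rounded-up address can never fall below the original one, here is a minimal C++ sketch of the same mask / add / and sequence. The names alignUp and alignDiff are hypothetical and only mirror the IR value names above; this is not the code in CodeGenFunction::EmitBuiltinAlignTo, and it assumes Align is a power of two.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round `addr` up to the next multiple of `align`.
// Assumes `align` is a power of two, as in the coro.alloc.align block.
constexpr std::uint64_t alignUp(std::uint64_t addr, std::uint64_t align) {
  return (addr + (align - 1)) & ~(align - 1); // %over_boundary & %inverted_mask
}

// Hypothetical helper: the offset added to the raw pointer (%diff).
// It lies in [0, align - 1], so the GEP never moves backwards.
constexpr std::uint64_t alignDiff(std::uint64_t addr, std::uint64_t align) {
  return alignUp(addr, align) - addr;
}

// A 16-aligned address rounded up to 64 moves by 0, 16, 32, or 48 bytes.
static_assert(alignUp(16, 64) == 64, "rounds up to the next boundary");
static_assert(alignUp(64, 64) == 64, "already aligned addresses are unchanged");
static_assert(alignDiff(48, 64) == 16, "diff is the distance to the boundary");
```

Since the addend is at most Align - 1 and the mask clears at most Align - 1, the result is always at least Address, which matches the inequality discussed above.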

ychen planned changes to this revision.Wed, May 5, 5:54 PM

Plan to rebase this together with the following patch for two lookups (aligned and non-aligned new/delete, and generate code accordingly)

ychen updated this revision to Diff 343956.EditedSun, May 9, 7:41 PM
  • Rebase on D102145.
  • Dynamically adjust the alignment for allocation and deallocation if the selected allocator does not have std::align_val_t argument. Otherwise, use the aligned allocation/deallocation function.
ychen updated this revision to Diff 343959.Sun, May 9, 8:25 PM
  • Rebase

which should be correct. It is implemented by CodeGenFunction::EmitBuiltinAlignTo.

I agree it is correct. I just want to say we should add a comment to avoid confusion.

Since this patch can handle the case where the frontend looks up ::operator new(size_t, align_val_t), it should be based on D102147.

ychen updated this revision to Diff 344602.Tue, May 11, 5:11 PM
  • Rebase on updated D102145 (use llvm.coro.raw.frame.ptr.addr during allocation)
ychen added a comment.Tue, May 11, 5:23 PM

which should be correct. It is implemented by CodeGenFunction::EmitBuiltinAlignTo.

I agree it is correct. I just want to say we should add a comment to avoid confusion.

Happy to do it in a separate patch since this patch does not change the implementation of CodeGenFunction::EmitBuiltinAlignTo.

Since this patch can handle the case where the frontend looks up ::operator new(size_t, align_val_t), it should be based on D102147.

This patch *could* handle both aligned and normal new/delete, so it doesn't need D102147 to work correctly?
D102147 depends on this patch, since it may find a non-aligned new/delete for an overaligned frame. In that case, this patch is required.

ChuanqiXu added inline comments.Wed, May 12, 5:00 AM
clang/include/clang/AST/StmtCXX.h
356–359 ↗(On Diff #344602)

Can't we merge these?

clang/lib/CodeGen/CGCoroutine.cpp
436–450

It looks like this would emit a deallocate first and then an alignedDeallocate, which is very odd. Although I can see that the second deallocate wouldn't be emitted because of the LastCoroFreeUsedForDealloc check, it is still very odd to me. If the second deallocate is never emitted, what's the reason we need to write emit(deallocate) twice?

441–479

This code would only work if we use ::operator new(size_t, align_val_t), which is implemented in another patch. I suggest moving this into that patch.

593

Since hasAlignArg is called only once, I suggest making it a lambda here, which would make the code easier to read.

595–597

I recommend adding a detailed comment here explaining why we need to over-allocate the frame. It is really hard to understand for people who are new to this code. Otherwise, they would need to use git blame to find the commit and this review page to figure out the reasons.

599–621

It may be better to organize it as:

if (!HasAlignArg) {
   if (auto *RetOnAllocFailure = S.getReturnStmtOnAllocFailure()) {
       auto *Cond = Builder.CreateICmpNE(AlignedAllocateCall, NullPtr);
       AlignAllocBB2 = createBasicBlock("coro.alloc.align2");
       Builder.CreateCondBr(Cond, AlignAllocBB2, RetOnFailureBB);
       EmitBlock(AlignAllocBB2);
   }
   auto *CoroAlign = Builder.CreateCall(
        CGM.getIntrinsic(llvm::Intrinsic::coro_align, SizeTy));
   ...
}
733

Is it possible that it would return a nullptr value?

ychen marked an inline comment as done.Wed, May 12, 9:59 PM
ychen added inline comments.
clang/include/clang/AST/StmtCXX.h
356–359 ↗(On Diff #344602)

I'm not sure about the "merge" here. Could you be more explicit?

clang/lib/CodeGen/CGCoroutine.cpp
436–450

Agree that LastCoroFreeUsedForDealloc is a bit confusing. It makes sure deallocation and aligned deallocation share one coro.free. Otherwise, AFAIK, two coro.free calls would be codegen'd.

%mem = llvm.coro.free()
br i1 <overalign> , label <aligned-dealloc>, label <dealloc>

aligned-dealloc:
    use %mem

dealloc:
    use %mem

what's the reason we need to write emit(deallocate) twice?

John wrote a code snippet here: https://reviews.llvm.org/D100739#2717582. I think it would be helpful to look at the changed tests below to see the patterns.

Basically, for allocation, it looks like below; for deallocation, it would be similar.

void *rawFrame = nullptr;
...
if (llvm.coro.alloc()) {
  size_t size = llvm.coro.size(), align = llvm.coro.align();
  if (align > NEW_ALIGN) {
#if <an allocation function without std::align_val_t argument is selected by Sema>
    size += align - NEW_ALIGN + sizeof(void*);
    frame = operator new(size);
    rawFrame = frame;
    frame = (frame + align - 1) & ~(align - 1);
#else
    // If an aligned allocation function is selected.
    frame = operator new(size, align);
#endif
  } else {
    frame = operator new(size);
  }
}

The true branch of the #if directive is equivalent to "coro.alloc.align" block (and "coro.alloc.align2" if get_return_object_on_allocation_failure is defined), the false branch is equivalent to "coro.alloc" block.
The above pattern handles both aligned/normal allocation/deallocation so it is independent of D102147.
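
To make the over-allocation branch concrete, here is a minimal standalone C++ sketch of that path outside of any coroutine machinery. The names FrameAlloc, allocateOveraligned, and deallocateOveraligned are hypothetical and do not appear in the patch. As a simplifying assumption, the raw pointer is returned in a struct rather than stored inside the frame, so the extra sizeof(void*) from the pseudocode above is not needed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Alignment guaranteed by plain operator new (NEW_ALIGN in the pseudocode).
#ifdef __STDCPP_DEFAULT_NEW_ALIGNMENT__
constexpr std::size_t NewAlign = __STDCPP_DEFAULT_NEW_ALIGNMENT__;
#else
constexpr std::size_t NewAlign = alignof(std::max_align_t); // assumed fallback
#endif

// Hypothetical result type: `raw` must be the pointer passed to operator delete.
struct FrameAlloc {
  void *raw;     // what operator new returned
  void *aligned; // frame start, rounded up to the requested alignment
};

// Hypothetical helper: over-allocate with the non-aligned operator new, then
// round the start address up. `align` is assumed to be a power of two.
inline FrameAlloc allocateOveraligned(std::size_t size, std::size_t align) {
  if (align <= NewAlign) {
    void *p = ::operator new(size);
    return {p, p};
  }
  // raw is NewAlign-aligned, so at most align - NewAlign bytes of padding.
  void *raw = ::operator new(size + align - NewAlign);
  auto addr = reinterpret_cast<std::uintptr_t>(raw);
  std::uintptr_t mask = static_cast<std::uintptr_t>(align) - 1;
  std::uintptr_t alignedAddr = (addr + mask) & ~mask;
  return {raw, static_cast<char *>(raw) + (alignedAddr - addr)};
}

// Deallocation must use the raw pointer, not the adjusted frame address.
inline void deallocateOveraligned(const FrameAlloc &a) {
  ::operator delete(a.raw);
}
```

The point of keeping both pointers is the one raised earlier in the review: once the frame address has been rounded up, the original pointer has to be recoverable so that the matching deallocation function receives it.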

441–479

It handles both aligned and normal new/delete.

593

will do

595–597

will do.

733

Not that I know of, because there is an early return:

if (!CoroFree) {
  CGF.CGM.Error(Deallocate->getBeginLoc(),
                "Deallocation expressoin does not refer to coro.free");
  return;
}
ychen updated this revision to Diff 345053.Wed, May 12, 10:58 PM
ychen marked an inline comment as done.
  • Address feedback.
ychen marked 3 inline comments as done.Wed, May 12, 10:58 PM
ChuanqiXu added inline comments.Wed, May 12, 11:06 PM
clang/include/clang/AST/StmtCXX.h
356–359 ↗(On Diff #344602)

Sorry. I meant to ask whether we can merge Allocate with AlignedAllocate and merge Deallocate with AlignedDeallocate, since from the implementation it looks like the values of Allocate and AlignedAllocate (and likewise Deallocate and AlignedDeallocate) are the same.

clang/lib/CodeGen/CGCoroutine.cpp
436–450

Thanks. Now I see why the code doesn't feel natural to me. I think ::operator new(size_t, align_val_t) shouldn't come up in this patch, since it only becomes available after D102147 is applied. You said this patch is independent of D102147, and I believe this patch could work without it. But it contains code that would only work once the successor patch is applied, so I think it is dependent on D102147.

The ideal relationship for me is to merge D102145 into this one (otherwise it is weird to me that D102145 only introduces some intrinsics which wouldn't actually be used). Then this patch should handle the alignment for variables in the coroutine frame without introducing ::new(size_t, align_val_t). Then the final patch could do the job of searching for and generating code for ::new(size_t, align_val_t).

Maybe it is a little bit hard to rebase again and again. But I think it is better.

733

Do you think it is better to merge this check here?

 if (CurCoro.Data && CurCoro.Data->LastCoroFreeUsedForDealloc) {
     if (!CoroFree) {
          CGF.CGM.Error(Deallocate->getBeginLoc(),
                "Deallocation expressoin does not refer to coro.free");
          return something;
     }
    return RValue::get(CurCoro.Data->LastCoroFree);
}
ychen added inline comments.Wed, May 12, 11:19 PM
clang/lib/CodeGen/CGCoroutine.cpp
436–450

I think I know where the confusion comes from. AlignedDeallocate is not guaranteed to be an aligned allocator. In this patch, in SemaCoroutine.cpp, it is set to Deallocate, in which case we always dynamically adjust frame alignment. Once D102147 has landed, AlignedDeallocate may or may not be an aligned allocator.

The ideal relationship for me is to merge D102145 into this one (otherwise it is weird to me that D102145 only introduces some intrinsics which wouldn't actually be used). Then this patch should handle the alignment for variables in the coroutine frame without introducing ::new(size_t, align_val_t). Then the final patch could do the job of searching for and generating code for ::new(size_t, align_val_t).

I was worried about the size of the patch if this is merged with D102145 but if that is preferred by more than one reviewer, I'll go ahead and do that. D102145 is pretty self-contained in that it does not contain clients of the added intrinsics but the introduced test should cover the expected intrinsic lowering.

ychen added inline comments.Wed, May 12, 11:22 PM
clang/include/clang/AST/StmtCXX.h
356–359 ↗(On Diff #344602)

Oh, this is to set up the path for D102147, where Allocate and AlignedAllocate could be different. If I do this in D102147, it will also touch CGCoroutine.cpp, which I'm trying to avoid since it is intended to be a Sema-only patch.

ychen added inline comments.Wed, May 12, 11:29 PM
clang/lib/CodeGen/CGCoroutine.cpp
436–450

Naming is hard. I had a hard time figuring out a better name. AlignedDeallocate/AlignedAllocate are intended to refer to the allocator/deallocator used for handling an overaligned frame, not necessarily to an allocator/deallocator with a std::align_val_t argument.

ChuanqiXu added inline comments.Wed, May 12, 11:39 PM
clang/include/clang/AST/StmtCXX.h
356–359 ↗(On Diff #344602)

Yeah, this is the key point of difference between us. I think that D102147 could and should touch the CodeGen part.

clang/lib/CodeGen/CGCoroutine.cpp
436–450

I think merging D102145 into this one would make this patch easier for me to understand. For example, the test cases in D102145 look weird to me, since they don't do over-alignment at all, as we discussed in that thread. Maybe my understanding is not right, but I don't think it is all that self-contained. I am OK with waiting for opinions from other reviewers.