This is an archive of the discontinued LLVM Phabricator instance.

[Cloning] handle blockaddress array clone in the same module
Needs ReviewPublic

Authored by ychen on Aug 17 2022, 5:12 PM.

Details

Summary

CloneFunction.cpp currently says "It is only legal to clone a function if a block address within that function is never referenced outside of the function.".
This is only the case for cloning into the same module (this patch handles this); for cloning/moving into a different module, it is actually okay.

I'm still working on a unit test but want to hear the reviewer's opinion earlier since I'm not sure why it is not handled before, no valid use case, or some constraints.

Fixes https://github.com/llvm/llvm-project/issues/56436

Diff Detail

Event Timeline

ychen created this revision.Aug 17 2022, 5:12 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 17 2022, 5:12 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
ychen requested review of this revision.Aug 17 2022, 5:12 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 17 2022, 5:12 PM
ychen edited the summary of this revision. (Show Details)
ychen added a subscriber: ChuanqiXu.
ychen updated this revision to Diff 453488.Aug 17 2022, 5:49 PM
  • add a test case for discussion.
  • TODO: check blockaddress cloning is done correctly in verifier.

For the coroutine examples, I think is not a perfect solution. Consider the case the initial_suspend is not std::suspend_always and we can edit the dispatch_table during the function: https://godbolt.org/z/rx37TaPde

In this case, it is not so right if there are multiple dispatch tables for one coroutine (The destroy functions could access the dispatch table in theory). There are consistency problems. There may be other tricky cases.

Besides the coroutine example, it will be problematic if there are multiple functions share one dispatch_table.

So at least I think it is OK to disable the use in coroutines since the address of label is not a standard feature.

llvm/lib/Transforms/Utils/CloneFunction.cpp
225

This assertion looks not stable. It is easy to generate code to break the assumption.

In general, you can't clone when indirectbrs are involved: since the blockaddress is a constant, it needs to be valid for all the cloned versions of the indirectbr. And you can't make different basic blocks with the same address.

I'm not sure why you need to clone code in this context, but it's possible to lower indirectbrs to switch instructions if necessary (at the cost of slightly slower code at runtime). See IndirectBrExpandPass.

In general, you can't clone when indirectbrs are involved: since the blockaddress is a constant, it needs to be valid for all the cloned versions of the indirectbr. And you can't make different basic blocks with the same address.

I mean, there are existing VMap for new blockaddress and old blockaddress. So each cloned indirectbr has its own set of blockaddress. For this case, the cloning logic does not apply this VMap to GlobalValue that uses blockaddress. no?

I'm not sure why you need to clone code in this context, but it's possible to lower indirectbrs to switch instructions if necessary (at the cost of slightly slower code at runtime). See IndirectBrExpandPass.

Oh, this is helpful. I'll take a look. Thanks.

In general, you can't clone when indirectbrs are involved: since the blockaddress is a constant, it needs to be valid for all the cloned versions of the indirectbr. And you can't make different basic blocks with the same address.

I mean, there are existing VMap for new blockaddress and old blockaddress. So each cloned indirectbr has its own set of blockaddress. For this case, the cloning logic does not apply this VMap to GlobalValue that uses blockaddress. no?

And the side effect is that any GlobalValue referencing blockaddress needs to be cloned too. Most of the time, this is just an array of blockaddress.

In general, you can't clone when indirectbrs are involved: since the blockaddress is a constant, it needs to be valid for all the cloned versions of the indirectbr. And you can't make different basic blocks with the same address.

I mean, there are existing VMap for new blockaddress and old blockaddress. So each cloned indirectbr has its own set of blockaddress. For this case, the cloning logic does not apply this VMap to GlobalValue that uses blockaddress. no?

And the side effect is that any GlobalValue referencing blockaddress needs to be cloned too. Most of the time, this is just an array of blockaddress.

Never mind, I don't think it is legal to clone any GlobalValue here. Only GlobalValue that contains only blockaddress.

In general, you can't clone when indirectbrs are involved: since the blockaddress is a constant, it needs to be valid for all the cloned versions of the indirectbr. And you can't make different basic blocks with the same address.

I mean, there are existing VMap for new blockaddress and old blockaddress. So each cloned indirectbr has its own set of blockaddress. For this case, the cloning logic does not apply this VMap to GlobalValue that uses blockaddress. no?

And the side effect is that any GlobalValue referencing blockaddress needs to be cloned too. Most of the time, this is just an array of blockaddress.

The problem isn't a mechanical "how to I clone a value".

The problem is this: suppose you have a function with an indirectbr. Then you clone it. Now you have two indirectbrs, each with a different destination. So you need to rewrite each use of a blockaddress to use the right value, depending on which indirectbr it's going to be passed to. But in general, you can't determine that at the point of use of the indirectbr; you can't tell until the indirectbr is about to be executed. blockaddresses can be stored in global variables, passed to functions, returned from functions, etc.

There are only two ways out of this, in general: either avoid cloning, or transform the indirectbr into a switch (or equivalent). We avoid the indirectbr->switch transform where possible because it adds an extra table lookup to code which is probably performance-sensitive. (We use IndirectBrExpandPass on targets where basic blocks don't have addresses, like wasm.)

ychen added a comment.Aug 18 2022, 1:14 PM

In general, you can't clone when indirectbrs are involved: since the blockaddress is a constant, it needs to be valid for all the cloned versions of the indirectbr. And you can't make different basic blocks with the same address.

I mean, there are existing VMap for new blockaddress and old blockaddress. So each cloned indirectbr has its own set of blockaddress. For this case, the cloning logic does not apply this VMap to GlobalValue that uses blockaddress. no?

And the side effect is that any GlobalValue referencing blockaddress needs to be cloned too. Most of the time, this is just an array of blockaddress.

Never mind, I don't think it is legal to clone any GlobalValue here. Only GlobalValue that contains only blockaddress.

In general, you can't clone when indirectbrs are involved: since the blockaddress is a constant, it needs to be valid for all the cloned versions of the indirectbr. And you can't make different basic blocks with the same address.

I mean, there are existing VMap for new blockaddress and old blockaddress. So each cloned indirectbr has its own set of blockaddress. For this case, the cloning logic does not apply this VMap to GlobalValue that uses blockaddress. no?

And the side effect is that any GlobalValue referencing blockaddress needs to be cloned too. Most of the time, this is just an array of blockaddress.

The problem isn't a mechanical "how to I clone a value".

The problem is this: suppose you have a function with an indirectbr. Then you clone it. Now you have two indirectbrs, each with a different destination. So you need to rewrite each use of a blockaddress to use the right value, depending on which indirectbr it's going to be passed to. But in general, you can't determine that at the point of use of the indirectbr; you can't tell until the indirectbr is about to be executed. blockaddresses can be stored in global variables, passed to functions, returned from functions, etc.

There are only two ways out of this, in general: either avoid cloning, or transform the indirectbr into a switch (or equivalent). We avoid the indirectbr->switch transform where possible because it adds an extra table lookup to code which is probably performance-sensitive. (We use IndirectBrExpandPass on targets where basic blocks don't have addresses, like wasm.)

Thanks, I see your point now. The issue is indeed not solvable in general for the reason you mentioned, but for certain commonly used patterns like global blockaddress array, it is legal to clone since each element of such array could not be shared by functions. How about we handle this use case in CloneFunction and let IndirectBrExpandPass handle other use cases? No matter what, the current situation of silently creating broken IR when cloning blockaddress is not ideal.

Thanks, I see your point now. The issue is indeed not solvable in general for the reason you mentioned, but for certain commonly used patterns like global blockaddress array, it is legal to clone since each element of such array could not be shared by functions.

You'd have to prove that values loaded from the array don't escape the function, which seems difficult to prove in cases where I'd expect to see blockaddress used.

the current situation of silently creating broken IR when cloning blockaddress is not ideal.

I'm not sure what invariant we can enforce here. Maybe something related to the CloneFunctionChangeType?

ychen added a comment.EditedAug 19 2022, 10:48 AM

Thanks, I see your point now. The issue is indeed not solvable in general for the reason you mentioned, but for certain commonly used patterns like global blockaddress array, it is legal to clone since each element of such array could not be shared by functions.

You'd have to prove that values loaded from the array don't escape the function, which seems difficult to prove in cases where I'd expect to see blockaddress used.

That's right. Maybe just handle the case where blockaddress is been written at all?

the current situation of silently creating broken IR when cloning blockaddress is not ideal.

I'm not sure what invariant we can enforce here. Maybe something related to the CloneFunctionChangeType?

That's hard in general and feels insufficient to just handle some cases but not the others. How about emitting a warning of sorts all the way back to the frontend if any blockaddress is written? Also, add a boolean flag for CloneFunction API to optionally call IndirectBrExpandPass's logic.

the current situation of silently creating broken IR when cloning blockaddress is not ideal.

I'm not sure what invariant we can enforce here. Maybe something related to the CloneFunctionChangeType?

That's hard in general and feels insufficient to just handle some cases but not the others. How about emitting a warning of sorts all the way back to the frontend if any blockaddress is written? Also, add a boolean flag for CloneFunction API to optionally call IndirectBrExpandPass's logic.

Oh, I was just talking about the API invariant. Clearly coroutine lowering needs to do something different here. And probably clang should warn or error if indirect goto can't generate the expected code.

I forget, why do we have to clone the coroutine body? I think I read an explanation at some point, but I'm not remembering where it was; it seems like it should be possible to avoid cloning.

ychen added a comment.Aug 19 2022, 4:00 PM

the current situation of silently creating broken IR when cloning blockaddress is not ideal.

I'm not sure what invariant we can enforce here. Maybe something related to the CloneFunctionChangeType?

That's hard in general and feels insufficient to just handle some cases but not the others. How about emitting a warning of sorts all the way back to the frontend if any blockaddress is written? Also, add a boolean flag for CloneFunction API to optionally call IndirectBrExpandPass's logic.

Oh, I was just talking about the API invariant. Clearly coroutine lowering needs to do something different here. And probably clang should warn or error if indirect goto can't generate the expected code.

Add a new boolean flag HandleIndirectBranch defaulted to true: if a caller can know the blockaddress is safe to clone, they could set this to false(I expect this to be false most of the time). When HandleIndirectBranch is true, special-casing the case like global blockaddress array for function F where F has no blockaddress captured/escaped(https://llvm.org/docs/LangRef.html#pointer-capture) by cloning that array; for other cases, call IndirectBrExpandPass. Regardless of HandleIndirectBranch value, if a function references any global value and blockaddress maybe captured, we emit the warning/error because IndirectBrExpandPass can not handle all cases correctly.

I forget, why do we have to clone the coroutine body? I think I read an explanation at some point, but I'm not remembering where it was; it seems like it should be possible to avoid cloning.

Since the instructions executed before the first co_await call need to stay at the original function(ramp function). But for patterns like the below, this could not be decided statically so cloning is needed.

task corofoo(){
  while(...){
    if(..) co_await();
    else <code>;
  }
}

PS: I think it is helpful to add bound check for indirectbr. But I'm not sure where to put it, a new clang flag?

the current situation of silently creating broken IR when cloning blockaddress is not ideal.

I'm not sure what invariant we can enforce here. Maybe something related to the CloneFunctionChangeType?

That's hard in general and feels insufficient to just handle some cases but not the others. How about emitting a warning of sorts all the way back to the frontend if any blockaddress is written? Also, add a boolean flag for CloneFunction API to optionally call IndirectBrExpandPass's logic.

Oh, I was just talking about the API invariant. Clearly coroutine lowering needs to do something different here. And probably clang should warn or error if indirect goto can't generate the expected code.

Add a new boolean flag HandleIndirectBranch defaulted to true: if a caller can know the blockaddress is safe to clone, they could set this to false(I expect this to be false most of the time). When HandleIndirectBranch is true, special-casing the case like global blockaddress array for function F where F has no blockaddress captured/escaped(https://llvm.org/docs/LangRef.html#pointer-capture) by cloning that array; for other cases, call IndirectBrExpandPass. Regardless of HandleIndirectBranch value, if a function references any global value and blockaddress maybe captured, we emit the warning/error because IndirectBrExpandPass can not handle all cases correctly.

I suspect the special case for "global blockaddress array" won't really trigger in practice; most uses of indirectbr involve the value escaping somehow, because otherwise there isn't really any value over just using a switch. (I don't think your testcase is representative.)

IndirectBrExpandPass can handle all cases correctly; it's just not efficient, like I mentioned before. And you probably don't want to do it without the caller explicitly asking for it.

I forget, why do we have to clone the coroutine body? I think I read an explanation at some point, but I'm not remembering where it was; it seems like it should be possible to avoid cloning.

Since the instructions executed before the first co_await call need to stay at the original function(ramp function). But for patterns like the below, this could not be decided statically so cloning is needed.

task corofoo(){
  while(...){
    if(..) co_await();
    else <code>;
  }
}

Cloning is one way to implement that, sure. I don't think it's the only way, though; you could move the body into a helper function, and have both coroutine entry-points call it. Maybe at some cost to performance. It might make sense to implement a scheme like this in case we run into other situations where we want to avoid cloning.

PS: I think it is helpful to add bound check for indirectbr. But I'm not sure where to put it, a new clang flag?

We could add a check to -fsanitize=undefined (https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html) to check that the address passed to an indirectbr is one of the expected addresses? Normally the undefined-behavior sanitizers insert the code early, though, in clang IR generation, so we can give good diagnostics. Not sure it would have caught a bug in a transform, though.

Alternatively, you could implement something more like CFI (https://clang.llvm.org/docs/ControlFlowIntegrity.html), which inserts checks late. So when the indirect-goto sanitizer is enabled, clang adds a function attribute, then the backend adds some checks just before isel, or something like that.

you could move the body into a helper function, and have both coroutine entry-points call it

that's the solution I used in the end to workaround this issue, by the way. I adjusted my C++ program accordingly, see https://godbolt.org/z/17G8jGbse

ychen added a comment.Aug 19 2022, 5:55 PM

I suspect the special case for "global blockaddress array" won't really trigger in practice; most uses of indirectbr involve the value escaping somehow, because otherwise there isn't really any value over just using a switch. (I don't think your testcase is representative.)

Could you share a test case that escapes or some example URL, please? I could not find many by googling computed gogo. Probably there are better keywords.

IndirectBrExpandPass can handle all cases correctly; it's just not efficient, like I mentioned before. And you probably don't want to do it without the caller explicitly asking for it.

One test case I was thinking of is

bool doInterpret(Instruction*& nextIntruction, int& out, int x) {
    static void* dispatch_table[] = {&&inc, &&suspend, &&stop};
    #define DISPATCH() goto *dispatch_table[*nextIntruction++]

    if(x%2){ // x==7: store &&suspend, x==9: store &&inc
       dispatch_table[0] = dispatch_table[x%3];
       return false;
    } else {
       DISPATCH();
    }

inc:
    ++out;
    DISPATCH();
suspend:
    return false;
stop:
    return true;
}

I don't think and am not sure if this is representative, but I guess it is valid?

If this is cloned even after IndirectBrExpandPass (not necessarily for coroutine, but in general), it may or may not do the right thing depending on how the two instances are called.

Cloning is one way to implement that, sure. I don't think it's the only way, though; you could move the body into a helper function, and have both coroutine entry-points call it. Maybe at some cost to performance. It might make sense to implement a scheme like this in case we run into other situations where we want to avoid cloning.

Sounds good, I'll look into it.

PS: I think it is helpful to add bound check for indirectbr. But I'm not sure where to put it, a new clang flag?

We could add a check to -fsanitize=undefined (https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html) to check that the address passed to an indirectbr is one of the expected addresses? Normally the undefined-behavior sanitizers insert the code early, though, in clang IR generation, so we can give good diagnostics. Not sure it would have caught a bug in a transform, though.

Alternatively, you could implement something more like CFI (https://clang.llvm.org/docs/ControlFlowIntegrity.html), which inserts checks late. So when the indirect-goto sanitizer is enabled, clang adds a function attribute, then the backend adds some checks just before isel, or something like that.

Sounds great. CFI approach seems more promising to me too. I'll look into it.

I suspect the special case for "global blockaddress array" won't really trigger in practice; most uses of indirectbr involve the value escaping somehow, because otherwise there isn't really any value over just using a switch. (I don't think your testcase is representative.)

Could you share a test case that escapes or some example URL, please? I could not find many by googling computed gogo. Probably there are better keywords.

It's really not that frequently used in general; people usually only go for it to squeeze out the last bit of performance out of an interpreter.

Maybe take a look at OCaml: https://github.com/ocaml/ocaml/blob/trunk/runtime/interp.c

IndirectBrExpandPass can handle all cases correctly; it's just not efficient, like I mentioned before. And you probably don't want to do it without the caller explicitly asking for it.

One test case I was thinking of is

bool doInterpret(Instruction*& nextIntruction, int& out, int x) {
    static void* dispatch_table[] = {&&inc, &&suspend, &&stop};
    #define DISPATCH() goto *dispatch_table[*nextIntruction++]

    if(x%2){ // x==7: store &&suspend, x==9: store &&inc
       dispatch_table[0] = dispatch_table[x%3];
       return false;
    } else {
       DISPATCH();
    }

inc:
    ++out;
    DISPATCH();
suspend:
    return false;
stop:
    return true;
}

I don't think and am not sure if this is representative, but I guess it is valid?

If this is cloned even after IndirectBrExpandPass (not necessarily for coroutine, but in general), it may or may not do the right thing depending on how the two instances are called.

The way IndirectBrExpandPass works is that it rewrites away all "blockaddress" constants. Once the IR no longer contains any "blockaddress", whether or not you clone is irrelevant.

jberdine added a subscriber: jberdine.EditedAug 22 2022, 2:20 AM

I'm not sure I have all the context, but it sounds like perhaps interpret.c from pr3120 could be a useful test case.

arsenm added a subscriber: arsenm.Jan 16 2023, 3:14 PM

I suspect this is related to the broken behavior for llvm-reduce mentioned in D140909

I suspect this is related to the broken behavior for llvm-reduce mentioned in D140909

The llvm-reduce path has an easy workaround which is to stop using CloneModule and only go through bitcode serialization (which is actually what we do for the multithread path anyway to get a unique context for each)

I suspect this is related to the broken behavior for llvm-reduce mentioned in D140909

The llvm-reduce path has an easy workaround which is to stop using CloneModule and only go through bitcode serialization (which is actually what we do for the multithread path anyway to get a unique context for each)

I agree with the summary in D140909. But I am confused that we can fix the current problem by the method in D140909.

All the discussion here is centered around cloning individual functions, and inserting the result in the same module. If you're using CloneModule, the concerns are different. I have no idea if that codepath works, but it's implementable in a straightforward manner.