Download Raw Diff

Details

Reviewers

aaron.ballman
jdoerfert
nlopes
mysterymath

Commits

rG70792cd4f837: [LangRef] Add description for nocallback attribute

Summary

This patch adds the description for nocallback attribute
that is implemented in https://reviews.llvm.org/D90275.

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

gulfem created this revision.Aug 10 2022, 3:45 PM

Herald added a project: Restricted Project. · View Herald TranscriptAug 10 2022, 3:45 PM

Herald added a subscriber: jdoerfert. · View Herald Transcript

gulfem requested review of this revision.Aug 10 2022, 3:45 PM

Herald added a project: Restricted Project. · View Herald TranscriptAug 10 2022, 3:45 PM

Herald added a subscriber: llvm-commits. · View Herald Transcript

gulfem added reviewers: aaron.ballman, jdoerfert, nlopes.Aug 10 2022, 3:47 PM

gulfem mentioned this in D90275: [clang][IR] Add support for leaf attribute.Aug 10 2022, 5:34 PM

Harbormaster completed remote builds in B180547: Diff 451661.Aug 10 2022, 6:50 PM

Does the semantics imply that the following is UB? I assume so.

define @fn() nocallback {
  call @foo()  ; missing nocallback
}

What's the usefulness of this attribute? Is there any optimization that is enabled by this? It's not obvious to me, so it would be nice to document that.

This should map to gcc's leaf attribute, right?

gcc docs say:

Calls to external functions with this attribute must return to the current compilation unit only by return or by exception handling. In particular, a leaf function is not allowed to invoke callback functions passed to it from the current compilation unit, directly call functions exported by the unit, or longjmp into the unit. Leaf functions might still call functions from other compilation units and thus they are not necessarily leaf in the sense that they contain no function calls at all.

The attribute is intended for library functions to improve dataflow analysis. The compiler takes the hint that any data not escaping the current compilation unit cannot be used or modified by the leaf function. For example, the sin function is a leaf function, but qsort is not.

Note that leaf functions might indirectly run a signal handler defined in the current compilation unit that uses static variables. Similarly, when lazy symbol resolution is in effect, leaf functions might invoke indirect functions whose resolver function or implementation function is defined in the current compilation unit and uses static variables. There is no standard-compliant way to write such a signal handler, resolver function, or implementation function, and the best that you can do is to remove the leaf attribute or mark all such static variables volatile. Lastly, for ELF-based systems that support symbol interposition, care should be taken that functions defined in the current compilation unit do not unexpectedly interpose other symbols based on the defined standards mode and defined feature test macros; otherwise an inadvertent callback would be added.

The attribute has no effect on functions defined within the current compilation unit. This is to allow easy merging of multiple compilation units into one, for example, by using the link-time optimization. For this reason the attribute is not allowed on types to annotate indirect calls.

So we expect same additional optimizations from nocallback? But for now, LLVM ignores nocallback, or..?

Thanks for the reference.

An important issue is that LLVM IR does not have the notion of compilation unit, as we can merge .bc/.ll files from different compilation units and the optimizers can't tell the difference.
So I'm a bit concerned with the semantics we are trying to give here. It can't be operationalized in LLVM IR AFAICT.
To implement an attribute equivalent to gcc's leaf we would need to capture the set of internal variables of each compilation unit in the attribute somehow.

We should use the wording of inaccessiblememonly here too:

inaccessiblememonly

    This attribute indicates that the function may only access memory that is not accessible by the module being compiled before return from the function. This is a weaker form of readnone. If the function reads or writes other memory, the behavior is undefined.

The idea is the same. So something like:

This attribute indicates that the function may only execute code that is not part of the module being compiled before return from the function.

I would further restrict it to be only applicable to function declarations and call sites of such.

What's the usefulness of this attribute? Is there any optimization that is enabled by this? It's not obvious to me, so it would be nice to document that.

We use it in the Attributor to build a better "call graph". "Default intrinsics" carry it, see https://reviews.llvm.org/D118680. This helps us to do proper reachability queries when we optimize memory accesses interprocedurally but have some intrinsic calls flying around. We also annotate external functions in the OpenMP GPU runtime for the same reason, see https://reviews.llvm.org/D112153.

In D131628#3716252, @jdoerfert wrote:

The idea is the same. So something like:

I'm fine with your description as well, so that the wording is more consistent between attributes. If there is no objection, I can make the change.

@nlopes and @xbolva00,
It seems like @jdoerfert answered your questions. Please let me know if you have further questions.

In D131628#3716252, @jdoerfert wrote:
The idea is the same. So something like:
This attribute indicates that the function may only execute code that is not part of the module being compiled before return from the function.

But this wording is not useful to lower gcc/clang's leaf attribute. The semantics don't match then. The semantics on LTO is different.
The gcc attribute only guarantees that a leaf fn call doesn't change any global in the present file. Doesn't guarantee that it won't change a global in another file. The proposed wording doesn't match this semantics.

I only see 2 ways forward:

Drop the lowering from clang's leaf attribute. We've no data to support that there's any performance benefit
Someone provides performance data justifying the effort and we take time to properly design this. A simple hack would be to drop all nocallback attributes when linking IR files. That would fix the LTO case.

What's the usefulness of this attribute? Is there any optimization that is enabled by this? It's not obvious to me, so it would be nice to document that.

We use it in the Attributor to build a better "call graph". "Default intrinsics" carry it, see https://reviews.llvm.org/D118680. This helps us to do proper reachability queries when we optimize memory accesses interprocedurally but have some intrinsic calls flying around. We also annotate external functions in the OpenMP GPU runtime for the same reason, see https://reviews.llvm.org/D112153.

Thanks.
I see the point now. You can decide that a variable whose address has escaped won't be changed by the fn call. Unclear what's the perf benefit without data though.

Anyway, the semantics issue pointed out above must be resolved or else the patch must be reverted as currently the lowering of clang is wrong and will miscompile programs.

This revision now requires changes to proceed.Aug 12 2022, 7:38 AM

In D131628#3719136, @nlopes wrote:

Anyway, the semantics issue pointed out above must be resolved or else the patch must be reverted as currently the lowering of clang is wrong and will miscompile programs.

Let's just resolve this now and here.

Drop the lowering from clang's leaf attribute. We've no data to support that there's any performance benefit

Someone provides performance data justifying the effort and we take time to properly design this. [...]

I figured us using it implies we need it. However, here is some data: https://tianshilei.me/wp-content/uploads/2022/06/ipdps-2022.pdf
Note section IV-B2 which talks about inter-procedural reachability, that is only possible with nocallback on intrinsics (and leaf on malloc and free in some cases).
GridMini is a really simple test case, anything with more complexity (which causes the kernel to be not completely inlined) benefits explicitly from this.

[...] A simple hack would be to drop all nocallback attributes when linking IR files. That would fix the LTO case.

nocallback follows, as I mentioned before, the same idea as inaccesiblememory. You are not wrong about LTO, but we can apply the same handling to both which is to drop the attribute during a llvm-link step. We only need to do that if we link in the definition. That is why I said this should only be valid on declarations. (inaccessiblememory should also have that restriction).

In D131628#3719212, @jdoerfert wrote:

In D131628#3719136, @nlopes wrote:

Anyway, the semantics issue pointed out above must be resolved or else the patch must be reverted as currently the lowering of clang is wrong and will miscompile programs.

Let's just resolve this now and here.

I figured us using it implies we need it. However, here is some data: https://tianshilei.me/wp-content/uploads/2022/06/ipdps-2022.pdf

My question is normal and reasonable. I ask it every time someone wants to add something new to the IR.
The LLVM community must know why it has to deal with extra complexity forever. I will refuse any IR feature that gives 0.00001% speedup for 1 benchmark.

Note section IV-B2 which talks about inter-procedural reachability, that is only possible with nocallback on intrinsics (and leaf on malloc and free in some cases).

No numbers are provided. My question still stands.
Plus, malloc/free have the inaccessiblememonly attribute, which means they don't need the nocallback attribute. These attributes are transitive, so it's a given that malloc cannot call any function that doesn't have the inaccessiblememonly attribute.
What's the impact of this attribute in your reachability analysis?

[...] A simple hack would be to drop all nocallback attributes when linking IR files. That would fix the LTO case.

nocallback follows, as I mentioned before, the same idea as inaccesiblememory. You are not wrong about LTO, but we can apply the same handling to both which is to drop the attribute during a llvm-link step. We only need to do that if we link in the definition. That is why I said this should only be valid on declarations. (inaccessiblememory should also have that restriction).

It's not the same thing as nocallback is given by users and inaccessiblememonly is not. LLVM adds the latter only to libcalls. It's not always wrong to do LTO with it; only if a lib function calls another function and they share the same globals -- not sure it exists in practice.
Nevertheless, we cannot introduce a new IR feature that we know that is buggy and without a plan to fix it. Absolutely no way! The patch must be reverted first. Then you can propose a plan to reintroduce the feature in a sound way if it's justified. We need numbers please.

In D131628#3721285, @nlopes wrote:

In D131628#3719212, @jdoerfert wrote:

In D131628#3719136, @nlopes wrote:

Anyway, the semantics issue pointed out above must be resolved or else the patch must be reverted as currently the lowering of clang is wrong and will miscompile programs.

Let's just resolve this now and here.

I figured us using it implies we need it. However, here is some data: https://tianshilei.me/wp-content/uploads/2022/06/ipdps-2022.pdf

My question is normal and reasonable.

My answers are too :). One thing that is missing from this conversation is an example of a breakage. Or did I miss that?

I ask it every time someone wants to add something new to the IR.
The LLVM community must know why it has to deal with extra complexity forever. I will refuse any IR feature that gives 0.00001% speedup for 1 benchmark.

That is great, but not the case here.

Note section IV-B2 which talks about inter-procedural reachability, that is only possible with nocallback on intrinsics (and leaf on malloc and free in some cases).

No numbers are provided. My question still stands.

There are numbers in the paper, both for the impact of the entire inter-procedural store-load forwarding as well as the different parts.
All OpenMP GPU benchmarks benefit from this substantially, especially once all paper parts have been upstreamed.
That is why we did the study (=paper) and prototype. Cleaning it up and upstreaming simply takes time.

Plus, malloc/free have the inaccessiblememonly attribute, which means they don't need the nocallback attribute. These attributes are transitive, so it's a given that malloc cannot call any function that doesn't have the inaccessiblememonly attribute.

That is easy to say and hard to do. If you set analysis up composable it's hard to embed "cross query" arguments, e.g., no local memory is accesses imply no local code can be executed.
At the end of the day you never know why some user might ask for reachability, or a call graph. The logic you propose doesn't work to deduce norecurse for the caller, for example.
That said, we are about to replace malloc/free on the GPU with custom implementations (=definitions) this is necessary for AMD anyway and the NVIDIA impl is slow, so replacing it makes sense.
As that change dropped, neither malloc nor free will be inaccessiblememonly anymore. Most other libc functions we are about to provide would be annotated as well, incl. the ones that go to the host with rpc calls.

What's the impact of this attribute in your reachability analysis?

It allows to avoid spurious call edges due to intrinsics and annotated functions (see above) instead of treating intrinsics as if they would be implicitly nocallback.

Below, we want to know if g() can be reached by f(). We can only be answered with "no" if we have nocallback on llvm.memset.

void kernel() { 
 g();
 f();
}
static void f() {
  llvm.memset(...) 
}
static void g() { ... }

Now you might argue that LLVM already builds a call graph where that can be determined, but that is only because it erroneously assumes nocallback (effectively) to any intrinsic.
Let's look at the following example (https://godbolt.org/z/nef9n6qob)

define dso_local void @test(ptr %A) {
entry:
  call void @llvm.memset.p0.i64(ptr %A, i8 0, i64 100, i1 false)
  call void @llvm.unknown.new.intrinsic(ptr @bar)
  ret void
}

declare void @bar()

declare void @llvm.memset.p0.i64(ptr nocapture writeonly, i8, i64, i1 immarg)

declare void @llvm.unknown.new.intrinsic(ptr)

@test is calling two functions and the unknown intrinsic might just as well call @bar. However, the call graph doesn't reflect either:

Call graph node <<null function>><<0x555c5607bb10>>  #uses=0
  CS<None> calls function 'test'
  CS<None> calls function 'bar'
  CS<None> calls function 'llvm.memset.p0.i64'
  CS<None> calls function 'llvm.unknown.new.intrinsic'

Call graph node for function: 'bar'<<0x555c5607bc50>>  #uses=1
  CS<None> calls external node

Call graph node for function: 'llvm.memset.p0.i64'<<0x555c5607bcd0>>  #uses=1

Call graph node for function: 'llvm.unknown.new.intrinsic'<<0x555c5607be00>>  #uses=1

Call graph node for function: 'test'<<0x555c5607bbd0>>  #uses=1

Compiler returned: 0

[...] A simple hack would be to drop all nocallback attributes when linking IR files. That would fix the LTO case.

nocallback follows, as I mentioned before, the same idea as inaccesiblememory. You are not wrong about LTO, but we can apply the same handling to both which is to drop the attribute during a llvm-link step. We only need to do that if we link in the definition. That is why I said this should only be valid on declarations. (inaccessiblememory should also have that restriction).

It's not the same thing as nocallback is given by users and inaccessiblememonly is not.

True. But that might be subject to change as more people want to expose more attributes to the user. Different discussion though.

LLVM adds the latter only to libcalls. It's not always wrong to do LTO with it; only if a lib function calls another function and they share the same globals -- not sure it exists in practice.

That is not true and not what llvm-link does. We need to drop either attribute if we link in a definition. Let's run an example:

a.ll

define i32 @main() {
entry:
  %r1 = call i32 @foo1()
  %r2 = call i32 @foo2()
  ret i32 %r2
}

declare i32 @foo1() inaccessiblememonly
declare i32 @foo2() inaccessiblememonly

b.ll

@g = internal global i32 0

define i32 @foo1() noinline {
entry:
  %l = load i32, i32* @g
  %a = add i32 %l, 1
  store i32 %a, i32* @g
  ret i32 %l
}
define i32 @foo2() {
entry:
  %l = load i32, i32* @g
  %a = add i32 %l, 1
  store i32 %a, i32* @g
  ret i32 %l
}

Now we run llvm-link a.ll b.ll -o c.ll, note that no globals are shared, a.ll has no globals after all.

c.ll

@g = internal global i32 0

define i32 @main() {
entry:
  %r1 = call i32 @foo1()
  %r2 = call i32 @foo2()
  ret i32 %r2
}

; Function Attrs: noinline
define i32 @foo1() #0 {
entry:
  %l = load i32, i32* @g
  %a = add i32 %l, 1
  store i32 %a, i32* @g
  ret i32 %l
}

define i32 @foo2() {
entry:
  %l = load i32, i32* @g
  %a = add i32 %l, 1
  store i32 %a, i32* @g
  ret i32 %l
}

attributes #0 = { noinline }

inaccessiblememonly is gone (which is correct and necessary).

Nevertheless, we cannot introduce a new IR feature that we know that is buggy and without a plan to fix it. Absolutely no way! The patch must be reverted first. Then you can propose a plan to reintroduce the feature in a sound way if it's justified. We need numbers please.

The feature is not buggy. LTO works fine as the inaccessiblememonly is actually not hardcoded but a side-effect.
Let's do the test:
a.ll

define i32 @main() {
entry:
  %r1 = call i32 @foo1()
  %r2 = call i32 @foo2()
  ret i32 %r2
}

declare i32 @foo1() inaccessiblememonly nocallback
declare i32 @foo2() inaccessiblememonly nocallback

b.ll

@g = internal global i32 0

define i32 @foo1() noinline {
entry:
  %l = load i32, i32* @g
  %a = add i32 %l, 1
  store i32 %a, i32* @g
  ret i32 %l
}
define i32 @foo2() {
entry:
  %l = load i32, i32* @g
  %a = add i32 %l, 1
  store i32 %a, i32* @g
  call i32 @foo1()
  ret i32 %l
}

Now we run LTO:
llvm-lto a.bc b.bc --save-linked-module -o linked
and look at the merged module (effectively llvm-linked module)
opt -S linked.linked.bc

@g = internal global i32 0

define internal i32 @main() {
entry:
  %r1 = call i32 @foo1()
  %r2 = call i32 @foo2()
  ret i32 %r2
}

; Function Attrs: noinline
define internal i32 @foo1() #0 {
entry:
  %l = load i32, ptr @g, align 4
  %a = add i32 %l, 1
  store i32 %a, ptr @g, align 4
  ret i32 %l
}

define internal i32 @foo2() {
entry:
  %l = load i32, ptr @g, align 4
  %a = add i32 %l, 1
  store i32 %a, ptr @g, align 4
  ret i32 %l
}

attributes #0 = { noinline }

All attributes we needed to drop are gone.

In D131628#3719136, @nlopes wrote:

Anyway, the semantics issue pointed out above must be resolved or else the patch must be reverted as currently the lowering of clang is wrong and will miscompile programs.

@nlopes could you please provide us examples of miscompilations and we can all be on the same page. I'm not clear why and how nocallback is causing miscompilations.

In D131628#3723580, @jdoerfert wrote:

In D131628#3721285, @nlopes wrote:

In D131628#3719212, @jdoerfert wrote:

In D131628#3719136, @nlopes wrote:

Anyway, the semantics issue pointed out above must be resolved or else the patch must be reverted as currently the lowering of clang is wrong and will miscompile programs.

Let's just resolve this now and here.

I figured us using it implies we need it. However, here is some data: https://tianshilei.me/wp-content/uploads/2022/06/ipdps-2022.pdf

My question is normal and reasonable.

My answers are too :). One thing that is missing from this conversation is an example of a breakage. Or did I miss that?

No. Your answer is on the line of "trust me, I know what I'm doing". That's not an ok answer. We have the right for a proper answer with numbers.

Regarding the bug, you said yourself and I quote "You are not wrong about LTO", so I assumed you understood the issue. I left the sketch of the bug in the other thread nevertheless.

I ask it every time someone wants to add something new to the IR.
The LLVM community must know why it has to deal with extra complexity forever. I will refuse any IR feature that gives 0.00001% speedup for 1 benchmark.

That is great, but not the case here.

Numbers please?

Note section IV-B2 which talks about inter-procedural reachability, that is only possible with nocallback on intrinsics (and leaf on malloc and free in some cases).

No numbers are provided. My question still stands.

There are numbers in the paper, both for the impact of the entire inter-procedural store-load forwarding as well as the different parts.
All OpenMP GPU benchmarks benefit from this substantially, especially once all paper parts have been upstreamed.
That is why we did the study (=paper) and prototype. Cleaning it up and upstreaming simply takes time.

I would appreciate if you could summarize the impact of this feature. There are no numbers in the section IV-B2 you pointed out.

What's the impact of this attribute in your reachability analysis?

It allows to avoid spurious call edges due to intrinsics and annotated functions (see above) instead of treating intrinsics as if they would be implicitly nocallback.

Below, we want to know if g() can be reached by f(). We can only be answered with "no" if we have nocallback on llvm.memset.
void kernel() { 
 g();
 f();
}
static void f() {
  llvm.memset(...) 
}
static void g() { ... }

The address of the functions must have been stored to memory, otherwise the functions are not callable indirectly.
memset is writeonly, so it can't call anything as it can't read any pointer from memory. Problem solved. memcpy is argmemonly.
I know the argument doesn't apply everywhere, but how many programs write functions pointers to memory and call some intrinsic where you cannot deduce it won't read those pointers? It's a very tight condition.
Hence I ask again: how impactful is this feature?

Nevertheless, we cannot introduce a new IR feature that we know that is buggy and without a plan to fix it. Absolutely no way! The patch must be reverted first. Then you can propose a plan to reintroduce the feature in a sound way if it's justified. We need numbers please.

The feature is not buggy. LTO works fine as the inaccessiblememonly is actually not hardcoded but a side-effect.
Let's do the test:
a.ll
define i32 @main() {
entry:
  %r1 = call i32 @foo1()
  %r2 = call i32 @foo2()
  ret i32 %r2
}

declare i32 @foo1() inaccessiblememonly nocallback
declare i32 @foo2() inaccessiblememonly nocallback
b.ll
@g = internal global i32 0

define i32 @foo1() noinline {
entry:
  %l = load i32, i32* @g
  %a = add i32 %l, 1
  store i32 %a, i32* @g
  ret i32 %l
}
define i32 @foo2() {
entry:
  %l = load i32, i32* @g
  %a = add i32 %l, 1
  store i32 %a, i32* @g
  call i32 @foo1()
  ret i32 %l
}
Now we run LTO:
All attributes we needed to drop are gone.

Thank you for the example!
That shows the linker already drops the nocallback attribute. However, it seems the linker only drops the attribute when linking in the definition. But I don't think this is sufficient. nocallback means you don't call any function in that file. Once you bring in more functions you can't ensure you won't call those extra functions. You'll get the call graph potentially wrong.
The linker must always drop nocallback.

I also suggest that the documentation should mention the attributed must be dropped during linking.

In D131628#3724062, @nlopes wrote:

That shows the linker already drops the nocallback attribute. However, it seems the linker only drops the attribute when linking in the definition. But I don't think this is sufficient. nocallback means you don't call any function in that file. Once you bring in more functions you can't ensure you won't call those extra functions. You'll get the call graph potentially wrong.
The linker must always drop nocallback.

I also suggest that the documentation should mention the attributed must be dropped during linking.

Here's a more concrete example:
main.ll:

define i32 @main() {
entry:
  call void @foo()
  ret i32 0
}
declare void @foo() nocallback

foo.ll:

define void @foo() {
entry:
  call @bar()
  ret void
}

bar.ll:

define void @bar() {
entry:
  ret void
}

Now, if you merge main.ll and bar.ll together, you get:

; ModuleID = 'merged.bc'
source_filename = "llvm-link"

define i32 @main() {
entry:
  call void @foo()
  ret i32 0
}

; Function Attrs: nocallback
declare void @foo() #0

define void @bar() {
entry:
  ret void
}

attributes #0 = { nocallback }

Then, the nocallback attribute on @foo in the merged module is incorrect, since the implementation calls @bar, which is now in the same module.

Do we actually know what GCC's LTO behavior is in this case? It's interesting that part of the description of __attribute__((leaf)) says it's there "so files can easily be merged", but it seems like merging modules in GCC would have the same issue, unless they do provenance tracking of which functions came from which original translation unit, or if they do all decision-making that involves __attribute__((leaf)) before the modules are merged.

The linker must always drop nocallback.

Yes, let's do that. We'll create a patch. Also for inaccessiblememonly and friends, see below.

@mysterymath Great example. inaccessiblememonly is broken in the same way :)

; main.ll
define i32 @main() {
entry:
  call void @foo()
  %r = call i32 @bar()
  ret i32 %r
}
declare void @foo() inaccessiblememonly
declare i32 @bar()

; foo.ll
@g = global i32 0

define void @foo() {
entry:
  store i32 42, i32* @g
  ret void
}

; bar.ll
@g = external global i32

define i32 @bar() {
entry:
  %r = load i32, i32* @g
  ret i32 %r
}

llvm-link main.ll bar.ll -o merged.ll

; merged.ll
@g = external global i32

define i32 @main() {
entry:
  call void @foo()
  %r = call i32 @bar()
  ret i32 %r
}

; Function Attrs: inaccessiblememonly
declare void @foo() #0

define i32 @bar() {
entry:
  %r = load i32, i32* @g
  ret i32 %r
}

attributes #0 = { inaccessiblememonly }

jhuber6 added a subscriber: jhuber6.Aug 15 2022, 1:59 PM

In D131628#3724307, @jdoerfert wrote:

The linker must always drop nocallback.

Yes, let's do that. We'll create a patch. Also for inaccessiblememonly and friends, see below.

If inaccessiblememonly is generated internally with the expectation of a per-translation-unit semantics, then it makes sense to summarily drop it.
But unless someone already has, we really should verify that __attribute__((leaf)) actually does have per-translation-unit semantics in GCC LTO, not per LTO-unit. I wouldn't be surprised if this feature predates LTO, and some kind of naive module merging in GCC may give it identical semantics to the one we have today.

There's even a case to be made for the LTO unit interpretation; it allows you to declare a function to effectively be a leaf if it can't call anything that could possible be included in the current LTO unit. This would allow pruning the reachability graph more thoroughly in LTO than the other interpretation, and I'd argue that this attribute is primarily useful for functions that are actually or approximately leaf functions.

In D131628#3724395, @mysterymath wrote:

In D131628#3724307, @jdoerfert wrote:

The linker must always drop nocallback.

Yes, let's do that. We'll create a patch. Also for inaccessiblememonly and friends, see below.

If inaccessiblememonly is generated internally with the expectation of a per-translation-unit semantics, then it makes sense to summarily drop it.
But unless someone already has, we really should verify that __attribute__((leaf)) actually does have per-translation-unit semantics in GCC LTO, not per LTO-unit. I wouldn't be surprised if this feature predates LTO, and some kind of naive module merging in GCC may give it identical semantics to the one we have today.

There's even a case to be made for the LTO unit interpretation; it allows you to declare a function to effectively be a leaf if it can't call anything that could possible be included in the current LTO unit. This would allow pruning the reachability graph more thoroughly in LTO than the other interpretation, and I'd argue that this attribute is primarily useful for functions that are actually or approximately leaf functions.

For both, inaccessiblememonly and nocallback I agree. I also expect our use case to "opt-out" of the default dropping by the linker. We really want/need this for "external" functions, like LTO module external, e.g., provided by the runtime system.

That shows the linker already drops the nocallback attribute. However, it seems the linker only drops the attribute when linking in the definition. But I don't think this is sufficient. nocallback means you don't call any function in that file. Once you bring in more functions you can't ensure you won't call those extra functions. You'll get the call graph potentially wrong.

I don't think nocallback means you don't call any function in that file. Functions with leaf attributes can still call other functions, even functions in other translation units based on GCC spec.

Leaf functions might still call functions from other compilation units and thus they are not necessarily leaf in the sense that they contain no function calls at all.

Are we on the same page on that?

In D131628#3724731, @gulfem wrote:
That shows the linker already drops the nocallback attribute. However, it seems the linker only drops the attribute when linking in the definition. But I don't think this is sufficient. nocallback means you don't call any function in that file. Once you bring in more functions you can't ensure you won't call those extra functions. You'll get the call graph potentially wrong.

I don't think nocallback means you don't call any function in that file. Functions with leaf attributes can still call other functions, even functions in other translation units based on GCC spec.
Leaf functions might still call functions from other compilation units and thus they are not necessarily leaf in the sense that they contain no function calls at all.
Are we on the same page on that?

They can call functions on other translation units. But the problem is when you merge translation units. What was a fine call before, may not be anymore after linking. Before linking. the leaf function would call a callback from another file (which is ok), after linking it suddenly calls a callback in the same file (not ok).

We can do 2 things:

drop nocalback and inacc... whenever we link as the current "module-based" wording might be violated in any link step.
rephrase it to be "lto-unit" based. So it needs to keep holding even after every step of (IR-)linking.

I believe 2) is what we actually want, thoughts?

In D131628#3732352, @jdoerfert wrote:

We can do 2 things:

drop nocalback and inacc... whenever we link as the current "module-based" wording might be violated in any link step.

rephrase it to be "lto-unit" based. So it needs to keep holding even after every step of (IR-)linking.

I believe 2) is what we actually want, thoughts?

GIven that the semantics of gcc's leaf attribute is per file, there's no other options than dropping the attribute when linking.

In D131628#3738046, @nlopes wrote:

GIven that the semantics of gcc's leaf attribute is per file, [...]

I finally got around to compiling up a GCC instance and adding enough printfs to the tree dumper to see what it does on the main/foo/bar example I gave above.
As far as I can tell, GCC doesn't drop __attribute__((leaf)) when merging modules in LTO, as the call to foo in main is internally marked as leaf, even when LTOed against bar, keeping the foo module out of the LTO unit.

If this behavior is intentional (and I didn't mess up the measurement), then "compilation unit" in the GCC doc must refer to LTO unit in the case of LTO. If it's unintentional, then there may be a bug to file against both GCC and LLVM.

I'm fairly novice with GCC, so it's reasonably likely that I screwed something up. I'd welcome someone with more experience trying to reproduce or falsify these results.

If these results are correct, then my suggestion would be to explicitly codify this behavior, since it provides the most utility for the feature in case of LTO, and it's consistent with GCC's actual behavior and one reading of it's somewhat ambiguous feature description.

In D131628#3740602, @mysterymath wrote:

In D131628#3738046, @nlopes wrote:

GIven that the semantics of gcc's leaf attribute is per file, [...]

I finally got around to compiling up a GCC instance and adding enough printfs to the tree dumper to see what it does on the main/foo/bar example I gave above.
As far as I can tell, GCC doesn't drop __attribute__((leaf)) when merging modules in LTO, as the call to foo in main is internally marked as leaf, even when LTOed against bar, keeping the foo module out of the LTO unit.

If this behavior is intentional (and I didn't mess up the measurement), then "compilation unit" in the GCC doc must refer to LTO unit in the case of LTO. If it's unintentional, then there may be a bug to file against both GCC and LLVM.

I'm fairly novice with GCC, so it's reasonably likely that I screwed something up. I'd welcome someone with more experience trying to reproduce or falsify these results.

If these results are correct, then my suggestion would be to explicitly codify this behavior, since it provides the most utility for the feature in case of LTO, and it's consistent with GCC's actual behavior and one reading of it's somewhat ambiguous feature description.

Thank you!
Do you mind filing a bug report with gcc, please? Either the code or the documentation needs fixing.

In D131628#3742309, @nlopes wrote:

In D131628#3740602, @mysterymath wrote:

In D131628#3738046, @nlopes wrote:

GIven that the semantics of gcc's leaf attribute is per file, [...]

I finally got around to compiling up a GCC instance and adding enough printfs to the tree dumper to see what it does on the main/foo/bar example I gave above.
As far as I can tell, GCC doesn't drop __attribute__((leaf)) when merging modules in LTO, as the call to foo in main is internally marked as leaf, even when LTOed against bar, keeping the foo module out of the LTO unit.

If this behavior is intentional (and I didn't mess up the measurement), then "compilation unit" in the GCC doc must refer to LTO unit in the case of LTO. If it's unintentional, then there may be a bug to file against both GCC and LLVM.

I'm fairly novice with GCC, so it's reasonably likely that I screwed something up. I'd welcome someone with more experience trying to reproduce or falsify these results.

If these results are correct, then my suggestion would be to explicitly codify this behavior, since it provides the most utility for the feature in case of LTO, and it's consistent with GCC's actual behavior and one reading of it's somewhat ambiguous feature description.

Thank you!
Do you mind filing a bug report with gcc, please? Either the code or the documentation needs fixing.

Good idea; I've filed https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106725

nikic mentioned this in D136939: [X86] Use default attributes for intrinsics.Oct 28 2022, 5:05 AM

How should we proceed on this? AFAICT, we did not get a clear answer about the semantics of leaf attribute for LTO from LTO semantics for __attribute__((leaf)).
Would that be ok to drop nocalback attribute during linking? There is an LLVM bug now (Missing LangRef documentation for nocallback), and leaf attribute support will be reverted if we cannot come up with the description.

In D131628#3897684, @gulfem wrote:

How should we proceed on this? AFAICT, we did not get a clear answer about the semantics of leaf attribute for LTO from LTO semantics for __attribute__((leaf)).
Would that be ok to drop nocalback attribute during linking? There is an LLVM bug now (Missing LangRef documentation for nocallback), and leaf attribute support will be reverted if we cannot come up with the description.

I spent a little more time on this today, and I was able to come up with a concrete case where GCC will optimize as if the semantics of attribute leaf were LTO-unit, as was implemented in LLVM. I've posted this on the GCC bug.
Unfortunately, my reproducer has different behavior between the equivalent of GCC's full LTO and thin LTO modes, so I've no idea what they'll do in response to this.

At the very least, the semantics currently implemented in LLVM seem no worse than those than those currently implemented in GCC, but both appear to disagree with their docs and the expectations of those on the GCC bug thread. Personally, I'd think that we could add documentation for the current behavior and call out the controversy, but that call seems a bit above my paygrade. It does seem unlikely that the controversy will be resolved in the immediate future.

In D131628#3897684, @gulfem wrote:

How should we proceed on this? AFAICT, we did not get a clear answer about the semantics of leaf attribute for LTO from LTO semantics for __attribute__((leaf)).
Would that be ok to drop nocalback attribute during linking? There is an LLVM bug now (Missing LangRef documentation for nocallback), and leaf attribute support will be reverted if we cannot come up with the description.

I suggest making things transparent for LTO, as that's less surprising.
So, dropping nocallback during linking seems the way to go (and adjust the docs accordingly). Once you link, you can do inter-proc analysis and recover that info if needed anyway.

nhaehnle added a subscriber: nhaehnle.Nov 1 2022, 7:42 AM

In D131628#3898605, @nlopes wrote:

So, dropping nocallback during linking seems the way to go (and adjust the docs accordingly). Once you link, you can do inter-proc analysis and recover that info if needed anyway.

This would make Clang the only implementation of __attribute__((leaf)) to do so.
It's also not generally possible to recover the nocallback information about functions external to the LTO unit, which is the only thing __attribute__((leaf)) is actually good for in full LTO.

In D131628#3899670, @mysterymath wrote:

In D131628#3898605, @nlopes wrote:

So, dropping nocallback during linking seems the way to go (and adjust the docs accordingly). Once you link, you can do inter-proc analysis and recover that info if needed anyway.

This would make Clang the only implementation of __attribute__((leaf)) to do so.
It's also not generally possible to recover the nocallback information about functions external to the LTO unit, which is the only thing __attribute__((leaf)) is actually good for in full LTO.

Well, I haven't seen any numbers regarding how useful this attribute is, so I can't comment on that.
I meant to drop the attribute for the functions that are linked in. I agree it doesn't have to be dropped for the external functions.

In D131628#3899687, @nlopes wrote:

I meant to drop the attribute for the functions that are linked in. I agree it doesn't have to be dropped for the external functions.

EDIT:
Hmm, I gave this a quick try with llvm-link, and it looks like the nocallback attribute is already dropped from the declaration when a definition is linked in.

@nlopes, can we create a separate issue for finding the right semantics, and move forward by making this change describe as accurately as possible the current state of affairs?
Given that the original patch has been in LLVM for a long time, spanning many versions, it seems like we'd want to rectify the undocumentedness of the attribute ASAP, then work towards changing its semantics afterwards.

In D131628#3900066, @mysterymath wrote:

@nlopes, can we create a separate issue for finding the right semantics, and move forward by making this change describe as accurately as possible the current state of affairs?
Given that the original patch has been in LLVM for a long time, spanning many versions, it seems like we'd want to rectify the undocumentedness of the attribute ASAP, then work towards changing its semantics afterwards.

I think the effort of doing it right once is lower than doing it twice, but do as you see fit. As long as things get fixed..
Google doesn't have a good track record in fixing "temporary" hacks. The first time I approved one of such hacks was more than 10 years ago and the code is still there and google refuses to remove it. The other one had to be one of my students to fix it. Hence I'm a bit skeptical about fixing things "afterwards".

In D131628#3901837, @nlopes wrote:

In D131628#3900066, @mysterymath wrote:

@nlopes, can we create a separate issue for finding the right semantics, and move forward by making this change describe as accurately as possible the current state of affairs?
Given that the original patch has been in LLVM for a long time, spanning many versions, it seems like we'd want to rectify the undocumentedness of the attribute ASAP, then work towards changing its semantics afterwards.

I think the effort of doing it right once is lower than doing it twice, but do as you see fit. As long as things get fixed..
Google doesn't have a good track record in fixing "temporary" hacks. The first time I approved one of such hacks was more than 10 years ago and the code is still there and google refuses to remove it. The other one had to be one of my students to fix it. Hence I'm a bit skeptical about fixing things "afterwards".

It would be good to have a action plan like “if not fixed before llvm 16 branch date, revert it fully” to avoid keeping such stuff live for years (learn from previous mistakes, as you said).

mysterymath added inline comments.Nov 2 2022, 11:34 AM

llvm/docs/LangRef.rst
1812	The GCC docs specify that the called function can return to the caller's "compilation unit" via either return or exception handling. The current uses of nocallback in LLVM seem to also allow exception handling control flow out of the callback, so we should mention this here too.

mysterymath added a reviewer: mysterymath.Nov 2 2022, 11:34 AM

gulfem mentioned this in D137360: [Linker] Remove nocallback attribute while linking.Nov 3 2022, 2:25 PM

Please see https://reviews.llvm.org/D137360 that removes nocallback attribute while linking modules.

nhaehnle added inline comments.Nov 3 2022, 10:28 PM

llvm/docs/LangRef.rst
1812	"Translation unit" isn't an LLVM IR concept, so most likely this should say "module" instead. I glimpsed some earlier discussion that seems related, but admittedly didn't get to read all of the history, so my apologies if this is redundant. And I agree with @mysterymath that it should be explicitly stated that jumping into the caller's module via a return or exception is fine.