Page MenuHomePhabricator

[nofree] Refine concurrency requirements
Needs RevisionPublic

Authored by nhaehnle on May 1 2021, 2:47 PM.

Details

Summary

(Triggered by discussion on https://reviews.llvm.org/D100141, this is
a counter-proposal to https://reviews.llvm.org/D100676)

Declare that a nofree function cannot arrange for memory to be freed
that was dereferenceable before the call -- whether by (transitive) call
or by communication with another thread.

This is a simplification from the perspective of using the information
provided by the function attribute when optimizing the caller, at the
cost of complicating the inference of nofree.

This change arguably increases the expressive power of the IR, since we
can now have functions that are (usefully) nofree without being nosync.
Before this change, nofree on a not-nosync function does not do
anything.

The attributor is updated so that a function containing a volatile
memory operation or a release (or stronger) atomic operations is not
nofree. The reasoning behind only checking for release ordering is
explained in a code comment.

@jdoerfert: I have zero knowledge of the attributor infrastructure, so if
this direction is taken I'd appreciate a review specifically of those parts.

Diff Detail

Event Timeline

nhaehnle created this revision.May 1 2021, 2:47 PM
nhaehnle requested review of this revision.May 1 2021, 2:47 PM
Herald added a reviewer: baziotis. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
Herald added a subscriber: bbn. · View Herald Transcript
reames requested changes to this revision.May 1 2021, 6:41 PM

At a minimum, you need to update FunctionAttrs.cpp.

This revision now requires changes to proceed.May 1 2021, 6:41 PM

LGTM for the LangRef changes.
I agree the proposed semantics is more useful this way. Plus I agree with the argument that the IR is more expressive this way, as it then supports nofree functions that do atomic stuff.

nikic added a subscriber: nikic.May 3 2021, 11:19 AM

+1 for the semantics change from me as well.

reames added a comment.May 3 2021, 4:40 PM

I am neutral on the proposed change, but I would very much like to hear from @sstefan1, the original author of the nofree attribute.

I am neutral on the proposed change, but I would very much like to hear from @sstefan1, the original author of the nofree attribute.

I'm not actually the one who introduced it, but I think this sounds good. Attributor changes look good as well.

llvm/docs/LangRef.rst
1619–1620

If I'm not wrong, this part hasn't been implemented yet? We won't infer nofree for a function that frees memory it allocated.

llvm/lib/Transforms/IPO/AttributorAttributes.cpp
1408

Maybe make this static, like AANoSyncImpl does?

nhaehnle updated this revision to Diff 343532.May 6 2021, 4:34 PM
nhaehnle marked an inline comment as done.

Address review feedback: Adding the required change to FunctionAttrs that @reames pointed out; in the new version, I also made sure to really go through all test cases.

Curiously, there is a test case in Analysis/ValueTracking/memory-dereferenceable.ll which already seemed to assume the semantics introduced with this change. At least, the comment seems to follow the logic that readonly implies nofree, which implies (*without* nosync) that dereferenceable on function arguments survives throughout the body. (Or perhaps the comment there could be interpreted as the belief that readonly should imply nosync? But that doesn't seem right to me.)

llvm/docs/LangRef.rst
1619–1620

Well, we do infer nofree for functions that contain alloca, which is implicitly freed at the end, though you're probably right about malloc and friends. In any case, that doesn't contradict what's written here.

llvm/lib/Transforms/IPO/AttributorAttributes.cpp
1408

I'm moving this helper around in the second version, so the point hopefully becomes moot :)

nhaehnle retitled this revision from [RFC][nofree] Refine concurrency requirements to [nofree] Refine concurrency requirements.May 6 2021, 4:37 PM
sstefan1 added inline comments.May 7 2021, 7:56 AM
llvm/docs/LangRef.rst
1619–1620

Yeah, that was just an observation. Maybe we have a TODO in AANoFree for that case?

reames requested changes to this revision.May 7 2021, 8:27 AM
reames added inline comments.
llvm/lib/Transforms/IPO/FunctionAttrs.cpp
1303

This is wrong. A thread can coordinate using only acquire ordering within 'f'.

Example:
g = o; // the object being potentially freed
if (g_acquire)

return; // then *caller* does release store

use o;

The other thread does:
while (!g) {

g_acquire = true;
while (!g_release) {}
free(g);

}

Please exactly match the existing nosync logic in this file by using isOrderedAtomic helper in this file. We can be more aggressive later if desired.

This revision now requires changes to proceed.May 7 2021, 8:27 AM

I'm late for the party, apologies.

Let me put down some thoughts:

  1. What is the motivation here? I mean, which functions, except very few intrinsic, would be "strong" nofree but not nosync?
  2. Doesn't this just mean we would check for nosync in order to derive nofree? What would be different?
  3. Have you considered the pointer nofree and if it does not suffices for your use case:
%ptr = not_captured i8*
call @completely_unknown(i8* nofree nocapture)  ; <-- doesn't free %ptr even though it could free other stuff.
llvm/lib/Transforms/IPO/AttributorAttributes.cpp
1452

You would ask AANoSync for the status on all those instructions, and that is the crucial point why I don't even see the benefit here. See main comment.

Let me put down some thoughts:

  1. What is the motivation here? I mean, which functions, except very few intrinsic, would be "strong" nofree but not nosync?

My main motivation here is to make the nofree attribute less surprising and easier to consume. I find having an attribute in the IR that is called "no free" but means "can free, unless some other attribute is present" pretty surprising.

It comes down to the ability to think about callees as abstracted black boxes: I care about the side effects of calling the black box; I don't care (and don't want to have to care) about how those side effects come to be.

  1. Doesn't this just mean we would check for nosync in order to derive nofree? What would be different?

That is one option, though I believe it is slightly more conservative than it needs to be. See the inline discussion with @reames. (There may still be non-correctness arguments for going down the conservative path of course.)

llvm/lib/Transforms/IPO/FunctionAttrs.cpp
1303

This is an interesting example, but I don't think it shows what you want it to show? Without the release store in the caller, the other thread cannot proceed to free(g). In other words, the point at which g/o stops being dereferenceable is the release store in the caller, not the acquire load in the callee.

The acquire load is part of some coordination with the other thread, but it's redundant as far as the free concerned. At least it looks that way to me.

To attack this from a different angle, consider a slight variation of the example where the caller does a g_release store before calling f. In that case, putting dereferenceable on o is incorrect, because the other thread could free o before f is even entered.

From yet another different angle, which part of the reasoning in the comment in the AANoFreeImpl is wrong?

reames added inline comments.May 9 2021, 9:21 AM
llvm/lib/Transforms/IPO/FunctionAttrs.cpp
1303

I'm not going to debate concurrency in this review. My (blocking, mandatory) request is that you exactly match existing behavior in this review.

If you wish to open a follow up review to refine the concurrency logic, we can do so. Warning: Concurrency is *hard* and we will need to loop in some experts who are not on this thread, and should not have to care about this review.

Let me put down some thoughts:

  1. What is the motivation here? I mean, which functions, except very few intrinsic, would be "strong" nofree but not nosync?

My main motivation here is to make the nofree attribute less surprising and easier to consume. I find having an attribute in the IR that is called "no free" but means "can free, unless some other attribute is present" pretty surprising.

But all our attributes work that way so changing nofree would make it an outlier.
Some examples:

willreturn + nounwind -> comes back and executes the next instruction after the call
argmemonly/inaccessiblemem_only + nosync -> there is no synchronization that revealed new values for globals

Depending on where we did fall wrt. synchronizing intrinsics you really cannot even use readnone/readonly
without nosync (maybe restricted to convergent functions).

The reason we want this is that attributes should compose. Almost always they also help on their own.
nofree without nosync can in fact be used for optimization. However, if we force it to go hand
in hand we loose that. Let me show you by example:

a = malloc
b = malloc
unknown(a);
nofree_nocapture_only(a, b);

Here, I can do heap2stack for b but not for a because nofree_nocapture_only could synchronize with someone to free a which escaped before.

It comes down to the ability to think about callees as abstracted black boxes: I care about the side effects of calling the black box; I don't care (and don't want to have to care) about how those side effects come to be.

I do not understand how this would be affected. The attributes limit the effects of the black box, regardless if there is one or two.
Could you elaborate why the black box view would be impaired if we do not apply this change?

  1. Doesn't this just mean we would check for nosync in order to derive nofree? What would be different?

That is one option, though I believe it is slightly more conservative than it needs to be. See the inline discussion with @reames. (There may still be non-correctness arguments for going down the conservative path of course.)

The only more conservative part I can see is arguably out of range for the static analyses we have right now.
That is, we could add "strong nofree" if we show that all synchronizing parties do not free. Other than that
I fail to see how this is not just nofree + nosync packaged as nofree, and that does not make sense to
me at all. Composability is useful, as shown above.

nlopes added a comment.May 9 2021, 3:17 PM

@jdoerfert a function that syncs with another thread may not free anything itself and the other thread may not free anything either.
If you require an optimization to have nofree + nosync in a call to preserve dereferenceability then you can't optimize this case.

That said, I had a thought last night and it isn't clear to me what to do with escaped pointers. On one hand it's nice for the optimizers if nofree implies escaped pointers aren't freed. This, however, makes inferring nofree harder. Functions that sync with other threads can't really be marked nofree without IPA. Essentially it degenerates in your concern, that nofree then implies nosync (in practice). However, it allows improved precision in theory (in case someone cares and implements an IPA).

Another possibility is to restrict nofree to unescaped pointers. But that degenerates in tagging function arguments with nofree, as unescaped can't be freed anyway.

The current proposal leaves room for improvement in the future if someone cares about multi-threaded programs and implements an IPA. Since it doesn't seem to impose any major pain in the implementation, it seems to be the best option.

@jdoerfert a function that syncs with another thread may not free anything itself and the other thread may not free anything either.
If you require an optimization to have nofree + nosync in a call to preserve dereferenceability then you can't optimize this case.

I disagree, and I think this is the conceptual differences we have here.

If one does only look at a function in isolation then I agree, nofree alone is not sufficient to make statements about all call sites and arguments.
However, if you look at a call site of a nofree function the situation is different because you have context information about the arguments.

Let's go back to my last example:

a = malloc
b = malloc
unknown(a);                  // no attributes 0-> escapes
nofree_nocapture_only(a, b); // `nofree`, `nocapture` on args

I know that at the end of this sequence b has not been freed, even though there is no nosync in sight.
I can also derive nofree for nofree_nocapture_only without worrying about atomics, for example.
(Here is the sequence in action with heap2stack kicking in for b: https://godbolt.org/z/hjYrhaEvW)

So, with nofree and nosync being completely separated, as it was in the very beginning, you can:

  1. Optimize even if only one is present, given additional call site information.
  2. Derive nofree even if nosync is not present.

All the "strong nofree" effectively gives us is that it combines the to attributes.
Let's introduce a helper:
bool cannotFreeAtAll(Function &F) { return F.hasAttr(NoFree) && F.hasAttr(NoSync); }
you get your guarantee that F will not, in any way` free something. That said,
I find the call site specific reasoning is the way to go anyway but if some generic
"for all call sites and arguments" handling is required we can go with the helper instead.

@jdoerfert a function that syncs with another thread may not free anything itself and the other thread may not free anything either.
If you require an optimization to have nofree + nosync in a call to preserve dereferenceability then you can't optimize this case.

I disagree, and I think this is the conceptual differences we have here.

If one does only look at a function in isolation then I agree, nofree alone is not sufficient to make statements about all call sites and arguments.
However, if you look at a call site of a nofree function the situation is different because you have context information about the arguments.

Let's go back to my last example:

a = malloc
b = malloc
unknown(a);                  // no attributes 0-> escapes
nofree_nocapture_only(a, b); // `nofree`, `nocapture` on args

I know that at the end of this sequence b has not been freed, even though there is no nosync in sight.
I can also derive nofree for nofree_nocapture_only without worrying about atomics, for example.
(Here is the sequence in action with heap2stack kicking in for b: https://godbolt.org/z/hjYrhaEvW)

So, with nofree and nosync being completely separated, as it was in the very beginning, you can:

  1. Optimize even if only one is present, given additional call site information.
  2. Derive nofree even if nosync is not present.

Your examples are around nofree arguments, not nofree functions, which is what this patch is about.
nofree in functions is useful for pointers that are *not* passed as argument, like this:

f(i8* %p) {
  load %p  ; implies it's dereferenceable
  call @g() nofree
  ; is %p still dereferenceable or not?
}

This patch states that yes, %p is still dereferenceable after the call to @g as it's tagged with nofree. That's it.

@jdoerfert a function that syncs with another thread may not free anything itself and the other thread may not free anything either.
If you require an optimization to have nofree + nosync in a call to preserve dereferenceability then you can't optimize this case.

I disagree, and I think this is the conceptual differences we have here.

If one does only look at a function in isolation then I agree, nofree alone is not sufficient to make statements about all call sites and arguments.
However, if you look at a call site of a nofree function the situation is different because you have context information about the arguments.

Let's go back to my last example:

a = malloc
b = malloc
unknown(a);                  // no attributes 0-> escapes
nofree_nocapture_only(a, b); // `nofree`, `nocapture` on args

I know that at the end of this sequence b has not been freed, even though there is no nosync in sight.
I can also derive nofree for nofree_nocapture_only without worrying about atomics, for example.
(Here is the sequence in action with heap2stack kicking in for b: https://godbolt.org/z/hjYrhaEvW)

So, with nofree and nosync being completely separated, as it was in the very beginning, you can:

  1. Optimize even if only one is present, given additional call site information.
  2. Derive nofree even if nosync is not present.

Your examples are around nofree arguments, not nofree functions, which is what this patch is about.

nofree functions imply nofree arguments so it very much makes a difference if we make it harder to infer
and annotate the former. As we already moved to a more semantic-based definition of escaped we can actually
look at an example where this makes a difference. Take pointers passed via memory to see how you cannot just
move the nofree to account for the change made by this patch:

struct { int *a } s;
s.a = malloc()               // this is a store.
nofree_nocapture_only(&s)    // we can potentially show `s.a` is not captured in the function but we cannot
                             // annotate that fact as a `nofree` argument without argument promotion, which
                             // we cannot always do.
// s.a is still known to be dereferenceable even without `nosync`. The function `nofree` is needed, `nofree`
// argument attributes are not involved.

nofree in functions is useful for pointers that are *not* passed as argument, like this:

f(i8* %p) {
  load %p  ; implies it's dereferenceable
  call @g() nofree
  ; is %p still dereferenceable or not?
}

This patch states that yes, %p is still dereferenceable after the call to @g as it's tagged with nofree. That's it.

Let's not ignore the arguments that are somehow passed though. Anyway, this patch does not actually add anything you
need for you example. If g is nofree & nosync you get exactly the same effect already. I hope we agree on that.

If we allow user feedback you can also make your @f nosync and get the effect you want even if @g is not nosync,
This is especially useful if @g is an intrinsic users cannot actually annotate.

In addition to the examples I brought up that are impacted, let me reiterate what I said in my first post:
Other attributes work in the same composable way:

void foo(int * __restrict__ flag, char* p) {
  argmemonly(flag);                 // Says argmemonly and p is not passed nor does p alias flag, still p is potentially written.
                                    // This is like a call to a `nofree` function which might still free pointers via synchronization.
                                    // `nosync` is a orthogonal component to these attributes as we can otherwise not derive them.
                                    // However, as with `nofree`, we can use the attributes even without `nosync` in some situations.
}

Hi Johannes,

before I give a more complete response, I'd like to understand your position more fully. To that end:

nofree functions imply nofree arguments so it very much makes a difference if we make it harder to infer
and annotate the former. As we already moved to a more semantic-based definition of escaped we can actually
look at an example where this makes a difference. Take pointers passed via memory to see how you cannot just
move the nofree to account for the change made by this patch:

struct { int *a } s;
s.a = malloc()               // this is a store.
nofree_nocapture_only(&s)    // we can potentially show `s.a` is not captured in the function but we cannot
                             // annotate that fact as a `nofree` argument without argument promotion, which
                             // we cannot always do.
// s.a is still known to be dereferenceable even without `nosync`. The function `nofree` is needed, `nofree`
// argument attributes are not involved.

Is that actually true? Regardless of the nocapture flag, s.a may be passed to another thread which then frees it. (nocapture only applies to &s in the first place, and even then, it does not say anything about temporary copies of the pointer that may be passed to other threads.)

In addition to the examples I brought up that are impacted, let me reiterate what I said in my first post:
Other attributes work in the same composable way:

void foo(int * __restrict__ flag, char* p) {
  argmemonly(flag);                 // Says argmemonly and p is not passed nor does p alias flag, still p is potentially written.
                                    // This is like a call to a `nofree` function which might still free pointers via synchronization.
                                    // `nosync` is a orthogonal component to these attributes as we can otherwise not derive them.
                                    // However, as with `nofree`, we can use the attributes even without `nosync` in some situations.
}

I don't think this is true. LangRef certainly doesn't support that claim as far as I can tell (the definition of argmemonly talks about there being no side effects at all, without any exception for side effects triggered via other threads), and it turns out that due to confusion about this in the attributor, I can actually provoke a miscompilation fairly easily. So there may be a bigger issue at play there. If you do think that there is consensus and/or documentation about argmemonly working the way you think it should work, can you please point to it?

nofree functions imply nofree arguments so it very much makes a difference if we make it harder to infer
and annotate the former. As we already moved to a more semantic-based definition of escaped we can actually
look at an example where this makes a difference. Take pointers passed via memory to see how you cannot just
move the nofree to account for the change made by this patch:

struct { int *a } s;
s.a = malloc()               // this is a store.
nofree_nocapture_only(&s)    // we can potentially show `s.a` is not captured in the function but we cannot
                             // annotate that fact as a `nofree` argument without argument promotion, which
                             // we cannot always do.
// s.a is still known to be dereferenceable even without `nosync`. The function `nofree` is needed, `nofree`
// argument attributes are not involved.

Is that actually true? Regardless of the nocapture flag, s.a may be passed to another thread which then frees it. (nocapture only applies to &s in the first place, and even then, it does not say anything about temporary copies of the pointer that may be passed to other threads.)

I realize now that I was partially wrong about this: the definition of the nocapture attribute does say that the function cannot send the pointer to another thread, but the pointer in question is &s. So it still seems to me that in the example, the callee is allowed to send s.a to another thread which then frees it (given today's definition of nofree).

If this example does fall like I think it does, is there still an optimization benefit to having the function attribute nofree as defined today vs. as defined in this patch? As I see it, there were three arguments:

  1. Inference doesn't have to worry about synchronization.
  2. Other attributes work the same way; however, this is false at least for argmemonly and arguably also for inaccessiblememonly, except that one presumably never gets inferred?
  3. Today's nofree enables annotations and optimizations that the proposed change would no longer allow. I can't say for certain whether this is false, but at least the examples given so far seem to fail (the first one because you can just use the nofree argument attribute instead, and the second is discussed above in this comment).

That leaves as a single argument the complexity of inference, but how strong of an argument is that, really?

P.S.: The example I have in mind where confusion about argmemonly can be used to trigger a miscompile is at https://godbolt.org/z/dEG8n6W3E. Current main merges the two loads from %q, which is incorrect since @setup could arrange for a second thread to be created which changes its value in between.

nhaehnle updated this revision to Diff 344603.May 11 2021, 5:18 PM

Split up the patch as per @reames' request

jdoerfert added a comment.EditedMay 11 2021, 5:37 PM

[EDIT: Added response to newest comment in the end.]

Hi Johannes,

before I give a more complete response, I'd like to understand your position more fully. To that end:

nofree functions imply nofree arguments so it very much makes a difference if we make it harder to infer
and annotate the former. As we already moved to a more semantic-based definition of escaped we can actually
look at an example where this makes a difference. Take pointers passed via memory to see how you cannot just
move the nofree to account for the change made by this patch:

struct { int *a } s;
s.a = malloc()               // this is a store.
nofree_nocapture_only(&s)    // we can potentially show `s.a` is not captured in the function but we cannot
                             // annotate that fact as a `nofree` argument without argument promotion, which
                             // we cannot always do.
// s.a is still known to be dereferenceable even without `nosync`. The function `nofree` is needed, `nofree`
// argument attributes are not involved.

Is that actually true? Regardless of the nocapture flag, s.a may be passed to another thread which then frees it. (nocapture only applies to &s in the first place, and even then, it does not say anything about temporary copies of the pointer that may be passed to other threads.)

I am not aware of a way to pass to another thread that does not capture a pointer. Please feel free to show me one.

In addition to the examples I brought up that are impacted, let me reiterate what I said in my first post:
Other attributes work in the same composable way:

void foo(int * __restrict__ flag, char* p) {
  argmemonly(flag);                 // Says argmemonly and p is not passed nor does p alias flag, still p is potentially written.
                                    // This is like a call to a `nofree` function which might still free pointers via synchronization.
                                    // `nosync` is a orthogonal component to these attributes as we can otherwise not derive them.
                                    // However, as with `nofree`, we can use the attributes even without `nosync` in some situations.
}

I don't think this is true. LangRef certainly doesn't support that claim as far as I can tell (the definition of argmemonly talks about there being no side effects at all, without any exception for side effects triggered via other threads), and it turns out that due to confusion about this in the attributor, I can actually provoke a miscompilation fairly easily. So there may be a bigger issue at play there. If you do think that there is consensus and/or documentation about argmemonly working the way you think it should work, can you please point to it?

So you are arguing armemonly is also implying nosync? Then we get into the same problems we have here. You don't actually gain expressiveness, after all you can check if both attribute X and nosync are present, but you loose the ability to derive and utilize them separately. One conceptual thing to consider is that almost all of this predates nosync in the first place. That said, I don't see how you would interpret the text as there are no side effects by other threads:

argmemonly

    This attribute indicates that the only memory accesses inside function are loads and stores from objects pointed to by its pointer-typed arguments, with arbitrary offsets. Or in other words, all memory operations in the function can refer to memory only using pointers based on its function arguments.

    Note that argmemonly can be used together with readonly attribute in order to specify that function reads only from its arguments.

    If an argmemonly function reads or writes memory other than the pointer arguments, or has other side-effects, the behavior is undefined.

What in the above definition prevents me from doing this:

void foo(atomic flag) { atomic_write(flag, 1); while(atomic_read(flag) == 1); atomic_write(flag, 2); }

Inside the function all memory that is accesses are loads/stores from objects pointed by pointer-type arguments (with 0 offset). Do you disagree?

FWIW, this basically breaks down with stuff like malloc which we assume is inaccessiblememonly but it is in fact *not*, and cannot be, nosync. So if we require attributes to imply nosync, malloc cannot be inaccessiblememonly anymore. (see https://reviews.llvm.org/D98605#2625243)


nofree functions imply nofree arguments so it very much makes a difference if we make it harder to infer
and annotate the former. As we already moved to a more semantic-based definition of escaped we can actually
look at an example where this makes a difference. Take pointers passed via memory to see how you cannot just
move the nofree to account for the change made by this patch:

struct { int *a } s;
s.a = malloc()               // this is a store.
nofree_nocapture_only(&s)    // we can potentially show `s.a` is not captured in the function but we cannot
                             // annotate that fact as a `nofree` argument without argument promotion, which
                             // we cannot always do.
// s.a is still known to be dereferenceable even without `nosync`. The function `nofree` is needed, `nofree`
// argument attributes are not involved.

Is that actually true? Regardless of the nocapture flag, s.a may be passed to another thread which then frees it. (nocapture only applies to &s in the first place, and even then, it does not say anything about temporary copies of the pointer that may be passed to other threads.)

I realize now that I was partially wrong about this: the definition of the nocapture attribute does say that the function cannot send the pointer to another thread, but the pointer in question is &s. So it still seems to me that in the example, the callee is allowed to send s.a to another thread which then frees it (given today's definition of nofree).

As stated in my text on the right, we can potentially derive s.a is not captured and we can derive, or read from attributes, &s is not captured. If we cannot derive the either, we can certainly expect someone to free it s.a. However, if we can derive both, we cannot actually put the nofree on the s.a pointer to manifest the information. So there is no place to put it and with a stronger nofree on the function we have less chance to put that one. [While right now there is no place for the "not captured" of the &s.a pointer, we are already working on such ways, e.g., https://reviews.llvm.org/D93189 (and there is an RFC thread).]

If this example does fall like I think it does, is there still an optimization benefit to having the function attribute nofree as defined today vs. as defined in this patch? As I see it, there were three arguments:

  1. Inference doesn't have to worry about synchronization.

Agreed. Split the problem. Though, to be fair, it is trivial to require nosync for nofree to be deducible in the first place.

  1. Other attributes work the same way; however, this is false at least for argmemonly and arguably also for inaccessiblememonly, except that one presumably never gets inferred?

I don't think it is.

  1. Today's nofree enables annotations and optimizations that the proposed change would no longer allow. I can't say for certain whether this is false, but at least the examples given so far seem to fail (the first one because you can just use the nofree argument attribute instead, and the second is discussed above in this comment).

I think once I manage to explain the examples you will reconsider this point. To give you another one:

__attribute__((nosync)) // or __attribute__((assume("nosync")) or #pragma omp assume ext_no_synchronization
void user_code() {
  a = malloc();
  b = malloc();
  unknown_function_with_a_now_nosync_call_site(a, b);
  ...
  free(a);
}

That leaves as a single argument the complexity of inference, but how strong of an argument is that, really?

I think there are a lot of reasons to keep it separated, the examples above still stand. That said, why would
we combine it? It saves one attribute lookup?

P.S.: The example I have in mind where confusion about argmemonly can be used to trigger a miscompile is at https://godbolt.org/z/dEG8n6W3E. Current main merges the two loads from %q, which is incorrect since @setup could arrange for a second thread to be created which changes its value in between.

Agreed this is a bug. Though, it's in instcombine (or probably MemorySSA) not the Attributor. If you disagree you have to tell me which part of the argmemonly definition is violated by release: "This attribute indicates that the only memory accesses inside function are loads and stores from objects pointed to by its pointer-typed arguments, with arbitrary offsets. Or in other words, all memory operations in the function can refer to memory only using pointers based on its function arguments."

nlopes accepted this revision.May 12 2021, 10:47 AM
  1. Today's nofree enables annotations and optimizations that the proposed change would no longer allow. I can't say for certain whether this is false, but at least the examples given so far seem to fail (the first one because you can just use the nofree argument attribute instead, and the second is discussed above in this comment).

I think once I manage to explain the examples you will reconsider this point. To give you another one:

__attribute__((nosync)) // or __attribute__((assume("nosync")) or #pragma omp assume ext_no_synchronization
void user_code() {
  a = malloc();
  b = malloc();
  unknown_function_with_a_now_nosync_call_site(a, b);
  ...
  free(a);
}

This is a good example, that shows that if the user annotates a function with nosync then inferring nofree becomes easier. So the proposed semantics doesn't require nosync, but inference can be aided by its presence (not in this patch yet).

I think all the reviewers' concerns have been addresses at this point. Please let us know if there's some remaining concern regarding the proposed semantics.
The semantics proposed in this patch makes a lot of sense to me. And to make things clear, this proposal *does not* make nofree imply nosync (nor the other way around).

jdoerfert requested changes to this revision.May 12 2021, 11:05 AM
  1. Today's nofree enables annotations and optimizations that the proposed change would no longer allow. I can't say for certain whether this is false, but at least the examples given so far seem to fail (the first one because you can just use the nofree argument attribute instead, and the second is discussed above in this comment).

I think once I manage to explain the examples you will reconsider this point. To give you another one:

__attribute__((nosync)) // or __attribute__((assume("nosync")) or #pragma omp assume ext_no_synchronization
void user_code() {
  a = malloc();
  b = malloc();
  unknown_function_with_a_now_nosync_call_site(a, b);
  ...
  free(a);
}

This is a good example, that shows that if the user annotates a function with nosync then inferring nofree becomes easier. So the proposed semantics doesn't require nosync, but inference can be aided by its presence (not in this patch yet).

This is not at all what the example shows. The example shows that we cannot infer nofree for unknown_function_with_a_now_nosync_call_site with the new semantics but with the old semantics we could.
Only with the existing semantics you can optimize this example, not with the new one. I would have assumed this is implied given that I provided the example and I am arguing the proposed semantics are useless and prevent optimizations.

I think all the reviewers' concerns have been addresses at this point. Please let us know if there's some remaining concern regarding the proposed semantics.

My concerns have not changed at all. Unsure why you would assume they have been addressed.

The semantics proposed in this patch makes a lot of sense to me. And to make things clear, this proposal *does not* make nofree imply nosync (nor the other way around).

For deduction purposes it effectively does imply/require nosync. If you disagree, please provide an example where we can derive nofree without nosync under the proposed semantics.

This revision now requires changes to proceed.May 12 2021, 11:05 AM
  1. Today's nofree enables annotations and optimizations that the proposed change would no longer allow. I can't say for certain whether this is false, but at least the examples given so far seem to fail (the first one because you can just use the nofree argument attribute instead, and the second is discussed above in this comment).

I think once I manage to explain the examples you will reconsider this point. To give you another one:

__attribute__((nosync)) // or __attribute__((assume("nosync")) or #pragma omp assume ext_no_synchronization
void user_code() {
  a = malloc();
  b = malloc();
  unknown_function_with_a_now_nosync_call_site(a, b);
  ...
  free(a);
}

This is a good example, that shows that if the user annotates a function with nosync then inferring nofree becomes easier. So the proposed semantics doesn't require nosync, but inference can be aided by its presence (not in this patch yet).

This is not at all what the example shows. The example shows that we cannot infer nofree for unknown_function_with_a_now_nosync_call_site with the new semantics but with the old semantics we could.
Only with the existing semantics you can optimize this example, not with the new one. I would have assumed this is implied given that I provided the example and I am arguing the proposed semantics are useless and prevent optimizations.

For the example above, I don't see how you can deduce nofree for user_code or for unknown_function_with_a_now_nosync_call_site without further information that is not given in this example. Without knowing anything else, unknown_function_with_a_now_nosync_call_site may free some pointer stored in a global var. Who knows?
If my reading is wrong, please make the example more explicit and explain things more slowly, otherwise I'm unable to understand it. These things are hard and nothing is implied or implicit.

The semantics proposed in this patch makes a lot of sense to me. And to make things clear, this proposal *does not* make nofree imply nosync (nor the other way around).

For deduction purposes it effectively does imply/require nosync. If you disagree, please provide an example where we can derive nofree without nosync under the proposed semantics.

I already gave an example that shows that the semantics proposed in this patch is more expressive that the old one. Please don't mix semantics and some inference algorithm you have in mind (that hasn't been shared with us).
A simple example: consider a program with 2 threads, a main one and a helper thread. The helper thread doesn't free anything. The main thread may sync with the helper thread, so some function are not nosync, but you can still tag them as nofree if they don't free up anything.
So the proposed semantics is strictly more expressive than the old one. Plus it doesn't make it any harder to infer than before.

Even if the current patch cannot infer nofree for the example I've provided, there's nothing preventing someone writing an algorithm that could infer that in future. Plus, using your own argument, users may provide that information.

So you are arguing armemonly is also implying nosync? Then we get into the same problems we have here. You don't actually gain expressiveness, after all you can check if both attribute X and nosync are present, but you loose the ability to derive and utilize them separately. One conceptual thing to consider is that almost all of this predates nosync in the first place. That said, I don't see how you would interpret the text as there are no side effects by other threads:

argmemonly

    This attribute indicates that the only memory accesses inside function are loads and stores from objects pointed to by its pointer-typed arguments, with arbitrary offsets. Or in other words, all memory operations in the function can refer to memory only using pointers based on its function arguments.

    Note that argmemonly can be used together with readonly attribute in order to specify that function reads only from its arguments.

    If an argmemonly function reads or writes memory other than the pointer arguments, or has other side-effects, the behavior is undefined.

What in the above definition prevents me from doing this:

void foo(atomic flag) { atomic_write(flag, 1); while(atomic_read(flag) == 1); atomic_write(flag, 2); }

Inside the function all memory that is accesses are loads/stores from objects pointed by pointer-type arguments (with 0 offset). Do you disagree?

The part of the definition that I would argue prevents you from doing this is the "or has other side-effects".

FWIW, this basically breaks down with stuff like malloc which we assume is inaccessiblememonly but it is in fact *not*, and cannot be, nosync. So if we require attributes to imply nosync, malloc cannot be inaccessiblememonly anymore. (see https://reviews.llvm.org/D98605#2625243)

There are two directions in which one can take this.

First, your desired path forward appears to be that argmemonly on a callee can only be taken at face value if the callee is also nosync. Presumably, you would want the same treatment to apply to inaccessiblememonly. So then if malloc cannot be nosync, what's the point of making it inaccessiblememonly? What would that still allow you to deduce that you cannot deduce even without the inaccessiblememonly?

Second, perhaps malloc can be nosync. I see the quote from the C standard, but it's unclear to me whether that language is good for anything other than ensuring no data races in a formal model, and surely there are other means to achieve the same goal, e.g. saying that a re-allocated portion of memory is a different memory location as far as the memory model is concerned. The only way I can see to perform meaningful synchronization through free and malloc would be to compare the result of malloc with some pointer that was allocated earlier (if they're equal, this implies that the earlier allocation was freed, and we have a synchronization edge that can be relied upon). It may be possible to make this undefined somehow, perhaps involving the non-determinism trick that is proposed for pointer provenance can be applied here as well.

[...]

  1. Today's nofree enables annotations and optimizations that the proposed change would no longer allow. I can't say for certain whether this is false, but at least the examples given so far seem to fail (the first one because you can just use the nofree argument attribute instead, and the second is discussed above in this comment).

I think once I manage to explain the examples you will reconsider this point. To give you another one:

__attribute__((nosync)) // or __attribute__((assume("nosync")) or #pragma omp assume ext_no_synchronization
void user_code() {
  a = malloc();
  b = malloc();
  unknown_function_with_a_now_nosync_call_site(a, b);
  ...
  free(a);
}

Not sure what you're saying here. Do you want the unknown function to be nofree but it *does* contain synchronization, except that then the external nosync "overrides" that fact and we pray that the programmer knew what they were doing?

That leaves as a single argument the complexity of inference, but how strong of an argument is that, really?

I think there are a lot of reasons to keep it separated, the examples above still stand. That said, why would
we combine it? It saves one attribute lookup?

To me the fundamental problem is having an attribute that says "calling this function doesn't free memory" on a call that does, in fact, free memory. Why would/should the caller care about whether the freeing happens in the same thread or not? I think Nuno's line of argument about separating the semantics from the implementation goes in a similar direction.

This doesn't mean that we can't have the more structural (as opposed to semantic) attribute you want to have, but I'd be in favor of naming it more accurately, e.g. nothreadlocalfree.

P.S.: The example I have in mind where confusion about argmemonly can be used to trigger a miscompile is at https://godbolt.org/z/dEG8n6W3E. Current main merges the two loads from %q, which is incorrect since @setup could arrange for a second thread to be created which changes its value in between.

Agreed this is a bug. Though, it's in instcombine (or probably MemorySSA) not the Attributor. If you disagree you have to tell me which part of the argmemonly definition is violated by release: "This attribute indicates that the only memory accesses inside function are loads and stores from objects pointed to by its pointer-typed arguments, with arbitrary offsets. Or in other words, all memory operations in the function can refer to memory only using pointers based on its function arguments."

As stated above, it's at the very least the part that says there can be no other side effects. Synchronizing with another thread which then modifies other memory is surely a side effect.

So you are arguing armemonly is also implying nosync? Then we get into the same problems we have here. You don't actually gain expressiveness, after all you can check if both attribute X and nosync are present, but you loose the ability to derive and utilize them separately. One conceptual thing to consider is that almost all of this predates nosync in the first place. That said, I don't see how you would interpret the text as there are no side effects by other threads:

argmemonly

    This attribute indicates that the only memory accesses inside function are loads and stores from objects pointed to by its pointer-typed arguments, with arbitrary offsets. Or in other words, all memory operations in the function can refer to memory only using pointers based on its function arguments.

    Note that argmemonly can be used together with readonly attribute in order to specify that function reads only from its arguments.

    If an argmemonly function reads or writes memory other than the pointer arguments, or has other side-effects, the behavior is undefined.

What in the above definition prevents me from doing this:

void foo(atomic flag) { atomic_write(flag, 1); while(atomic_read(flag) == 1); atomic_write(flag, 2); }

Inside the function all memory that is accesses are loads/stores from objects pointed by pointer-type arguments (with 0 offset). Do you disagree?

The part of the definition that I would argue prevents you from doing this is the "or has other side-effects".

This reading defines FunctionAttr (and others) are wrong right now when deducing almost any attribute in the absence of nosync.
Further, it should not be "other side-effects" as a general statement anyway. It would subsume nounwind and interact badly with infinite loops (D106749), both of which would interact badly with C++ and pure/const functions, as an example.

Long story short, there is still no good reason for attributes not to be composed instead of somehow defined with overlapping semantics.

FWIW, this basically breaks down with stuff like malloc which we assume is inaccessiblememonly but it is in fact *not*, and cannot be, nosync. So if we require attributes to imply nosync, malloc cannot be inaccessiblememonly anymore. (see https://reviews.llvm.org/D98605#2625243)

There are two directions in which one can take this.

First, your desired path forward appears to be that argmemonly on a callee can only be taken at face value if the callee is also nosync. Presumably, you would want the same treatment to apply to inaccessiblememonly. So then if malloc cannot be nosync, what's the point of making it inaccessiblememonly? What would that still allow you to deduce that you cannot deduce even without the inaccessiblememonly?

As I mentioned before, (1) this comes from a time where we did not think about synchronization much (BuildLibCalls.cpp does not once set nosync to this day)
and it was ignored widely, and (2) it helps you at the call site if you have nosync there. I provided call site examples before, most of them apply again.
If you combine call site information with nofree for the function, or inaccessiblememonly for the function you can actually derive things. If nofree,
inaccessiblememonly, readonly, ... cannot be derived in the presence of nosync we loose that ability. Let's make a new example though.

Let's assume this is our entire module:

static int *X = nullptr;

int readX() { return *X; }

void foo() {
  if (!X)
    X = malloc(4);  // let it be known in the module.
}

void bar() {
  *X = 1;
  unknown_inaccessible_mem_only_but_not_nosync();   // e.g. malloc
  argmemonly_nofree_but_not_nosync(X);              // can be derived from the definition of the function and attribute can be made available through HTO or thinLTO.
  // X is still deref here because no thread could have freed it.
}

X is in the entire module deref_or_null(4) but only because we have inaccessible_mem_only (malloc) and argmemonly + nofree (definition + HTO/thinLTO).
nosync is not set for any of the functions.

Second, perhaps malloc can be nosync. I see the quote from the C standard, but it's unclear to me whether that language is good for anything other than ensuring no data races in a formal model, and surely there are other means to achieve the same goal, e.g. saying that a re-allocated portion of memory is a different memory location as far as the memory model is concerned. The only way I can see to perform meaningful synchronization through free and malloc would be to compare the result of malloc with some pointer that was allocated earlier (if they're equal, this implies that the earlier allocation was freed, and we have a synchronization edge that can be relied upon). It may be possible to make this undefined somehow, perhaps involving the non-determinism trick that is proposed for pointer provenance can be applied here as well.

You basically described the side-channel that makes malloc/free not nosync, didn't you?

T0: foo() /* nosync */; while(1) { p = malloc(); atomic_relaxed_record(p); free(p); }
T1: do { p = malloc(); found = atomic_relaxed_lookup(p); free(p); } while (!found); print("foo is done!");

[...]

  1. Today's nofree enables annotations and optimizations that the proposed change would no longer allow. I can't say for certain whether this is false, but at least the examples given so far seem to fail (the first one because you can just use the nofree argument attribute instead, and the second is discussed above in this comment).

I think once I manage to explain the examples you will reconsider this point. To give you another one:

__attribute__((nosync)) // or __attribute__((assume("nosync")) or #pragma omp assume ext_no_synchronization
void user_code() {
  a = malloc();
  b = malloc();
  unknown_function_with_a_now_nosync_call_site(a, b);
  ...
  free(a);
}

Not sure what you're saying here. Do you want the unknown function to be nofree but it *does* contain synchronization, except that then the external nosync "overrides" that fact and we pray that the programmer knew what they were doing?

I'm not sure about pray but I am sure we already assume annotations by the user are correct. We also explicitly query them for
annotations (https://openmp.llvm.org/docs/remarks/OMP133.html, loop vectorizer remarks, ...) and I strongly expect this to become more frequent in the future.

I can also see us, and others, annotate libraries that use things like volatile/atomic accesses and malloc to employ nosync annotations
if we know they are sound from the callers perspective.

Finally, we started to track threads explicitly already, partially using domain knowledge, which allows us to reason about the interaction between threads
(https://reviews.llvm.org/D106397#C2702144NL1110). So even in the presence of synchronizations (atomics, barriers, etc), we can use other attributes
(argmemonly, nofree, ...) and such thread tracking to make useful deductions. This is not possible if we interleave the argmemonly/nofree/... semantics
with nosync. The above optimization is a real thing for a very common scenario on GPUs and also CPUs:

run_in_parallel {
  if (threadid == 0)
    effect();
  barrier();

  ... parallel stuff

  if (threadid == 0)
    effect();
  barrier();
 ...
}

That leaves as a single argument the complexity of inference, but how strong of an argument is that, really?

I think there are a lot of reasons to keep it separated, the examples above still stand. That said, why would
we combine it? It saves one attribute lookup?

To me the fundamental problem is having an attribute that says "calling this function doesn't free memory" on a call that does, in fact, free memory. Why would/should the caller care about whether the freeing happens in the same thread or not? I think Nuno's line of argument about separating the semantics from the implementation goes in a similar direction.

While I understand your argument it breaks a lot of things not just nofree. So if we wanted to do this it had to be way more involved or it is just inconsistent.
We had inaccessible_mem_only earlier or argmemonly. As noted before, the definition talks about the "memory operations in the function". We also do not annotate nosync
for any known builtin, etc. There is soo many things that would need to happen to do this transition in a consistent way. Instead, this picks nofree and argues it is
somewhat special even though it is not. See next paragraph.

This doesn't mean that we can't have the more structural (as opposed to semantic) attribute you want to have, but I'd be in favor of naming it more accurately, e.g. nothreadlocalfree.

Almost all our attributes are "local" not "global". They are local because we can derive them only locally in a reasonable way and some of them make sense only locally anyway as they have
an explicit scope, most often the function. Local information can be combined with call site information, e.g., in a subsequent pass or via HTO/thinLTO, to create actionable information (see
my examples in this thread and above). Making local deduction harder/impossible is not helpful as it prevents us to combine it with other information sources, e.g., domain knowledge or explicit
thread tracking.

P.S.: The example I have in mind where confusion about argmemonly can be used to trigger a miscompile is at https://godbolt.org/z/dEG8n6W3E. Current main merges the two loads from %q, which is incorrect since @setup could arrange for a second thread to be created which changes its value in between.

Agreed this is a bug. Though, it's in instcombine (or probably MemorySSA) not the Attributor. If you disagree you have to tell me which part of the argmemonly definition is violated by release: "This attribute indicates that the only memory accesses inside function are loads and stores from objects pointed to by its pointer-typed arguments, with arbitrary offsets. Or in other words, all memory operations in the function can refer to memory only using pointers based on its function arguments."

As stated above, it's at the very least the part that says there can be no other side effects. Synchronizing with another thread which then modifies other memory is surely a side effect.

The function writes pointer arguments, which is explicitly allowed in the text multiple times before it describes what it shall not do.
It writes them atomically, but that is not disallowed in any way. As I stated above, I see your point but this patch is not even close to what would be required to bake synchronization into everything. You can
read "no synchronization" into some of the lang ref wording but you have to admit argmemonly does talk about "accesses inside the function" twice and does not mention synchronization at all. Further,
you attribute the @setup side-effect to @release, so you argue that the @release function can have a side-effect on arbitrary memory including %q even though "the write to %q" is arguably not in @release.

Instead, the model that I see us using already and which we should embrace is:
@release is not nosync and therefore we can see effects at the call site that were performed by other functions with which @release synchronized.
That attributes the effect properly to the "other function", makes local deduction possible, and composes the attributes we have.

Only with such a model we can hope to build and use a "synchronizes-with graph" that compose well with other arguments.
As https://reviews.llvm.org/D106397#C2702144NL1110 shows, once we stop treating synchronization as a complete black box it is crucial the rest of the system
does not fallback to a pessimistic/unknown state if synchronization may be present.

Quick clarification:

Let's assume this is our entire module:

static int *X = nullptr;

int readX() { return *X; }

void foo() {
  if (!X)
    X = malloc(4);  // let it be known in the module.
}

void bar() {
  *X = 1;
  unknown_inaccessible_mem_only_but_not_nosync();   // e.g. malloc
  argmemonly_nofree_but_not_nosync(X);              // can be derived from the definition of the function and attribute can be made available through HTO or thinLTO.
  // X is still deref here because no thread could have freed it.
}

X is in the entire module deref_or_null(4) but only because we have inaccessible_mem_only (malloc) and argmemonly + nofree (definition + HTO/thinLTO).
nosync is not set for any of the functions.

You really also need nocapture on the argument of argmemonly_nofree_but_not_nosync for this to work, right?

Second, perhaps malloc can be nosync. I see the quote from the C standard, but it's unclear to me whether that language is good for anything other than ensuring no data races in a formal model, and surely there are other means to achieve the same goal, e.g. saying that a re-allocated portion of memory is a different memory location as far as the memory model is concerned. The only way I can see to perform meaningful synchronization through free and malloc would be to compare the result of malloc with some pointer that was allocated earlier (if they're equal, this implies that the earlier allocation was freed, and we have a synchronization edge that can be relied upon). It may be possible to make this undefined somehow, perhaps involving the non-determinism trick that is proposed for pointer provenance can be applied here as well.

You basically described the side-channel that makes malloc/free not nosync, didn't you?

T0: foo() /* nosync */; while(1) { p = malloc(); atomic_relaxed_record(p); free(p); }
T1: do { p = malloc(); found = atomic_relaxed_lookup(p); free(p); } while (!found); print("foo is done!");

I'm not sure what atomic_relaxed_record/lookup are supposed to be. If they work by dereferencing p, then no, that's not what I had in mind. I was thinking along the lines of:

Common setup: p = malloc(); x = 0
T0: { x = 1; free(p); }
T1: { while ((q = malloc()) != p) free(q); assert(x == 1) }

But maybe that's just a variation on what you had in mind. In any case, this kind of side channel is clearly ridiculous and should just be closed. I doubt many people would oppose that :)

Finally, we started to track threads explicitly already, partially using domain knowledge, which allows us to reason about the interaction between threads
(https://reviews.llvm.org/D106397#C2702144NL1110). So even in the presence of synchronizations (atomics, barriers, etc), we can use other attributes
(argmemonly, nofree, ...) and such thread tracking to make useful deductions. This is not possible if we interleave the argmemonly/nofree/... semantics
with nosync. The above optimization is a real thing for a very common scenario on GPUs and also CPUs:

run_in_parallel {
  if (threadid == 0)
    effect();
  barrier();

  ... parallel stuff

  if (threadid == 0)
    effect();
  barrier();
 ...
}

I've been staring at this for quite some time now and I don't understand how it relates to this discussion. Can you be more explicit about which functions here have e.g. argmemonly but not nosync, and how that is used in an optimization?

You really also need nocapture on the argument of argmemonly_nofree_but_not_nosync for this to work, right?

yes, probably.

Finally, we started to track threads explicitly already, partially using domain knowledge, which allows us to reason about the interaction between threads
(https://reviews.llvm.org/D106397#C2702144NL1110). So even in the presence of synchronizations (atomics, barriers, etc), we can use other attributes
(argmemonly, nofree, ...) and such thread tracking to make useful deductions. This is not possible if we interleave the argmemonly/nofree/... semantics
with nosync. The above optimization is a real thing for a very common scenario on GPUs and also CPUs:

run_in_parallel {
  if (threadid == 0)
    effect();
  barrier();

  ... parallel stuff

  if (threadid == 0)
    effect();
  barrier();
 ...
}

I've been staring at this for quite some time now and I don't understand how it relates to this discussion. Can you be more explicit about which functions here have e.g. argmemonly but not nosync, and how that is used in an optimization?

If we merge the semantics of nosync into nofree, argmemonly, etc. to make them "global" instead of local, anything that contains a barrier/atomic/volatile/convergent operation will loose those attributes. Do you agree?
Now, if we start to look at barrier/atomic/convergent in more detail, e.g., by tracking the main thread on a GPU device to basically ensure there is no concurrent access to things, we would loose out on the nofree, argmemonly, etc. attributes of the functions the main thread calls in "critical regions".
In my example, let's assume effect is called only by the main thread in the two critical regions and it contains an atomic update of a global. Backing nosync into nofree, ... will prevent us to annotate effect with such arguments as it is locally not decidable if it is always called from critical regions. We can however determine it doesn't itself call free. Later, when we determine that the call sites are in critical regions we have the nofree, ... attributes available and we can act on them.

But maybe that's just a variation on what you had in mind. In any case, this kind of side channel is clearly ridiculous and should just be closed. I doubt many people would oppose that :)

I would not oppose it, I want it to be nosync, though I gave up on my patch. I fear that opens up security problems because the real malloc/free reuse memory while the abstract machine does not need to. Either way, this is something to keep in mind.

Finally, we started to track threads explicitly already, partially using domain knowledge, which allows us to reason about the interaction between threads
(https://reviews.llvm.org/D106397#C2702144NL1110). So even in the presence of synchronizations (atomics, barriers, etc), we can use other attributes
(argmemonly, nofree, ...) and such thread tracking to make useful deductions. This is not possible if we interleave the argmemonly/nofree/... semantics
with nosync. The above optimization is a real thing for a very common scenario on GPUs and also CPUs:

run_in_parallel {
  if (threadid == 0)
    effect();
  barrier();

  ... parallel stuff

  if (threadid == 0)
    effect();
  barrier();
 ...
}

I've been staring at this for quite some time now and I don't understand how it relates to this discussion. Can you be more explicit about which functions here have e.g. argmemonly but not nosync, and how that is used in an optimization?

If we merge the semantics of nosync into nofree, argmemonly, etc. to make them "global" instead of local, anything that contains a barrier/atomic/volatile/convergent operation will loose those attributes. Do you agree?

Yes (convergent operations can be nosync, but the overall point remains).

Now, if we start to look at barrier/atomic/convergent in more detail, e.g., by tracking the main thread on a GPU device to basically ensure there is no concurrent access to things, we would loose out on the nofree, argmemonly, etc. attributes of the functions the main thread calls in "critical regions".
In my example, let's assume effect is called only by the main thread in the two critical regions and it contains an atomic update of a global. Backing nosync into nofree, ... will prevent us to annotate effect with such arguments as it is locally not decidable if it is always called from critical regions. We can however determine it doesn't itself call free. Later, when we determine that the call sites are in critical regions we have the nofree, ... attributes available and we can act on them.

Okay, I see now where you're coming from. From the caller's perspective, effect isn't nosync, so it may synchronize with some other thread. There's in general no reason why that thread has to be part of the same workgroup or wave. However:

  1. One could envision a future refinement of nosync with an attribute that labels effect as not communicating outside e.g. a workgroup.
  1. There are other caller-based ways in which the possible reach of synchronization could be limited. For example, if there was a way to indicate that all synchronization is tied to memory locations, and effect is argmemonly and called with an uncaptured pointer, then it can't synchronize with any thread outside of the parallel region either and the argmemonly is still useful.

So I think this a good argument in favor of having attributes like nofree and argmemonly talk only about what happens in the calling thread.

ormris removed a subscriber: ormris.Jan 24 2022, 11:50 AM