
[LangRef] Clarify the semantics of lifetime intrinsics
Needs RevisionPublic

Authored by aqjune on Dec 15 2020, 11:53 PM.

Details

Summary

This is a proposal for clarifying the semantics of lifetime.start / lifetime.end, written with @nlopes and @RalfJung.
IIUC, the intrinsics were introduced to represent the lifetimes of local variables while keeping allocas in the entry block in IR.
However, their semantics were only loosely described in LangRef.

The proposal is consistent with the description of MIR's LIFETIME_START/LIFETIME_END markers
in StackColoring.cpp (https://github.com/llvm/llvm-project/blob/eb44682d671d66e422b02595a636050582a4d84a/llvm/lib/CodeGen/StackColoring.cpp#L163),
with additional syntactic constraints on valid lifetime intrinsic calls.

To summarize, this patch specifies that:

  • An alloca is initially *dead* if it is syntactically used by any llvm.lifetime.start.
  • Dereferencing a dead alloca is UB, but gep/ptrtoint on a dead alloca returns a valid value (this explains LICM: https://gcc.godbolt.org/z/Wo9dbo).
  • A program that contains a path executing lifetime.start twice on the same alloca is ill-formed. LLVM may raise an assertion failure on such a program.
  • A lifetime.start may not be paired with a lifetime.end in some executions (see https://gcc.godbolt.org/z/e7sdh1 , written by @nlopes).
  • A lifetime intrinsic's pointer argument should be either an alloca or a bitcast of an alloca. I couldn't come up with a reasonable scenario in which a frontend generates lifetime intrinsics for non-stack objects. For heap allocations we can simply treat the allocation call and the free call as delimiting the lifetime.
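
To illustrate the proposed rules informally (a sketch in the same shorthand IR notation used in the discussion below, not verbatim LangRef text):

p = alloca i32        // p is initially dead: it is used by a lifetime.start below
q = gep p, 0          // fine: gep/ptrtoint on a dead alloca still yields a valid value
// store 0, p         // would be UB here: dereferencing a dead alloca
lifetime.start(p)     // p becomes live
store 0, p            // fine now
lifetime.end(p)       // p is dead again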

Diff Detail

Unit Tests: Failed

Time      Test
180 ms    x64 windows > Clang.CodeGen::nobuiltin.c
1,960 ms  x64 windows > Clang.Driver::memtag_lto.c
50 ms     x64 windows > LLVM.Analysis/BasicAA::modref.ll
40 ms     x64 windows > LLVM.Analysis/CostModel::free-intrinsics-datalayout.ll
80 ms     x64 windows > LLVM.Analysis/CostModel::free-intrinsics-no_info.ll

View Full Test Results (101 Failed)

Event Timeline

aqjune created this revision.Dec 15 2020, 11:53 PM
aqjune requested review of this revision.Dec 15 2020, 11:53 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 15 2020, 11:53 PM

I wonder whether the suggested specification looks okay, or whether there are use cases of these intrinsics that I'm not aware of.

For the well-formedness of the intrinsics, I want to write a validation algorithm in Verifier.cpp, but the algorithm is expensive:
for each lifetime.start we need to traverse the CFG to check reachability to other lifetime.starts without visiting a lifetime.end.
The time complexity is O(N * E), where N = # of lifetime.starts and E = # of edges in the CFG.
Is it allowed to have such an expensive check in the Verifier?
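
Concretely, the check I have in mind is roughly the following (pseudocode, not actual Verifier.cpp code):

for each call S = lifetime.start(a):
  worklist = successors(S)
  while worklist is not empty:
    B = pop(worklist)
    if B hits lifetime.end(a) before another lifetime.start(a): continue  // this path is closed
    if B hits another lifetime.start(a): report ill-formed program
    push the unvisited successors of B onto worklist
// each lifetime.start may scan up to E edges, hence the O(N * E) bound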

aqjune updated this revision to Diff 312132.Dec 15 2020, 11:58 PM

Minor update

aqjune edited the summary of this revision. (Show Details)Dec 16 2020, 12:04 AM
lebedev.ri added a subscriber: lebedev.ri.

I'm unconvinced. Some questions:

  1. Why can't it point at a subset of an allocation?
  2. Why is it being restricted to an alloca, why shouldn't it be used for heap allocations?
aqjune updated this revision to Diff 312134.Dec 16 2020, 12:07 AM

Update Verifier

I'm unconvinced. Some questions:

  1. Why can't it point at a subset of an allocation?
  2. Why is it being restricted to an alloca, why shouldn't it be used for heap allocations?

I think the answers to these questions are related to the goal of these intrinsics. IIUC, they were introduced to represent the lifetimes of local variables while keeping allocas in the entry block in IR.
A local variable is either in scope or not; a portion of it cannot be separately in scope.
A heap allocation's lifetime can be explicitly described by the allocation call and the free call.
Is there a use case where these two cases should be supported?

The patch's description states *what* the change is, but not *why*, and i think that is backwards.

The patch's description states *what* the change is, but not *why*, and i think that is backwards.

I see, I'll slightly update the description.

aqjune edited the summary of this revision. (Show Details)Dec 16 2020, 12:25 AM

A contradiction in the proposed semantics vs the comments in the code is that we state that subsequent lifetime.start don't change the address of the alloca. There's only one address per alloca, even if it may have multiple disjoint liveness ranges.
It would be good to get confirmation from some CodeGen folks that the implemented algorithm respects this condition and that there are no plans to make the algorithm more aggressive in a way that may break this assumption.
Having multiple start/end pairs for the same alloca is not common, so I think imposing this condition is fine. It gives us free movement of gep/ptr2int, which is a good tradeoff.
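
Spelled out in the shorthand used elsewhere in this review (an informal sketch; a1/a2 are placeholder names):

p = alloca i32
lifetime.start(p)
a1 = ptrtoint p       // first live range
lifetime.end(p)
lifetime.start(p)     // a second, disjoint live range for the same alloca
a2 = ptrtoint p       // must equal a1: the address of p does not change
lifetime.end(p)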

nikic added inline comments.Dec 17 2020, 11:27 AM
llvm/docs/LangRef.rst
17647

or a zero-index GEP, because we canonicalize bitcasts to that.

This should probably be brought up on llvm-dev. This generally makes sense to me, but I don't have a good handle on what frontends use lifetime.start/end for.

What is the reason to restrict it to allocas? Just that we don't emit it right now? I don't see how that makes it conceptually better. I also second @lebedev.ri's concerns. This change looks to me like it tries to fix a set of issues and at the same time do something else, that is, restrict the functionality based on some use case. In any case, please split this into two patches: one for the clarifications, without restricting it to allocas and without specifying that the entire object is affected, e.g., "memory that is referenced in a lifetime intrinsic is initially dead".
A second one can then be proposed for the alloca and "entire object" change, though I'm not sure that is helpful/needed at all.

llvm/lib/IR/Verifier.cpp
5184

I'm doubtful of a strict syntactic requirement here. Using something from the stripPointerCast family makes more sense to me.

What is the reason to restrict it to allocas? Just that we don't emit it right now? I don't see how that makes it conceptually better.

Potentially we could use these intrinsics with heap allocations such that, after inlining, they could help remove mallocs & friends, and maybe help alias analysis (though the way BasicAA works doesn't make implementing this any easier).
One of the reasons it is restricted to allocas ATM is that the existence of lifetime.start(%p) determines that %p = alloca doesn't really allocate space right away. So we need to be able to syntactically determine all inputs to lifetime.start intrinsics, which is harder if you start supporting non-alloca'ed objects, pointer arithmetic, and everything else.
I suspect CodeGen would break right now if you use these intrinsics with non-alloca ptrs.

To summarize, I think we should restrict these intrinsics to alloca'ed ptrs for now, as LLVM doesn't support anything else. If in the future someone extends them to support other kinds of objects, then the semantics can be updated.

This change looks to me like it tries to fix a set of issues and at the same time do something else, that is, restrict the functionality based on some use case.

Not at all. It's just documenting the implicit assumptions all over LLVM's code, from IR to CodeGen. It's not changing anything. That's why we called out some CodeGen folks to ensure what's written reflects the assumptions of the stack coloring algorithm.

What is the reason to restrict it to allocas? Just that we don't emit it right now? I don't see how that makes it conceptually better.

Potentially we could use these intrinsics with heap allocations such that, after inlining, they could help remove mallocs & friends, and maybe help alias analysis (though the way BasicAA works doesn't make implementing this any easier).
One of the reasons it is restricted to allocas ATM is that the existence of lifetime.start(%p) determines that %p = alloca doesn't really allocate space right away. So we need to be able to syntactically determine all inputs to lifetime.start intrinsics, which is harder if you start supporting non-alloca'ed objects, pointer arithmetic, and everything else.

IMHO, this is backwards. We don't need to restrict the usage to X in order to make it easier to use. We need to check at the use site if it is X and only proceed then.

I suspect CodeGen would break right now if you use these intrinsics with non-alloca ptrs.

I'd argue this is a CodeGen bug.

To summarize, I think we should restrict these intrinsics to alloca'ed ptrs for now, as LLVM doesn't support anything else. If in the future someone extends them to support other kinds of objects, then the semantics can be updated.

I'm still missing the why beyond: "Because some passes don't want to check isa<AllocaInst>(V->stripPointerCasts()) before using the lifetime marker".

This change looks to me like it tries to fix a set of issues and at the same time do something else, that is, restrict the functionality based on some use case.

Not at all. It's just documenting the implicit assumptions all over LLVM's code, from IR to CodeGen. It's not changing anything. That's why we called out some CodeGen folks to ensure what's written reflects the assumptions of the stack coloring algorithm.

It absolutely is changing something. The lang ref right now does not exclude the use of this for heap memory. You can argue a use for heap memory is wrong and breaks hidden assumptions, but the more straightforward explanation is that the users should verify that their assumptions hold, especially if it is trivial to do so. As I said before, there are two parts, because some of the clarifications addressed here are just that: clarifications. Without them the intrinsics don't make much sense, and we should certainly adopt them. However, heap memory and partial object lifetimes make sense on their own, and both can be checked easily.

lebedev.ri requested changes to this revision.Dec 17 2020, 12:33 PM
This revision now requires changes to proceed.Dec 17 2020, 12:33 PM

To summarize, I think we should restrict these intrinsics to alloca'ed ptrs for now, as LLVM doesn't support anything else. If in the future someone extends them to support other kinds of objects, then the semantics can be updated.

I'm still missing the why beyond: "Because some passes don't want to check isa<AllocaInst>(V->stripPointerCasts()) before using the lifetime marker".

We need to support performing stack coloring consistently. Completely skipping coloring for a large function is a big performance hit. We need LLVM IR intrinsics that support this. Given that, we have two options:

  1. We restrict llvm.lifetime.start/end so a random llvm.lifetime intrinsic can't completely disable the stack coloring pass. I think this means isa<AllocaInst>(V->stripPointerCasts()) must always be true; otherwise, I'm not sure how you would figure out the points in the function where a given alloca is live.
  2. We give up on the existing lifetime intrinsics and introduce new ones which do allow consistent stack coloring.

To summarize, I think we should restrict these intrinsics to alloca'ed ptrs for now, as LLVM doesn't support anything else. If in the future someone extends them to support other kinds of objects, then the semantics can be updated.

I'm still missing the why beyond: "Because some passes don't want to check isa<AllocaInst>(V->stripPointerCasts()) before using the lifetime marker".

We need to support performing stack coloring consistently. Completely skipping coloring for a large function is a big performance hit. We need LLVM IR intrinsics that support this. Given that, we have two options:

  1. We restrict llvm.lifetime.start/end so a random llvm.lifetime intrinsic can't completely disable the stack coloring pass. I think this means isa<AllocaInst>(V->stripPointerCasts()) must always be true; otherwise, I'm not sure how you would figure out the points in the function where a given alloca is live.
  2. We give up on the existing lifetime intrinsics and introduce new ones which do allow consistent stack coloring.

I'm confused. Why can't you just ignore the lifetime markers for which isa<AllocaInst>(V->stripPointerCasts()) is not true in the pass? That should give you at least the same results as this patch would, or am I missing something?

aqjune added a comment.EditedDec 17 2020, 11:28 PM

If a heap allocation is allowed, it really brings another level of complexity.
No one knows what lifetime.start of a malloc really means; there is no optimization that works on it, it's nontrivial to decide how it interacts with free(), etc.
We'd be leaving behind a construct that is really hard to use. From the frontend's perspective, simply placing malloc()/free(), memset(undef), or their own marker where lifetime.start/lifetime.end are supposed to be would have been a better choice.

Perhaps the best thing to do is to ask how the lifetime intrinsics are used in other frontends/projects. I'll send a mail to llvm-dev.

BTW, the unit failures are due to two reasons:
(1) Some tests simply use lifetime.start(undef)/end(undef), but they can be fixed. There are only 14 such tests.
(2) Optimizations can transform a lifetime's argument, e.g. bitcast -> gep (as @nikic said), or merge two lifetime calls in if-else branches and introduce a phi.
I'll change the syntactic constraint to be more generous. A syntactic constraint is needed anyway; otherwise we really don't know which object is passed to lifetime.start. Assume a pointer p is passed to a call f(p), which does lifetime.start(p) internally.

If a heap allocation is allowed, it really brings another level of complexity.
No one knows what lifetime.start of a malloc really means; there is no optimization that works on it, it's nontrivial to decide how it interacts with free(), etc.

Why wouldn't/shouldn't it mean exactly what it means for stack allocations?
"The memory region specified in the lifetime start is dead until a lifetime start makes that memory accessible, ..."
Could you explain what is nontrivial about the interaction with free?

We'd be leaving behind a construct that is really hard to use. From the frontend's perspective, simply placing malloc()/free(), memset(undef), or their own marker where lifetime.start/lifetime.end are supposed to be would have been a better choice.

I'm not so sure about that. memset(undef) is, for example, not recognized commonly while hasOnlyLifetimeUsers (or similar) is a common check. Moving malloc/free may come with a cost. Placing the malloc/free in a loop is a totally different story than placing lifetime markers in there. Imagine you have a heap allocation in a loop and the FE (or something else) places the malloc earlier instead and uses lifetime markers to assure the middle end that there are no loop carried dependences via this memory region. (Also, we should implement this transformation, it could not only save time wrt. allocations but also allow vectorization, especially given the lifetime markers.)

Perhaps the best thing to do is to ask how lifetime intrinsic is used in other frontends/projects. I'll send a mail to llvm-dev.

That can't hurt but the absence of users shouldn't be the only factor (IMHO). Above I already sketched an optimization to use them (by accident).

[...] Optimizations can transform a lifetime's argument, e.g. bitcast -> gep (as @nikic said), or merge two lifetime calls in if-else branches and introduce a phi. I'll change the syntactic constraint to be more generous. A syntactic constraint is needed anyway; otherwise we really don't know which object is passed to lifetime.start. Assume a pointer p is passed to a call f(p), which does lifetime.start(p) internally.

I disagree that syntactic constraints are needed, in a generic sense, and I find they often make the IR harder to work with. I'm not sure I understand your example, but I guess if you inline the call the lifetime argument could be syntactically something else, right? So it might become syntactically useful even if it wasn't before. The PHI case is the opposite: it started out syntactically useful (e.g., alloca arguments), but after sinking there might not be a single alloca but a phi picking one. Arguably, the information is no different. A user could easily determine all the underlying objects and filter the allocas.

Why wouldn't/shouldn't it mean exactly what it means for stack allocations?
"The memory region specified in the lifetime start is dead until a lifetime start makes that memory accessible, ..."

lifetime.start also has a magic effect "backward in time": when you just do alloca, it allocates some memory. When you do alloca and later do lifetime.start on the resulting pointer, (stack) memory allocation is delayed until the lifetime.start. IOW, lifetime.start is a very unusual intrinsic. Adding it *later* in the code can make a pointer *earlier* in the code dangling.

How do you propose to be able to predict if an alloca actually allocates memory, if it is legal to do something like "%p = alloca (...)", followed by calling "f(p)", followed by "lifetime.start(p)" inside "f"? Now whether the original alloca actually allocates memory depends on the source code of another function...?

Possibly a better fix is to avoid this "magic backward in time" effect, and instead use a flag on alloca to reflect whether memory is immediately allocated, or not. (The details of the semantics of this could still be tricky though.)

Why wouldn't/shouldn't it mean exactly what it means for stack allocations?
"The memory region specified in the lifetime start is dead until a lifetime start makes that memory accessible, ..."
Could you explain what is nontrivial about the interaction with free?

We'd be leaving behind a construct that is really hard to use. From the frontend's perspective, simply placing malloc()/free(), memset(undef), or their own marker where lifetime.start/lifetime.end are supposed to be would have been a better choice.

I'm not so sure about that. memset(undef) is, for example, not recognized commonly while hasOnlyLifetimeUsers (or similar) is a common check. Moving malloc/free may come with a cost. Placing the malloc/free in a loop is a totally different story than placing lifetime markers in there. Imagine you have a heap allocation in a loop and the FE (or something else) places the malloc earlier instead and uses lifetime markers to assure the middle end that there are no loop carried dependences via this memory region. (Also, we should implement this transformation, it could not only save time wrt. allocations but also allow vectorization, especially given the lifetime markers.)

Because in the case of stack-allocated objects, lifetime.start/end determines the object's location as well. Two allocas with disjoint lifetimes should be allowed to overlap.
For a heap allocation, malloc()/free() should determine the address, and it isn't clear whether frontend writers will expect lifetime intrinsics to determine disjointness.

The optimization makes sense, but wouldn't simply putting memset(undef) at the end of the loop do the trick? The backend pipeline (e.g. CodeGenPrepare) can remove the redundant memset.
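
That is, something along these lines (a sketch; large and use are placeholders):

x = malloc(large);
for (/* a long time */) {
  use(x);
  memset(x, undef, large);   // marks x's contents dead at the end of each iteration
}
free(x);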

I disagree that syntactic constraints are needed, in a generic sense, and I find they often make the IR harder to work with. I'm not sure I understand your example, but I guess if you inline the call the lifetime argument could be syntactically something else, right? So it might become syntactically useful even if it wasn't before. The PHI case is the opposite: it started out syntactically useful (e.g., alloca arguments), but after sinking there might not be a single alloca but a phi picking one. Arguably, the information is no different. A user could easily determine all the underlying objects and filter the allocas.

It's because it can introduce UB. Consider this example:

p = alloca i32
store 0, p // analysis will say this is perfectly fine
f(p) // lifetime.start(p) is hidden in f(p)

After inlining:

p = alloca i32
store 0, p
lifetime.start(p) // this makes the above store raise UB

Possibly a better fix is to avoid this "magic backward in time" effect, and instead use a flag on alloca to reflect whether memory is immediately allocated, or not. (The details of the semantics of this could still be tricky though.)

Yeah, the alloca should know its 'valid' locations in advance. A simple solution would be to make ptrtoint yield poison if it is executed before lifetime.start, but it sacrifices the mobility of ptrtoint.
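
For example (a sketch of that rule and of the mobility it sacrifices):

p = alloca i32
a = ptrtoint p        // poison under that rule: executed before lifetime.start(p)
lifetime.start(p)
b = ptrtoint p        // valid
// hoisting the second ptrtoint above the lifetime.start would turn a valid
// value into poison, so ptrtoint could no longer be moved freely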

Two allocas with disjoint lifetimes should be allowed to overlap.

Ah, that makes it very tricky... to also support ptrtoint mobility, the alloca really needs to "see ahead" and determine which stack slots have disjoint lifetimes and which do not. Then it can immediately "reserve" some memory that does not overlap with stack slots whose lifetimes are not disjoint.

jdoerfert added a comment.EditedDec 18 2020, 7:33 AM

Two allocas with disjoint lifetimes should be allowed to overlap.

Ah, that makes it very tricky... to also support ptrtoint mobility, the alloca really needs to "see ahead" and determine which stack slots have disjoint lifetimes and which do not. Then it can immediately "reserve" some memory that does not overlap with stack slots whose lifetimes are not disjoint.

IMHO, this is not the job of an "alloca". A late transformation could make this explicit, that is, first perform lifetime analysis on stack slots and then coalesce/reassign the allocas accordingly. Basically register allocation for stack slots. I would have assumed we have something like this, but I'm generally too afraid to venture into the MIR layer to check.

Why wouldn't/shouldn't it mean exactly what it means for stack allocations?
"The memory region specified in the lifetime start is dead until a lifetime start makes that memory accessible, ..."

lifetime.start also has a magic effect "backward in time": when you just do alloca, it allocates some memory. When you do alloca and later do lifetime.start on the resulting pointer, (stack) memory allocation is delayed until the lifetime.start. IOW, lifetime.start is a very unusual intrinsic. Adding it *later* in the code can make a pointer *earlier* in the code dangling.

This mixes the semantics and the implementation of an alloca. Let's not do that. If you look at it a different way it makes more sense: memory regions mentioned in lifetime intrinsics are initially dead. That does not mean they are nonexistent or that stack allocations would be delayed; they are known to be dead in a certain code region. Adding a lifetime adds information. Adding it *later* adds more information. Arguably, we could mark the alloca in a way that indicates there are lifetimes attached, but that is not the point. The thing is, we can decide to coalesce allocas with non-overlapping lifetimes into the same slot. (Under some conditions,) you can also reassign totally different allocas to code regions. Take

alloca x,y,z;

lifetime.start(x)
use(x)
if (...) {
  lifetime.start(y)
  use(y)                // x and y are live
  lifetime.end(y)
}
use(x)
if (...) {
  lifetime.start(z)
  use(z)                // x and z are live
  lifetime.end(z)
}
lifetime.end(x)

lifetime.start(y)
lifetime.start(z)
use(y)
use(z)                // y and z are live
lifetime.end(z)
lifetime.end(y)

If I didn't mess up, none of the allocas have disjoint lifetimes, but we can split the lifetimes and get away with two slots anyway.
My point is, lifetime markers provide information and do not change the semantics of an alloca. A transformation can act on that information, but that is different.

How do you propose to be able to predict if an alloca actually allocates memory, if it is legal to do something like "%p = alloca (...)", followed by calling "f(p)", followed by "lifetime.start(p)" inside "f"? Now whether the original alloca actually allocates memory depends on the source code of another function...?

I propose to err on the cautious side and allocate memory if we cannot prove we don't need it. How is this any different from the proposal in this patch, which would simply disallow those lifetime markers? Having some we cannot act upon is not worse than forbidding them.

Possibly a better fix is to avoid this "magic backward in time" effect, and instead use a flag on alloca to reflect whether memory is immediately allocated, or not. (The details of the semantics of this could still be tricky though.)

As mentioned above, I can see it being useful to have a flag indicating whether there are lifetime markers attached, but that is something different.

RalfJung added a comment.EditedDec 18 2020, 7:41 AM

This mixes the semantics and the implementation of an alloca. Let's not do that.

I'm afraid that happened a long time ago when these intrinsics were added. It is, as far as I can see, not possible to describe the semantics of alloca in a correct way without adding this "magic rule".

You stated what you think the semantics of lifetime.start should be, but what about alloca? Somehow you have to explain the following behavior:

x = alloca 4
y = alloca 4
// assume no lifetime.start in the rest of the function
// now I can prove that x != y

but

x = alloca 4
y = alloca 4
// Now, x and y may be equal!

lifetime.start x
...
lifetime.end x
lifetime.start y
...
lifetime.end y

In other words, to describe the semantics of alloca (its behavior on the Abstract Machine), we need to "look ahead" and check the lifetime.start of the current function.

We shouldn't mix semantics and what analyses can conclude about a program (which is a consequence of the semantics). "It marks the local dead before the call" is an effect on liveness analyses; it doesn't say anything about what happens in the Abstract Machine. You have to explain in which way "deadness" is reflected in the Abstract Machine.

In particular, one key aspect of the semantics of alloca + lifetime.start is when the allocation (the non-deterministic choice of the address for this local) is made, and how the choice is constrained. Usually for allocation the semantics is that you can pick anything that doesn't overlap with other currently existing allocations, but for alloca in the presence of lifetime.start, this does not work, as the example above shows. That's why we keep talking about "looking ahead", "delaying the allocation", etc.

uenoku added a subscriber: uenoku.Dec 18 2020, 7:42 AM

[...]

We'd be leaving behind a construct that is really hard to use. From the frontend's perspective, simply placing malloc()/free(), memset(undef), or their own marker where lifetime.start/lifetime.end are supposed to be would have been a better choice.

I'm not so sure about that. memset(undef) is, for example, not recognized commonly while hasOnlyLifetimeUsers (or similar) is a common check. Moving malloc/free may come with a cost. Placing the malloc/free in a loop is a totally different story than placing lifetime markers in there. Imagine you have a heap allocation in a loop and the FE (or something else) places the malloc earlier instead and uses lifetime markers to assure the middle end that there are no loop carried dependences via this memory region. (Also, we should implement this transformation, it could not only save time wrt. allocations but also allow vectorization, especially given the lifetime markers.)

Because in the case of stack-allocated objects, lifetime.start/end determines the object's location as well. Two allocas with disjoint lifetimes should be allowed to overlap.

Sure. You could do the same transformation with heap objects under the same constraints. So you go from

for (/* a long time */) {
  x = malloc(large);
  use(x);
  free(x);

  y = malloc(large);
  use(y);
  free(y);
}

to

x = malloc(large);
y = malloc(large);
for (/* a long time */) {
  lifetime.start(x)
  use(x);
  lifetime.end(x)

  lifetime.start(y)
  use(y);
  lifetime.end(y)
}
free(x);
free(y);

and finally to

x = malloc(large);
for (/* a long time */) {
  lifetime.start(x)
  use(x);
  lifetime.end(x)

  lifetime.start(x)
  use(x);
  lifetime.end(x)
}
free(x);

same as you would for stack allocations.

For a heap allocation, malloc()/free() should determine the address, and it isn't clear whether frontend writers will expect lifetime intrinsics to determine disjointness.

The above transformation could be done in the middle end just fine. Even the first step is beneficial and the lifetime markers make it possible to perform the second part.

The optimization makes sense, but wouldn't simply putting memset(undef) at the end of the loop do the trick? The backend pipeline (e.g. CodeGenPrepare) can remove the redundant memset.

Yes and no. If you replaced both markers with memset(undef) you might get to this, but there are reasons not to do it. First, you'd need to teach places about this so the dead store is not removed. Second, you'd need to teach places about the memset(undef) pattern (as we did/do for lifetime markers). Third, it is arguably easier to read/understand lifetime.start(x) than memset(x, undef). Finally, a memset(undef) looks like a write (and is one), but that is not strictly what a lifetime marker is. If you happen to have an empty lifetime range followed by a non-empty lifetime range, the memset version has WAW dependences which you need to argue away, while in the lifetime version they don't exist. This is probably part of the teaching point above, but it shows how memset now becomes different things.
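
For instance, with an empty first lifetime range (sketch):

// lifetime version: no dependence between the two ranges
lifetime.start(x)
lifetime.end(x)       // empty range
lifetime.start(x)
store v, x
lifetime.end(x)

// memset version: the first memset(x, undef) is a write, so it has a WAW
// dependence with the later store that an analysis must argue away
memset(x, undef)
memset(x, undef)
store v, x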

I disagree that syntactic constraints are needed, in a generic sense, and I find they often make the IR harder to work with. I'm not sure I understand your example, but I guess if you inline the call, the lifetime argument could be syntactically something else, right? So it might become syntactically valid even if it wasn't before. The PHI case is the opposite: it started out syntactically valid, i.e., with alloca arguments, but after sinking there might not be a single alloca but a phi picking one. Arguably, the information is no different. A user could easily determine all the underlying objects and filter the allocas.

It's because it can introduce UB. Consider this example:

p = alloca i32
store 0, p // analysis will say this is perfectly fine
f(p) // lifetime.start(p) is hidden in f(p)

After inlining:

p = alloca i32
store 0, p
lifetime.start(p) // this makes the above store raise UB

Given the proposed semantic of lifetime.start, the program had UB semantics prior to inlining, we might not have known it statically but that is not an issue.

This mixes semantic and implementation of an alloca. Let's not do that.

I'm afraid that happened a long time ago when these intrinsics were added. It is, as far as I can see, not possible to describe the semantic of alloca in a correct way without adding this "magic rule".

You stated what you think the semantics of lifetime.start should be, but what about alloca?

I would argue the alloca is the problem not the lifetime markers. The crucial problem in your example is that we assume alloca represents a unique address, unconditionally.
Let's assume it doesn't for a second, see below.

Somehow you have to explain the following behavior:

x = alloca 4
y = alloca 4
// assume no lifetime.start in the rest of the function
// now I can prove that x != y

but

x = alloca 4
y = alloca 4
// Now, x and y may be equal!

lifetime.start x
...
lifetime.end x
lifetime.start y
...
lifetime.end y

In other words, to describe the semantics of alloca (its behavior on the Abstract Machine), we need to "look ahead" and check the lifetime.start of the current function.

We shouldn't mix semantics and what analyses can conclude about a program (which is a consequence of the semantics). "It marks the local dead before the call" is an effect on liveness analyses, it doesn't say anything about what happens in the Abstract Machine. You have to explain in which way "deadness" is reflected in the Abstract Machine.

In particular, one key aspect of the semantics of alloca + lifetime.start is when the allocation (the non-deterministic choice of the address for this local) is made, and how the choice is constrained. Usually for allocation the semantics is that you can pick anything that doesn't overlap with other currently existing allocations, but for alloca in the presence of lifetime.start, this does not work, as the example above shows. That's why we keep talking about "looking ahead", "delaying the allocation", etc.
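The "usual" allocation rule mentioned here can be written down operationally. Below is a toy Python machine (purely illustrative; the names are made up, and first-fit stands in for the non-deterministic choice the semantics allows):

```python
# Toy abstract machine illustrating the usual allocation rule: pick any
# address whose block does not overlap a *currently live* allocation.

class Machine:
    def __init__(self, mem_size=64):
        self.mem_size = mem_size
        self.live = {}  # name -> (addr, size) of live allocations

    def alloca(self, name, size):
        for addr in range(self.mem_size - size + 1):
            # candidate block [addr, addr+size) disjoint from every
            # live block [a, a+s)?
            if all(addr + size <= a or a + s <= addr
                   for a, s in self.live.values()):
                self.live[name] = (addr, size)
                return addr
        raise MemoryError("out of stack")

m = Machine()
x = m.alloca("x", 4)
y = m.alloca("y", 4)
# With only live allocations constraining the choice, two simultaneously
# live allocas can never receive equal addresses.
assert x != y
```

This is exactly the invariant that lifetime.start breaks: at the point of the alloca, the machine cannot yet know that the allocation may be delayed or coalesced later.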

Proposal:

  1. As you noted, there are "constrained" and "unconstrained" allocas; if there might be a lifetime marker use, an alloca is constrained.
  2. Once an alloca is unconstrained, it cannot go back, thus we need to manifest that status with the alloca.
  3. We do not know that constrained allocas have unique addresses, thus don't fold comparisons of (different) allocas if one is constrained.
  4. [Optional] Provide a alloca coalescing pass in the middle-end that will utilize the lifetime markers and split/coalesce allocas.

Interestingly, I think 2) is also needed in this approach as we could otherwise introduce lifetime markers after folding some comparisons but not others.
So if we start with:

x = alloca
y = alloca
r0 = x == y; // `false` as of now and with this proposal, afaict
//
use(/* noescape */ x);
//
use(/* noescape */ y);
//
auto cmp = [&](){ return x == y; };
r1 = cmp();

return r0 + r1;

Now we fold r0 to false first. Then we inline cmp. Then we realize x and y are only dereferenced in the use part and we introduce lifetime markers:

x = alloca
y = alloca

lifetime.start(x);
use(/* noescape */ x);
lifetime.end(x);

lifetime.start(y);
use(/* noescape */ y);
lifetime.end(y);

r1 = x == y;

return false + r1;

we coalesce x and y and r1 might be true. WDYT?

As Ralf mentioned, the ship has sailed. Alloca and lifetime intrinsics were implemented like this several years ago. They were a quick hack to save stack space after inlining. That's it, and their design reflects the goals at the time.
We simply want to document what is implemented. @jdoerfert you seem to want to change the implementation and/or the design, which is a separate discussion. I suggest we first document how LLVM works and then if you want to make changes you start a *separate* discussion on the things you want to change, why, and what's the upgrade path, etc. We can't change the semantics of either alloca or lifetime intrinsics without an automatic upgrade path as otherwise we would break all frontends out there.

As Ralf mentioned, the ship has sailed. Alloca and lifetime intrinsics were implemented like this several years ago. They were a quick hack to save stack space after inlining. That's it, and their design reflects the goals at the time.
We simply want to document what is implemented. @jdoerfert you seem to want to change the implementation and/or the design, which is a separate discussion. I suggest we first document how LLVM works and then if you want to make changes you start a *separate* discussion on the things you want to change, why, and what's the upgrade path, etc. We can't change the semantics of either alloca or lifetime intrinsics without an automatic upgrade path as otherwise we would break all frontends out there.

But you don't "simply document what is implemented". You make up new *syntactic* restrictions tailored towards the use case in the backend. This changes the semantics of the IR. Now you somehow seem to argue that this is necessary to fix a bug and therefore we don't need a separate discussion for such a change (or the one you had is sufficient). I disagree.

Also, I did not suggest to change the semantics without an RFC and everything else that is needed. You suggest otherwise in your response. I did propose alternatives to fix your bug, but most of my responses show how the restrictions you want to put in place are making (potentially existing) optimizations invalid.

Finally, in my very first response I asked to split this into the clarifications and the additional restrictions. Since then we argued almost exclusively why those restrictions might be needed or why not. I am not convinced that the clarifications, in addition to the isa<AllocaInst>(V->stripPointerCasts()) at the use side, are not sufficient to fix your problem anyway.
At least I have not seen an example where the syntactic requirement was really needed.

Are existing heap optimizations checking whether a malloc is used by lifetime.start/end?
I see that dereferenceable_or_null(n) is always attached to call i8* malloc(<n>). If lifetime.start/end can take a heap pointer, its lifetime may start way after this call, so malloc isn't immediately dereferenceable even if it isn't null.
We can change the semantics of dereferenceable_or_null, but this can transitively require changes to other attributes or dereferenceability analyses.
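Concretely, the tension looks like this (a sketch; the attribute on the call is the form clang emits today, the rest is illustrative):

```llvm
; The attribute promises dereferenceability (if non-null) right at the
; call, before any lifetime.start could possibly run:
%p = call dereferenceable_or_null(8) i8* @malloc(i64 8)
; ... arbitrary code; under a heap-lifetime reading, %p would not yet
; be "alive" here, contradicting the attribute above ...
call void @llvm.lifetime.start.p0i8(i64 8, i8* %p)
```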

Sure. You could do the same transformation with heap objects under the same constraints. So you go from
... (snip)

I think there are more cases to consider, such as:

p = malloc(1)
lifetime.start(p)
lifetime.end(p)
q = malloc(1)
// Can the addresses of p and q overlap?
free(q)
lifetime.start(p)
lifetime.end(p)
free(p)

That is why I still prefer memset(undef); it may be a churn to teach that memset(undef) is a canonicalized form to e.g., loop dependence analysis, but teaching lifetime also needs churn.
Most importantly, memset's semantics is crystal clear; it doesn't change the object's address. It also naturally supports a thing that is analogous to 'partial lifetime start'.

It's because it can introduce UB. Consider this example:

p = alloca i32
store 0, p // analysis will say this is perfectly fine
f(p) // lifetime.start(p) is hidden in f(p)

After inlining:

p = alloca i32
store 0, p
lifetime.start(p) // this makes the above store raise UB

Given the proposed semantic of lifetime.start, the program had UB semantics prior to inlining, we might not have known it statically but that is not an issue.

But it would be great to say that alloca is simply dereferenceable as long as it isn't used by any lifetime.start/end, wouldn't it?
We don't want to assume that a function call may affect the lifetime of escaped local variables, and IMO this is what LLVM assumes as well.

Given the proposed semantic of lifetime.start, the program had UB semantics prior to inlining, we might not have known it statically but that is not an issue.

The question is, *how* is it UB? To define UB precisely, we need an operational specification -- something that could, in principle, be implemented in a (complicated) interpreter. But for this example, which is the instruction that raises the UB? It cannot be the alloca, since when the alloca is being executed we have no way to know whether the return value will be used in a lifetime.start in the future! So does UB only arise when the lifetime.start happens? That would mean that the store itself is fine, but when lifetime.start happens we check if there was a load/store already and we raise UB at that point. I am not sure if that has the right behavior.
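To make that second reading concrete, here is a hypothetical toy interpreter fragment (illustrative only; not a claim about LLVM's actual semantics): the store itself executes fine, and UB is only raised when lifetime.start observes that the slot was already accessed.

```python
class UB(Exception):
    pass

class AbstractMachine:
    def __init__(self):
        self.accessed = set()  # allocas that have been loaded/stored
        self.live = set()      # allocas currently inside a lifetime range

    def store(self, p):
        self.accessed.add(p)   # under this reading, not UB by itself

    def lifetime_start(self, p):
        # UB is detected here, retroactively, if the slot was already
        # accessed outside any lifetime range.
        if p in self.accessed and p not in self.live:
            raise UB(f"{p} was accessed before lifetime.start")
        self.live.add(p)

m = AbstractMachine()
m.store("p")               # fine at the moment it executes
try:
    m.lifetime_start("p")  # only here does the machine raise UB
except UB as e:
    print("UB:", e)        # prints: UB: p was accessed before lifetime.start
```

Whether this retroactive check has the right behavior is exactly the open question raised above.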

It also doesn't solve the issue of defining alloca in a way that allocations can overlap.

As you noted, there are "constrained" and "unconstrained" allocas; if there might be a lifetime marker use, an alloca is constrained.

The thing is, with a purely operational interpretation of lifetime.start, whether an alloca is "constrained" depends on *what happens in the future*. This is not a well-defined semantics (or at least, it is non-obvious how to make it one).

To give one example where "depending on the future" raises some nasty questions that one really doesn't want to arise, consider a program like

x = alloca 4
y = alloca 4
if (x == y) {
  lifetime.start(x); lifetime.end(x);
  lifetime.start(y); lifetime.end(y);
}

Now if the two pointers may be equal, then their lifetimes are disjoint so they cannot be equal. But if they cannot be equal, the if is never executed so the program is equivalent to

x = alloca 4
y = alloca 4

But in this program, they can be equal.

Also, I did not suggest to change the semantics without an RFC and everything else that is needed. You suggest otherwise in your response. I did propose alternatives to fix your bug, but most of my responses show how the restrictions you want to put in place are making (potentially existing) optimizations invalid.

The thing is, in our view (aqjune, nlopes, and me) you *are* changing the semantics by allowing the lifetime intrinsics to be used on things like "malloc". From all we can tell, the intrinsics were never meant to be used with anything but alloca, so for all intents and purposes, currently, they only support alloca. All this patch is doing is making this currently-implicit assumption explicit. Your responses show that this is really long-overdue, since currently different parts of the compiler seem to assume different semantics.

The syntactic restriction we propose is only new in the sense that it hasn't been written down yet. But conceptually, I'd argue that the intrinsics have always been restricted to be "for alloca only", and I think the history of their introduction and how they currently work confirms this.

Now, maybe there are good reasons to support these intrinsics also more generally. But IMO the best way to go about this is to first document the limited cases the intrinsics was meant (and is implemented) to work with, and once we have a solid base to work on, we can consider extending the scope of what the intrinsics can be used for.

Maybe the root of the problem was that LangRef was too obscure about what the intrinsic meant. :/
If there were a well-known use case that does not involve alloca, this patch would seriously break that case, so the semantics of lifetime.start/end would have had to consider it from the start.
But we haven't found one, and I think the dereferenceable_or_null on malloc actually argues against that interpretation (maybe I should have found it before writing this patch and shared it to reduce discussion cost, sorry).
So I think it is okay to specify it as working on alloca only first, and if any really problematic case shows up, extend the lifetime semantics as discussed in this thread.

nhaehnle added a subscriber: nhaehnle. Edited Dec 21 2020, 9:52 PM

Also, I did not suggest to change the semantics without an RFC and everything else that is needed. You suggest otherwise in your response. I did propose alternatives to fix your bug, but most of my responses show how the restrictions you want to put in place are making (potentially existing) optimizations invalid.

The thing is, in our view (aqjune, nlopes, and me) you *are* changing the semantics by allowing the lifetime intrinsics to be used on things like "malloc". From all we can tell, the intrinsics were never meant to be used with anything but alloca, so for all intents and purposes, currently, they only support alloca.

This isn't a very useful part of the discussion, but I'd say that as a matter of policy the LangRef must be a document that people can rely on, else what's the point of having it? Since it is written as if lifetime intrinsics could be applied rather freely, I agree with @jdoerfert here. It's fair to say that this patch is a change of semantics by forbidding the use of the lifetime intrinsics in certain places where they were previously allowed. This is also reflected in the verifier changes that are part of the patch. Those changes should be justified on their own merit instead of pretending that they're not changes :)

That said, I find the direction of this patch very reasonable. The meaning of lifetime intrinsics has always been somewhat unclear to me, and this patch clarifies it in a way that makes sense: they have a pragmatic purpose specifically to enable efficient stack slot allocation in large functions. Preventing other uses can help us reduce the potential for confusion, which is a good thing.

Given the proposed semantic of lifetime.start, the program had UB semantics prior to inlining, we might not have known it statically but that is not an issue.

The question is, *how* is it UB? To define UB precisely, we need an operational specification -- something that could, in principle, be implemented in a (complicated) interpreter. But for this example, which is the instruction that raises the UB? It cannot be the alloca, since when the alloca is being executed we have no way to know whether the return value will be used in a lifetime.start in the future! So does UB only arise when the lifetime.start happens? That would mean that the store itself is fine, but when lifetime.start happens we check if there was a load/store already and we raise UB at that point. I am not sure if that has the right behavior.

I agree that this is a bit iffy, though I'm not sure a literally operational specification is always required. Recall the example program:

p = alloca i32
store 0, p // analysis will say this is perfectly fine
f(p) // lifetime.start(p) is hidden in f(p)

With this patch, the program isn't just UB, it actually becomes ill-formed IR because the lifetime.start in f(p) doesn't syntactically reference an alloca. So the example does not exist.

@jdoerfert's proposal is on point conceptually:

  1. As you noted, there are "constrained" and "unconstrained" allocas; if there might be a lifetime marker use, an alloca is constrained.
  2. Once an alloca is unconstrained, it cannot go back, thus we need to manifest that status with the alloca.
  3. We do not know that constrained allocas have unique addresses, thus don't fold comparisons of (different) allocas if one is constrained.

The one change I would make in that proposal is to say that "if there is a lifetime marker use, an alloca is constrained". As long as the markers must be in the same IR function as the alloca itself, and phi/select are syntactically excluded, this determination can be made reliably (if not easily, due to the bitcast mess).

Ideally, there would be a flag on the alloca instruction itself to distinguish between constrained and unconstrained. As it is, the flag is implicitly given as "has a (indirect) lifetime.start use". That's not great, and perhaps you could make a follow-up patch that introduces such a flag explicitly, but I think it's workable either way.

I would suggest that you add language to the LangRef to make the distinction between constrained and unconstrained explicit, including the example of how it affects folding of pointer comparisons.

llvm/docs/LangRef.rst
17647

Assuming this language is agreed upon, replace "should" by "must". It's clearer, and this patch adds a corresponding check to the verifier.

17674–17675

Make this a "must".

17701–17702

Same here: "must" instead of "should"

llvm/lib/IR/Verifier.cpp
5184

I think the strict syntactic requirement is necessary to make this work without a "constrained vs. unconstrained" flag on the alloca instruction.

I agree that this probably needs stripPointerCasts() or similar. I also think that the two cases should be unified -- as it is, there's pointless code duplication.
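For illustration, the shapes accepted and rejected under the proposed syntactic rule would be roughly (a sketch, not the verifier's exact diagnostics):

```llvm
%a = alloca i64
%b = bitcast i64* %a to i8*
call void @llvm.lifetime.start.p0i8(i64 8, i8* %b)   ; OK: bitcast of an alloca
; call void @llvm.lifetime.start.p0i8(i64 8, i8* %m) ; rejected if %m is,
;   e.g., a malloc result, a phi, or a select rather than an alloca
```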

I'd say that as a matter of policy the LangRef must be a document that people can rely on, else what's the point of having it? Since it is written as-if lifetime intrinsics could be applied rather freely

I think it is rather clear that in this case, LangRef is ambiguous and incomplete (and not entirely coherent with "what happens in the code"), so in my opinion clinging to what exactly it says is less useful than trying to figure out the intent of these intrinsics by taking into account both LangRef and "what happens in the code".

It's fair to say that this patch is a change of semantics by forbidding the use of the lifetime intrinsics in certain places where they were previously allowed.

They were allowed by the LangRef but not supported in the code, and the discussion around their introduction did not even consider this case. Looks like a LangRef bug to me. The original authors presumably never considered the possibility of using the intrinsics for anything else, so they did not deem this case even worth mentioning.

Neither LangRef nor the code let us say anything precise and meaningful about what happens when the intrinsics are applied to anything other than alloca. As this discussion shows, even saying what they do for alloca is very non-trivial. Figuring out what exactly the intrinsics do in general is a significant addition to the LangRef -- do you agree with this part? So maybe we can add some wording that "if the intrinsics are used in this specific syntactically constrained way, then their behavior is X; otherwise their behavior is currently unclear and needs to be more precisely determined". Then we can later separately discuss removing the "otherwise" clause and saying that other uses are simply forbidden (or someone finds a good way to precisely document the more general case).

I agree that this is a bit iffy, though I'm not sure a literally operational specification is always required.

I don't know any other way to make a precise specification that is internally consistent.

I am considering the patch to still be "operational enough" because the non-operational part is given by fairly simple purely syntactic rules that only check the body of the current function. So taking into account "is there, syntactically, an intrinsic call on the result of this alloca" is okay -- it's not great, but at least it is fairly clear how to make it precise. I am only worried that there might be code moving/removing optimizations that change this syntactic information in problematic ways, but we first need a more precise LangRef to even say which changes are 'problematic'. If that's what you mean by not using a "literally operational specification", we are in agreement.

What I object to is specifications that "look into the future" by asking questions like "will the result of this alloca be passed to lifetime.start some time during this program execution". First of all, the data-flow implicit in this statement is rather subtle, and secondly, as my example shows, "looking into the future" to alter past behaviors can easily lead to contradictions reminiscent of "causal loops".

With this patch, the program isn't just UB, it actually becomes ill-formed IR because the lifetime.start in f(p) doesn't syntactically reference an alloca. So the example does not exist.

Indeed, and IMO that's good. We have no idea how to properly define the semantics of the program, and there is no good reason to support this program in the first place, so punting the question by excluding it from the domain of programs worth considering seems like a good strategy to me.

The one change I would make in that proposal is to say that "if there is a lifetime marker use, an alloca is constrained". As long as the markers must be in the same IR function as the alloca itself, and phi/select are syntactically excluded, this determination can be made reliably (if not easily, due to the bitcast mess).

So the "is" here is a purely syntactic check? That sounds pretty similar to what the patch says here, right (modulo the details of the meaning of "constrained", and of course saying what exactly constitutes a "use")?

The purely syntactic check fails, however, for allocas that are not following the syntactic restriction proposed in the patch (or a slight relaxation of this restriction to support a few more operations on the "use path" from the alloca to the intrinsic).

Ideally, there would be a flag on the alloca instruction itself to distinguish between constrained and unconstrained. As it is, the flag is implicitly given as "has a (indirect) lifetime.start use". That's not great, and perhaps you could make a follow-up patch that introduces such a flag explicitly, but I think it's workable either way.

Note that it's not just a boolean flag. For constrained allocas, it also matters with which other constrained allocas their lifetime can potentially overlap. This determines which allocations might overlap in memory, and which may not. So even an explicit boolean flag would be insufficient to avoid "magic syntactical rules".

I'll resolve the unit test failures and see if there are any unexpected cases.

In some cases, this can resolve sub-optimal codegen, because the verifier now explicitly specifies how lifetime.start/end must be used.
AddressSanitizer.cpp currently bails out if there is an 'unknown' lifetime:

if (HasUntracedLifetimeIntrinsic) {
  // If there are lifetime intrinsics which couldn't be traced back to an
  // alloca, we may not know exactly when a variable enters scope, and
  // therefore should "fail safe" by not poisoning them.
  StaticAllocaPoisonCallVec.clear();
  DynamicAllocaPoisonCallVec.clear();
}
ychen added a subscriber: ychen. Dec 26 2020, 5:04 PM

It's fair to say that this patch is a change of semantics by forbidding the use of the lifetime intrinsics in certain places where they were previously allowed.

They were allowed by the LangRef but not supported in the code

You mean not supported by codegen, right? It would still have been possible for somebody to use them in some intermediate-only use of LLVM. I agree with your more important points, though.

The one change I would make in that proposal is to say that "if there is a lifetime marker use, an alloca is constrained". As long as the markers must be in the same IR function as the alloca itself, and phi/select are syntactically excluded, this determination can be made reliably (if not easily, due to the bitcast mess).

So the "is" here is a purely syntactic check? That sounds pretty similar to what the patch says here, right (modulo the details of the meaning of "constrained", and of course saying what exactly constitutes a "use")?

Right. As I said, I think the patch is pointing in the right direction. It would be good to more explicitly talk about "constrained" vs. "unconstrained" allocas in the LangRef part.

The purely syntactic check fails, however, for allocas that are not following the syntactic restriction proposed in the patch (or a slight relaxation of this restriction to support a few more operations on the "use path" from the alloca to the intrinsic).

That's why I'd hope that the IR verifier checks will stick around. Also, slight relaxations of the restriction can still be checked syntactically, but the check becomes more expensive. For example, when lifetime.start(bitcast(alloca)) is allowed, determining whether an alloca is constrained requires searching not just the users of the alloca, but also the users of any bitcasts of the alloca. That extra cost is why I brought up the idea of an explicit flag on the alloca instruction.

Though relaxing the syntactic rules has other downsides. For example: if only the form lifetime.start(alloca) is allowed, then preventing the introduction of phis or selects in there (as in: lifetime.start(select(alloca, ...))) is easy. If lifetime.start(bitcast(alloca)) is allowed as well, then preventing the introduction of phis or selects between the bitcast and the alloca (as in: lifetime.start(bitcast(select(alloca, ...)))) becomes a burden.
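As a sketch, the three forms look like this (illustrative IR; %c, %x, %y are placeholders):

```llvm
call void @llvm.lifetime.start.p0i8(i64 4, i8* %x8)  ; strict form: alloca only
; With bitcasts allowed, nothing syntactically local stops a pass from
; introducing a select between the bitcast and the alloca:
%s  = select i1 %c, i32* %x, i32* %y
%sc = bitcast i32* %s to i8*
call void @llvm.lifetime.start.p0i8(i64 4, i8* %sc)  ; lifetime.start(bitcast(select(...)))
```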

That's the part that makes me most uneasy about this patch.

Really, it'd be best if the syntactic rules could be kept as strict as possible.

Ideally, there would be a flag on the alloca instruction itself to distinguish between constrained and unconstrained. As it is, the flag is implicitly given as "has a (indirect) lifetime.start use". That's not great, and perhaps you could make a follow-up patch that introduces such a flag explicitly, but I think it's workable either way.

Note that it's not just a boolean flag. For constrained allocas, it also matters with which other constrained allocas their lifetime can potentially overlap. This determines which allocations might overlap in memory, and which may not. So even an explicit boolean flag would be insufficient to avoid "magic syntactical rules".

I do believe a boolean flag is sufficient for what's pragmatically required. I see a number of different kinds of questions that we want to be answered:

  1. Is a store/load through the alloca'd pointer UB without a lifetime.start?
  2. Is the alloca'd pointer known to compare unequal to a different alloca'd pointer? (for IR transforms)
  3. Which allocas can be overlapped on the stack? (for codegen)

Only the first one really needs a definitive answer encoded in the IR, and it's a simple Yes/No question.

For the second question, the actual lifetime ranges are important, but a boolean flag can be used to quickly determine whether the question must be answered conservatively or not, and I'd wager that that's sufficient for most practical purposes.

The third question also needs lifetime ranges, but I don't see why you'd want more than a boolean flag on the alloca instruction here as part of the IR itself. It feels more natural to derive the information when needed from the constrained/unconstrained distinction and placement of lifetime markers.

jdoerfert added a comment. Edited Dec 29 2020, 10:59 AM

[Removed, will repost the full comment shortly.]

jdoerfert added a comment. Edited Dec 29 2020, 2:15 PM

I believe the problem is that we use the address/allocation "rules" to fold comparisons while we later "change" those rules. This patch tries to define a way out by differentiating allocas. I'm not sure anymore that is a good idea to begin with. Consistency could also be achieved by not using the "rules" to fold alloca comparisons to "not equal". We could instead coalesce allocas in the middle end (more aggressively, and not only based on lifetime markers) in order to fold comparisons of the "then equal" allocas to "equal". That would not require us to assign different semantics to allocas depending on their uses, would not force us to restrict lifetime markers, and, probably most importantly, would allow us to coalesce more without creating inconsistencies. The first step is to check how often we fold alloca comparisons to "not equal". If that doesn't happen at all, there seems little point in all of this. Even if it does, an alternative, as sketched before, might very well perform at least as well while improving other aspects.


Are existing heap optimizations checking whether a malloc is used by lifetime.start/end?

Not in tree, though I sketched one that would, a few posts ago.

I see that dereferenceable_or_null(n) is always attached to call i8* malloc(<n>). If lifetime.start/end can take a heap pointer, its lifetime may start way after this call, so malloc isn't immediately dereferenceable even if it isn't null.
We can change the semantics of dereferenceable_or_null, but this can transitively require changes to other attributes or dereferenceability analyses.

We don't need to change dereferenceable_or_null. The LangRef *without* this patch was fine with lifetime markers on malloc'ed pointers.
Whether something needs to be adjusted with this patch has yet to be determined.

Sure. You could do the same transformation with heap objects under the same constraints. So you go from
... (snip)

I think there are more cases to consider, such as:

p = malloc(1)
lifetime.start(p)
lifetime.end(p)
q = malloc(1)
// Can the addresses of p and q overlap?  [%c = icmp eq %p, %q]
free(q)
lifetime.start(p)
lifetime.end(p)
free(p)

So the %c I added can be either true or false. false is probably clear; true becomes clear
if you argue that the pointer is dead outside its lifetime and we could allocate/deallocate it "on demand".
FWIW, I look at this the same way as I look at the alloca case (below), after all, an alloca is a malloc
from a different "pile" with an implicit free at the function exit.

p = alloca
lifetime.start(p)
lifetime.end(p)
q = alloca
// Can the addresses of p and q overlap?  [%c = icmp eq %p, %q]
lifetime.start(p)
lifetime.end(p)
// implicit pop/free(q)
// implicit pop/free(p)

In both cases you need to be consistent though. The one problem with consistency I see is that you can drop
lifetime markers later, I would assume.

That is why I still prefer memset(undef); it may be a churn to teach that memset(undef) is a canonicalized form to e.g., loop dependence analysis, but teaching lifetime also needs churn.
Most importantly, memset's semantics is crystal clear; it doesn't change the object's address. It also naturally supports a thing that is analogous to 'partial lifetime start'.

IMHO, lifetime markers should not change the object's address either. While there is a pass using lifetime markers to build life ranges to coalesce objects, that is, in principle, unrelated to lifetime markers. You could implement the same logic without lifetime markers (in the middle or backend). That said, the problem of consistency is then about the fact that we reason about the pointer addresses "before" we "fix" them. Maybe we should not do that instead. That would really open up new possibilities to coalesce stack locations. Related: Has anyone checked how much we benefit from alloca equality folding?

It's because it can introduce UB. Consider this example:

p = alloca i32
store 0, p // analysis will say this is perfectly fine
f(p) // lifetime.start(p) is hidden in f(p)

After inlining:

p = alloca i32
store 0, p
lifetime.start(p) // this makes the above store raise UB

Given the proposed semantic of lifetime.start, the program had UB semantics prior to inlining, we might not have known it statically but that is not an issue.

But it would be great to say that an alloca is simply dereferenceable as long as it isn't used by any lifetime.start/end, wouldn't it?
We don't want to assume that a function call may affect lifetime of escaped local variables, and IMO this is what LLVM assumes as well.
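To make the suggested rule concrete (a sketch under the proposed syntactic reading; @g is a hypothetical external function):

```llvm
define void @f() {
  %p = alloca i32
  ; %p is not an operand of any llvm.lifetime.start/end in this module,
  ; so under the suggested rule it is dereferenceable from the alloca on:
  store i32 0, i32* %p
  call void @g(i32* %p)   ; may read/write %p, but cannot "start" its lifetime
  %v = load i32, i32* %p  ; still fine after the escaping call
  ret void
}
declare void @g(i32*)
```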

I get the feeling this mixes pre-patch and post-patch semantics, I might be wrong though. Here is my take:
Pre-patch, this is perfectly legal and the alloca is dereferenceable. The store is dead, as per "implicit lifetime.end marker that makes the alloca dead until lifetime start". So, pre-patch, no UB and a dereferenceable alloca, agreed?
Post-patch, the pre-inline IR is not legal anymore and the post-inline IR contains UB. That is arguably different.


I played around with lifetime markers again and went down a rabbit hole (https://gcc.godbolt.org/z/cEbqds *). Now I'm not sure anymore if this patch was supposed to fix this issue or not. Afaict, the stack locations are not accessed but their address is taken. I'm confused now.

  • I started with source #1, got IR, made this source #2 and commented out some lifetime markers (doesn't matter, just to show), ran opt with a few passes, made the result source #3; running llc gets us to the inconsistent result. Interestingly, it seems we do not fold escaping allocas: https://gcc.godbolt.org/z/75sMfY

In the beginning, I wanted to create an example where we fold a comparison with an alloca *before* we inline it and create lifetime markers that are then used to coalesce the stack location with another one we used in the comparison before.


In case this patch goes in after all, maybe make lifetime start and end behave consistently, e.g.,

lifetime.start(p)
lifetime.start(p)  // ill-formed

lifetime.end(q)
lifetime.end(q)    // no-op

I discussed this with Nuno, and maybe I can split this patch into two: one that simply makes the description at StackColoring.cpp explicit in LangRef, and another that deals with the syntactic restriction.
For the second one, I feel that LangRef should get a new subsection about the lifetime markers, with a bunch of examples. :) That will help the discussion go easier as well.

(will make the split patches on Thu)

aqjune added a comment.Jan 3 2021, 5:15 PM

The first patch is here: D94002

For the second patch, let me fill it with more context.

RalfJung added a comment.EditedJan 11 2021, 4:01 AM

You mean not supported by codegen, right? It would still have been possible for somebody to use them in some intermediate-only use of LLVM. I agree with your more important points, though.

Yes.

I do believe a boolean flag is sufficient for what's pragmatically required. I see a number of different kinds of questions that we want to be answered:

I think I see what you mean. My argument was, this boolean flag would not entirely remove the need for "syntactic foresight" from the semantics (to accurately describe what happens, we still have to determine which constrained allocas have disjoint lifetimes). You seem to agree with this, but you also say that having the boolean flag could still help deal with allocas inside LLVM. I was thinking purely theoretically; I am happy to leave the pragmatic concerns to those familiar with the codebase. :) As long as we agree that "syntactic foresight" (and thus syntactic restrictions) is still required, I think we are on the same page here.

I believe the problem is that we use the address/allocation "rules" to fold comparisons while we later "change" those rules. This patch tries to define a way out by differentiating allocas. I'm not sure anymore that is a good idea to begin with. Consistency could also be achieved by not using the "rules" to fold alloca comparisons to "not equal". We could instead coalesce allocas in the middle end (more aggressively and not only based on lifetime markers) in order to fold comparisons of the "then equal" allocas to "equal". That would not require us to assign different semantics to allocas depending on their uses, would not require restricting lifetime markers, and, probably most importantly, would allow us to coalesce more without creating inconsistencies. First step is to check how often we fold alloca comparisons to "not equal". If that doesn't happen at all, there seems little point in all of this. Even if it does, an alternative, as sketched before, might very well perform at least as well while improving other aspects.
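A sketch of the direction being proposed here (a hypothetical middle-end transformation, not an existing pass): instead of folding alloca comparisons to "not equal" up front, first coalesce allocas with provably disjoint live ranges into one slot, after which the equality falls out naturally:

```llvm
; Two allocas whose live ranges are provably disjoint:
%a = alloca i32
; ... all uses of %a end here ...
%b = alloca i32
; ... uses of %b ...

; After the hypothetical coalescing pass, both use a single slot:
%s = alloca i32
; Every former use of %a and %b now uses %s, so a comparison like
;   %c = icmp eq i32* %a, %b
; becomes  icmp eq i32* %s, %s,  which trivially folds to true.
```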

An interesting proposal, thanks! I must be missing some detail though: in general, coalescing allocas could change observable behavior if the program compares the addresses of these allocas. So why would that fold to "equal" be correct? I'd expect that if the lifetime markers do not affect alloca behavior, then two pointers returned by different allocas will always compare unequal, so folding them to "equal" would be wrong.

So the %c I added can be either true or false. false is probably clear, true becomes clear
if you argue that the pointer is dead outside its lifetime and we could allocate/deallocate it "on demand".

I don't think true is permitted here if we ignore lifetime markers. We have two allocations that clearly both exist at the same time (when the icmp happens), so absent lifetime markers this should be *guaranteed* to compare unequal. I see no way to justify "true" here unless lifetime markers affect an object's address. What is the operational semantics of allocation and "icmp" that makes this happen (other than "icmp may non-deterministically return true no matter the pointers", which I don't think is the semantics anyone wants)?