This is an archive of the discontinued LLVM Phabricator instance.

Reapply: [IRGen] Emit lifetime intrinsics around temporary aggregate argument allocas
Needs Review · Public

Authored by nickdesaulniers on Feb 5 2020, 3:09 PM.

Details

Summary

This reverts commit e26c24b849211f35a988d001753e0cd15e4a9d7b.

These temporaries are only used in the callee, and their memory can be
reused after the call is complete.

rdar://58552124

Original Author: Erik Pilkington <erik.pilkington@gmail.com>

To help users transition to more disciplined temporary lifetimes, the flag
-Xclang -sloppy-temporary-lifetimes was added, which allows them to retain
the previous behavior.

Link: https://github.com/llvm/llvm-project/issues/38157
Link: https://github.com/llvm/llvm-project/issues/41896
Link: https://github.com/llvm/llvm-project/issues/43598
Link: https://github.com/ClangBuiltLinux/linux/issues/39
Link: https://reviews.llvm.org/rGfafc6e4fdf3673dcf557d6c8ae0c0a4bb3184402

Diff Detail

Event Timeline

erik.pilkington created this revision.Feb 5 2020, 3:09 PM
rjmccall added inline comments.Feb 6 2020, 11:12 AM
clang/lib/CodeGen/CGCall.cpp
3697

If the argument type has a C++ destructor, will we end its lifetime before we call destructors at the end of the full-expression?

erik.pilkington planned changes to this revision.Feb 6 2020, 12:03 PM
erik.pilkington marked an inline comment as done.
erik.pilkington added inline comments.
clang/lib/CodeGen/CGCall.cpp
3697

Yeah, this is broken :/

rjmccall added inline comments.Feb 6 2020, 12:31 PM
clang/lib/CodeGen/CGCall.cpp
3697

Well, we could at least do it for trivially-destructible types. That probably needs to include all kinds of non-trivial destructibility, not just C++, though.

erik.pilkington marked 2 inline comments as done.

Disable the optimization for non-trivially destructible types.

thegameg added inline comments.Feb 7 2020, 11:30 AM
clang/lib/CodeGen/CGCall.cpp
3687

Is there any other use of EmitAnyExprToTemp that can benefit from this?

erik.pilkington marked an inline comment as done.Feb 7 2020, 11:35 AM
erik.pilkington added inline comments.
clang/lib/CodeGen/CGCall.cpp
3687

Right now, the only other user of that function is CodeGenFunction::EmitStmtExprLValue, which doesn't seem all that interesting. There are a lot of places that do create temporary allocas that would be worth looking into in the future though.

This revision is now accepted and ready to land.Feb 7 2020, 12:16 PM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptFeb 7 2020, 2:43 PM
foad added a subscriber: foad.Aug 2 2023, 8:40 AM
foad added a comment.Aug 2 2023, 8:40 AM

Hi @erik.pilkington, I see this got reverted:

commit e26c24b849211f35a988d001753e0cd15e4a9d7b
Author: Erik Pilkington <erik.pilkington@gmail.com>
Date:   Wed Feb 12 12:02:58 2020 -0800

    Revert "[IRGen] Emit lifetime intrinsics around temporary aggregate argument allocas"
    
    This reverts commit fafc6e4fdf3673dcf557d6c8ae0c0a4bb3184402.
    
    Should fix ppc stage2 failure: http://lab.llvm.org:8011/builders/clang-ppc64be-linux-multistage/builds/23546

Do you have any more info on the "ppc stage2 failure"? I'd like to pursue something like this patch to get more accurate lifetime markers for temporaries, so that LLVM stack slot coloring can do a better job, and we get smaller stack usage. This is prompted by https://github.com/llvm/llvm-project/issues/41896

Herald added a project: Restricted Project. · View Herald TranscriptAug 2 2023, 8:40 AM

Hi @erik.pilkington, I see this got reverted:

commit e26c24b849211f35a988d001753e0cd15e4a9d7b
Author: Erik Pilkington <erik.pilkington@gmail.com>
Date:   Wed Feb 12 12:02:58 2020 -0800

    Revert "[IRGen] Emit lifetime intrinsics around temporary aggregate argument allocas"
    
    This reverts commit fafc6e4fdf3673dcf557d6c8ae0c0a4bb3184402.
    
    Should fix ppc stage2 failure: http://lab.llvm.org:8011/builders/clang-ppc64be-linux-multistage/builds/23546

Do you have any more info on the "ppc stage2 failure"? I'd like to pursue something like this patch to get more accurate lifetime markers for temporaries, so that LLVM stack slot coloring can do a better job, and we get smaller stack usage. This is prompted by https://github.com/llvm/llvm-project/issues/41896

I too am very interested, as this was flagged as a potential help for https://github.com/llvm/llvm-project/issues/41896.

Hi @erik.pilkington, I see this got reverted:

I'm not sure if @erik.pilkington is still watching Phabricator, but in any case I think (like me) he no longer has rdar access. Since this was a Linux PPC bot, it's possible Erik didn't get far investigating the failure. Three years later, the sands may have shifted substantially... maybe the best option is to rebase and investigate the new failures (if any).

@akyrtzi @arphaman @jroelofs, is one of you available to check rdar://58552124, in case it has extra context from the PPC bot failure?

jroelofs added a subscriber: lei.Aug 3 2023, 4:01 PM

Hi @erik.pilkington, I see this got reverted:

I'm not sure if @erik.pilkington is still watching Phabricator, but in any case I think (like me) he no longer has rdar access. Since this was a Linux PPC bot, it's possible Erik didn't get far investigating the failure. Three years later, the sands may have shifted substantially... maybe the best option is to rebase and investigate the new failures (if any).

@akyrtzi @arphaman @jroelofs, is one of you available to check rdar://58552124, in case it has extra context from the PPC bot failure?

I found rdar://59400672&60009515, which track the revert and the reminder to investigate. Neither goes into any more detail on how/why it broke that bot, unfortunately. I also found https://reviews.llvm.org/rGfafc6e4fdf3673dcf557d6c8ae0c0a4bb3184402 has a tiny bit more context than this review. Maybe @lei can share the reproducer mentioned in that thread? Otherwise, as Duncan said, I think your best bet is to try to re-land and see what breaks.

(I am working on rebasing this; tests need updating for opaque ptr)

nickdesaulniers reopened this revision.Aug 4 2023, 10:18 AM
This revision is now accepted and ready to land.Aug 4 2023, 10:18 AM
nickdesaulniers retitled this revision from [IRGen] Emit lifetime intrinsics around temporary aggregate argument allocas to Reapply: [IRGen] Emit lifetime intrinsics around temporary aggregate argument allocas.
nickdesaulniers edited the summary of this revision. (Show Details)
  • rebase on opaque ptrs

WIP, the following tests now fail and need to be updated:
Failed Tests (3):

Clang :: CodeGenCXX/amdgcn-call-with-aggarg.cpp
Clang :: CodeGenCoroutines/pr59181.cpp
Clang :: Driver/ps4-ps5-relax-relocations.c
nickdesaulniers requested changes to this revision.Aug 4 2023, 10:19 AM
This revision now requires changes to proceed.Aug 4 2023, 10:19 AM
ributzka removed a subscriber: ributzka.Aug 7 2023, 9:58 AM
nickdesaulniers commandeered this revision.Aug 7 2023, 10:07 AM
nickdesaulniers edited reviewers, added: erik.pilkington; removed: nickdesaulniers.
This revision is now accepted and ready to land.Aug 7 2023, 10:07 AM
nickdesaulniers planned changes to this revision.Aug 7 2023, 10:08 AM
  • fix remaining test failures
This revision is now accepted and ready to land.Aug 7 2023, 10:45 AM
  • remove instance of auto

Hi @lei can you please help us retest this version of this patch that previously broke your multistage build?
https://reviews.llvm.org/rGfafc6e4fdf3673dcf557d6c8ae0c0a4bb3184402#890444

  • remove another auto, sink ArgSlotAlloca to reduce scope

Hi @lei can you please help us retest this version of this patch that previously broke your multistage build?
https://reviews.llvm.org/rGfafc6e4fdf3673dcf557d6c8ae0c0a4bb3184402#890444

ping @lei

lei added a comment.Aug 15 2023, 7:22 AM

Sorry I missed this. Will kick off now and let you know the results soon.

lei added a comment.Aug 15 2023, 12:04 PM

@nickdesaulniers I have verified that this patch, on top of 40ee8abee77a2e8fb0089d4c7f5723b71f27d416, passes our multistage bot http://lab.llvm.org:8011/builders/clang-ppc64be-linux-multistage

Thank-you!

Thanks @lei for the help testing. Re-applied.

This commit caused invalid AddressSanitizer: stack-use-after-scope errors in our internal setup. Trying to create a standalone repro, but so far we think that the patch is incorrect. @vitalybuka says "The patch seems wrong, callee can return a reference to temp, so lifetime should be extended to full expression."

alexfh added a comment.Sep 1 2023, 3:10 AM

This commit caused invalid AddressSanitizer: stack-use-after-scope errors in our internal setup. Trying to create a standalone repro, but so far we think that the patch is incorrect. @vitalybuka says "The patch seems wrong, callee can return a reference to temp, so lifetime should be extended to full expression."

A standalone reproducer is here: https://gcc.godbolt.org/z/Ed1s15Kv5. The test passes when compiled with clang before this patch, and generates a bogus address sanitizer error after this patch. This only happens with -march=haswell.

The affected code is:

#include <unsupported/Eigen/CXX11/Tensor>
#include <gtest/gtest.h>
#include <gmock/gmock.h>
#include <absl/types/span.h>

template <typename Tensor>
auto AsArray(const Tensor& t)
    -> absl::Span<const typename Tensor::Scalar> {
  return {t.data(), (size_t)t.size()};
}
using testing::ElementsAre;

int main() {
  using Tensor = Eigen::TensorFixedSize<float, Eigen::Sizes<2, 2>,
                                       Eigen::RowMajor, Eigen::DenseIndex>;
  Tensor a;
  a.setValues({{1, 2}, {3, 4}});
  auto round = [](Tensor m) {
    return (m + 0.5f).cast<int>().cast<float>();
  };

  const Tensor t3 = round(a.log().exp());
  EXPECT_THAT(AsArray(t3), ElementsAre(1, 2, 3, 4));
}

The problem reproduces with clang -std=c++17 -O3 -fsanitize=address -march=haswell.

I'm going to revert the commit.

This commit caused invalid AddressSanitizer: stack-use-after-scope errors in our internal setup. Trying to create a standalone repro, but so far we think that the patch is incorrect. @vitalybuka says "The patch seems wrong, callee can return a reference to temp, so lifetime should be extended to full expression."

I will lazily quote https://en.cppreference.com/w/cpp/language/lifetime; we can look up the standard, or ask experts, if this is wrong:

All temporary objects are destroyed as the last step in evaluating the full-expression that (lexically) contains the point where they were created, and if multiple temporary objects were created, they are destroyed in the order opposite to the order of creation. This is true even if that evaluation ends in throwing an exception.

And full-expression here is:

const Tensor t3 = round(a.log().exp());

Parameter objects are not temporaries and have their own lifetime rules, so there's nothing wrong with this idea in principle. This seems to just be a bug, probably that we're doing a type check on E->getType() without considering whether E might be a gl-value. We definitely do not want to be emitting lifetime intrinsics for the referent of a reference argument just because the underlying type is an aggregate.

Parameter objects are not temporaries and have their own lifetime rules, so there's nothing wrong with this idea in principle. This seems to just be a bug, probably that we're doing a type check on E->getType() without considering whether E might be a gl-value. We definitely do not want to be emitting lifetime intrinsics for the referent of a reference argument just because the underlying type is an aggregate.

I think the issue is more so about returning a reference to one of the parameters; example:

struct foo { int x; };

// Returns a reference to the minimum foo.
foo &min(foo a, foo b);

void consume (foo&);

void bar () {
  foo a, b;
  consume(min(a, b));
}

With the current revision, we emit:

define dso_local void @_Z3barv() #0 {
entry:
  %a = alloca %struct.foo, align 4
  %b = alloca %struct.foo, align 4
  %agg.tmp = alloca %struct.foo, align 4
  %agg.tmp1 = alloca %struct.foo, align 4
  call void @llvm.lifetime.start.p0(i64 4, ptr %a) #4
  call void @llvm.lifetime.start.p0(i64 4, ptr %b) #4
  call void @llvm.lifetime.start.p0(i64 4, ptr %agg.tmp) #4
  call void @llvm.memcpy.p0.p0.i64(ptr align 4 %agg.tmp, ptr align 4 %a, i64 4, i1 false), !tbaa.struct !5
  call void @llvm.lifetime.start.p0(i64 4, ptr %agg.tmp1) #4
  call void @llvm.memcpy.p0.p0.i64(ptr align 4 %agg.tmp1, ptr align 4 %b, i64 4, i1 false), !tbaa.struct !5
  %coerce.dive = getelementptr inbounds %struct.foo, ptr %agg.tmp, i32 0, i32 0
  %0 = load i32, ptr %coerce.dive, align 4
  %coerce.dive2 = getelementptr inbounds %struct.foo, ptr %agg.tmp1, i32 0, i32 0
  %1 = load i32, ptr %coerce.dive2, align 4
  %call = call noundef nonnull align 4 dereferenceable(4) ptr @_Z3min3fooS_(i32 %0, i32 %1)
  call void @llvm.lifetime.end.p0(i64 4, ptr %agg.tmp) #4
  call void @llvm.lifetime.end.p0(i64 4, ptr %agg.tmp1) #4
  call void @_Z7consumeR3foo(ptr noundef nonnull align 4 dereferenceable(4) %call)
  call void @llvm.lifetime.end.p0(i64 4, ptr %b) #4
  call void @llvm.lifetime.end.p0(i64 4, ptr %a) #4
  ret void
}

which doesn't look correct (the lifetimes of %agg.tmp and %agg.tmp1 need to last past the call to consume, IIUC).

That makes me think we should probably do this lifetime annotation in the caller of CodeGenFunction::EmitCallArg since it will have information about the return type of the callee. If a function returns a reference to a type, we probably don't want to emit the lifetime markers for any parameter of that type. I wonder if strict aliasing needs to be considered, too?

(or am I chasing the wrong thing here?)

clang/test/CodeGen/lifetime-call-temp.c
54

this check line looks broken; missing :

nickdesaulniers reopened this revision.Sep 11 2023, 10:50 AM
This revision is now accepted and ready to land.Sep 11 2023, 10:50 AM
nickdesaulniers planned changes to this revision.Sep 11 2023, 10:50 AM
  • skip optimization when returning a reference; regardless of type
  • fix test run line
  • use --check-prefixes=
  • reflow RUN lines
This revision is now accepted and ready to land.Sep 11 2023, 10:52 AM
rjmccall added a comment.EditedSep 11 2023, 10:54 AM

Under [expr.call]p6 (https://eel.is/c++draft/expr.call#6), it is implementation-defined whether the lifetime of a parameter ends at the end of the containing full expression or at the end of the function. If the implementation chooses the latter, a reference to the parameter will dangle after the return. In your example, on your chosen target (probably x86-64?), we are actually forced by the psABI to end those lifetimes at the end of the function, because the contents of the parameters are passed in registers (rather than by implicit reference) and reconstituted in the stack frame of the callee. Therefore your min function has undefined behavior, and not just in a formal sense but actually in a very directly dangerous sense. That should not be surprising; intuitively, you should think of parameters as local variables, and therefore this code as returning a reference to a local variable, which is highly suspect even if it is sometimes — only on certain targets and for certain types — formally allowed.

That makes me think we should probably do this lifetime annotation in the caller of CodeGenFunction::EmitCallArg since it will have information about the return type of the callee. If a function returns a reference to a type, we probably don't want to emit the lifetime markers for any parameter of that type. I wonder if strict aliasing needs to be considered, too?

This is definitely barking up the wrong tree. The language specifies the lifetime of an object, and any access to the object during its lifetime is valid. If the parameter's lifetime *was* guaranteed to be the full expression here, we would be obliged to keep it alive for that full duration.

In particular, we cannot do anything with reasoning like "the return value cannot be a reference to the parameter". The right way to understand that is as an effort to invoke the as-if rule, arguing that it is impossible to reference the parameter after the call if it is not part of the return value, and therefore we can be more aggressive than the language would formally allow. But it is not true that the only way to reference a parameter after a call is if it is part of the return value. For example, the function could store a pointer to the parameter into a global variable, and as long as all accesses through that global happen before the parameter's lifetime ends, that's totally fine — it's obviously a very confusing and error-prone way to write code, but it's legal. (Also, even if the address of the parameter *could* only escape into the return value, the type analysis would have to be much more complicated than just whether it's a reference to the type of a parameter.)
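
For instance (a minimal sketch), the escape-through-a-global pattern is confusing but legal, because every access through the global happens while the parameter object is still within its lifetime:

struct foo { int x; };

foo *escaped = nullptr;

int read_escaped() { return escaped->x; }  // accesses the parameter via the global

int observe(foo by_value) {
  escaped = &by_value;    // the parameter's address escapes into a global
  return read_escaped();  // still within the parameter's lifetime, so OK
}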

nickdesaulniers planned changes to this revision.Sep 11 2023, 11:00 AM

Under [expr.call]p6 (https://eel.is/c++draft/expr.call#6), it is implementation-defined whether the lifetime of a parameter ends at the end of the containing full expression or at the end of the function. If the implementation chooses the latter, a reference to the parameter will dangle after the return. In your example, on your chosen target (probably x86-64?), we are actually forced by the psABI to end those lifetimes at the end of the function, because the contents of the parameters are passed in registers (rather than by implicit reference) and reconstituted in the stack frame of the callee. Therefore your min function has undefined behavior, and not just in a formal sense but actually in a very directly dangerous sense. That should not be surprising; intuitively, you should think of parameters as local variables, and therefore this code as returning a reference to a local variable, which is highly suspect even if it is sometimes — only on certain targets and for certain types — formally allowed.

That makes me think we should probably do this lifetime annotation in the caller of CodeGenFunction::EmitCallArg since it will have information about the return type of the callee. If a function returns a reference to a type, we probably don't want to emit the lifetime markers for any parameter of that type. I wonder if strict aliasing needs to be considered, too?

This is definitely barking up the wrong tree. The language specifies the lifetime of an object, and any access to the object during its lifetime is valid. If the parameter's lifetime *was* guaranteed to be the full expression here, we would be obliged to keep it alive for that full duration.

In particular, we cannot do anything with reasoning like "the return value cannot be a reference to the parameter". The right way to understand that is as an effort to invoke the as-if rule, arguing that it is impossible to reference the parameter after the call if it is not part of the return value, and therefore we can be more aggressive than the language would formally allow. But it is not true that the only way to reference a parameter after a call is if it is part of the return value. For example, the function could store a pointer to the parameter into a global variable, and as long as all accesses through that global happen before the parameter's lifetime ends, that's totally fine — it's obviously a very confusing and error-prone way to write code, but it's legal. (Also, even if the address of the parameter *could* only escape into the return value, the type analysis would have to be much more complicated than just whether it's a reference to the type of a parameter.)

Hmm...ok thanks for the feedback.

In that case, is it your opinion that there was also a "very directly dangerous sense" of UB in the test case for which the previous revision of this patch was reverted?

It's currently unclear to me how to go about relanding this; is there an explicit bug in e698695fbbf62e6676f8907665187f2d2c4d814b or not?

rjmccall requested changes to this revision.Sep 11 2023, 11:05 AM
akyrtzi removed a subscriber: akyrtzi.Sep 11 2023, 11:16 AM

I can't easily tell you because the "standalone example" involves a million templates that I don't know anything about, and nobody seems to have done the analysis to identify the specific source of the miscompile.

What I can tell you is that there is an enormous semantic difference between passing a pr-value argument to a parameter that's passed by reference and to a parameter that's passed by value. In the former case, the reference is bound to a temporary with full-expression lifetime, and it is totally reasonable to return its address. In the latter case, the parameter object will often have a lifetime bound by the current function, and you cannot return its address without it being UB.

Looking closer, my previous guess that this was accidentally triggering for reference parameters seems to be wrong, because that case is filtered out earlier in the function. I'm not sure what the problem is exactly, then, unless the code indeed has UB.

I can't easily tell you because the "standalone example" involves a million templates that I don't know anything about, and nobody seems to have done the analysis to identify the specific source of the miscompile.

Sure; @alexfh @vitalybuka can you both please help produce a further reduced test case from https://reviews.llvm.org/D74094#4633799 ? Otherwise I'm beginning to think this patch may have been reverted prematurely, before an understood + reduced test case was produced.

What I can tell you is that there is an enormous semantic difference between passing a pr-value argument to a parameter that's passed by reference and to a parameter that's passed by value. In the former case, the reference is bound to a temporary with full-expression lifetime, and it is totally reasonable to return its address. In the latter case, the parameter object will often have a lifetime bound by the current function, and you cannot return its address without it being UB.

I'm trying to come up with an example of "the former."

struct foo {
  int x;
};

// Returns a reference to the minimum foo.
foo &min(foo &a, foo &b);

void consume (foo&);

void bar () {
  consume(min(foo{0}, foo{1}));
}

doesn't compile:

foo.cpp:11:11: error: no matching function for call to 'min'
   11 |   consume(min(foo{0}, foo{1}));
      |           ^~~
foo.cpp:6:6: note: candidate function not viable: expects an lvalue for 1st argument
    6 | foo &min(foo &a, foo &b);
      |      ^   ~~~~~~

Have I misinterpreted what you meant by "passing a pr-value argument to a parameter that's passed by reference?"

Looking closer, my previous guess that this was accidentally triggering for reference parameters seems to be wrong, because that case is filtered out earlier in the function. I'm not sure what the problem is exactly, then, unless the code indeed has UB.

In which case, perhaps this was reverted prematurely.

I can't easily tell you because the "standalone example" involves a million templates that I don't know anything about, and nobody seems to have done the analysis to identify the specific source of the miscompile.

Sure; @alexfh @vitalybuka can you both please help produce a further reduced test case from https://reviews.llvm.org/D74094#4633799 ? Otherwise I'm beginning to think this patch may have been reverted prematurely, before an understood + reduced test case was produced.

Well, it's good to be conservative about adding new optimizations if they're causing apparent miscompiles. Even if the optimization is just exposing existing UB in some new way, that's something we should understand as a project. I have no problem with the eager revert.

It would be great if Alex and Vitaly could help track down the exact problem. However, if they can't, I think it's ultimately your responsibility as the person hoping to land the optimization to understand why the miscompile is not in fact your fault.

What I can tell you is that there is an enormous semantic difference between passing a pr-value argument to a parameter that's passed by reference and to a parameter that's passed by value. In the former case, the reference is bound to a temporary with full-expression lifetime, and it is totally reasonable to return its address. In the latter case, the parameter object will often have a lifetime bound by the current function, and you cannot return its address without it being UB.

I'm trying to come up with an example of "the former."

struct foo {
  int x;
};

// Returns a reference to the minimum foo.
foo &min(foo &a, foo &b);

void consume (foo&);

void bar () {
  consume(min(foo{0}, foo{1}));
}

doesn't compile:

foo.cpp:11:11: error: no matching function for call to 'min'
   11 |   consume(min(foo{0}, foo{1}));
      |           ^~~
foo.cpp:6:6: note: candidate function not viable: expects an lvalue for 1st argument
    6 | foo &min(foo &a, foo &b);
      |      ^   ~~~~~~

Have I misinterpreted what you meant by "passing a pr-value argument to a parameter that's passed by reference?"

Okay, to be frank, this is a bit of a red flag for me. Optimizations like this are ultimately rooted in language rules, and I'm not sure you can effectively do this work if you don't understand the basics of C++ parameter passing. In C++, you cannot bind a non-const l-value reference to a temporary, but you can bind either a const l-value reference or an x-value reference to one. So you could declare min as either const foo &min(const foo &a, const foo &b); or foo &&min(foo &&a, foo &&b);.
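
For example, a minimal sketch of the first case: with const l-value references, the temporaries are bound to the references and live until the end of the full-expression, so returning one of them and then consuming it is fine:

struct foo { int x; };

// Returns a reference to the minimum foo; callable with temporaries, which
// live until the end of the full-expression in the caller.
const foo &min(const foo &a, const foo &b) { return a.x < b.x ? a : b; }

void consume(const foo &);

void bar() {
  consume(min(foo{0}, foo{1}));  // the temporaries outlive the call to consume
}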

I can't easily tell you because the "standalone example" involves a million templates that I don't know anything about, and nobody seems to have done the analysis to identify the specific source of the miscompile.

Sure; @alexfh @vitalybuka can you both please help produce a further reduced test case from https://reviews.llvm.org/D74094#4633799 ? Otherwise I'm beginning to think this patch may have been reverted prematurely, before an understood + reduced test case was produced.

Well, it's good to be conservative about adding new optimizations if they're causing apparent miscompiles. Even if the optimization is just exposing existing UB in some new way, that's something we should understand as a project. I have no problem with the eager revert.

It would be great if Alex and Vitaly could help track down the exact problem. However, if they can't, I think it's ultimately your responsibility as the person hoping to land the optimization to understand why the miscompile is not in fact your fault.

I've cloned:

  1. abseil @ f3eae68bd1d17df708f2b59a43ad7e837a616e6a
  2. eigen @ 8f858a4ea8106b52f3cf475ffc5e0e9c6a91baa2
  3. googletest @ eab0e7e289db13eabfc246809b0284dac02a369d

I had to build googletest. Then I built a.out via:

$ clang++ foo.cpp -I eigen/ -I googletest/googletest/include/ \
  -I googletest/googlemock/include/ -I abseil-cpp googletest/build/lib/libgtest.a \
  -std=c++17 -O3 -fsanitize=address -march=haswell

With this change re-applied (b7f4915644844fb9f32e8763922a070f5fe4fd29 reverted), a.out runs without issue. So yes, I need something other than a godbolt link that uses clang trunk (which isn't stable across a revert) to understand what they are talking about.

What I can tell you is that there is an enormous semantic difference between passing a pr-value argument to a parameter that's passed by reference and to a parameter that's passed by value. In the former case, the reference is bound to a temporary with full-expression lifetime, and it is totally reasonable to return its address. In the latter case, the parameter object will often have a lifetime bound by the current function, and you cannot return its address without it being UB.

I'm trying to come up with an example of "the former."

struct foo {
  int x;
};

// Returns a reference to the minimum foo.
foo &min(foo &a, foo &b);

void consume (foo&);

void bar () {
  consume(min(foo{0}, foo{1}));
}

doesn't compile:

foo.cpp:11:11: error: no matching function for call to 'min'
   11 |   consume(min(foo{0}, foo{1}));
      |           ^~~
foo.cpp:6:6: note: candidate function not viable: expects an lvalue for 1st argument
    6 | foo &min(foo &a, foo &b);
      |      ^   ~~~~~~

Have I misinterpreted what you meant by "passing a pr-value argument to a parameter that's passed by reference?"

Okay, to be frank, this is a bit of a red flag for me. Optimizations like this are ultimately rooted in language rules, and I'm not sure you can effectively do this work if you don't understand the basics of C++ parameter passing.

Fuck me for trying to help, right? If x-values are part of the "basics" of parameter passing, I'm afraid to ask about the more complicated cases.

In C++, you cannot bind a non-const l-value reference to a temporary, but you can bind either a const l-value reference or an x-value reference to one. So you could declare min as either const foo &min(const foo &a, const foo &b); or foo &&min(foo &&a, foo &&b);.

Thanks, that's all I needed.

With b7f4915644844fb9f32e8763922a070f5fe4fd29 reverted, neither of those cases look fishy to me:
https://gist.github.com/nickdesaulniers/1c7b962222639e54c3339b63b572b166

Fuck me for trying to help, right? If x-values are part of the "basics" of parameter passing, I'm afraid to ask about the more complicated cases.

I can see how my response was somewhat aggressive, and I regret that and apologize. I imagine you're probably approaching this from the perspective of general optimization and are unhappy that you're getting bogged down in all this C++ minutiae. It is the nature of high-level optimization, though, that you often have to learn a lot of language details in order to get your job done. I'm happy to teach some of these concepts during review — I wouldn't expect more than, like, twenty people on the planet to know the differences between parameter and temporary lifetimes off-hand, and I understand that non-experts in C++ aren't going to know all these parameter initialization rules by heart. But I am surprised at some of the things we've had to cover, like that it's not generally okay for a function to return a reference to a by-value parameter. I am volunteering my time to try to help you with your goal of landing this patch, and it can be a little frustrating to spend that time on this stuff.

If you're able to build Alex's test case, I would suggest diffing the IR output both with and without your original patch. -O0 output is probably fine for this and will be a lot easier to track back to the IRGen code responsible for it. You may see something that immediately stands out as wrong, but at the very least, it'll tell us what's changed.

With this change re-applied (b7f4915644844fb9f32e8763922a070f5fe4fd29 reverted), a.out runs without issue. So yes, I need something other than a godbolt link that uses clang trunk (which isn't stable across a revert) to understand what they are talking about.

Slight update here, I'm still working on this (slowly), but I now have a reproducer in hand.

// clang++ foo.cpp -I eigen -fsanitize=address  -std=c++17 -O3 -march=haswell

#include <unsupported/Eigen/CXX11/Tensor>

using Tensor = Eigen::TensorFixedSize<float, Eigen::Sizes<2, 2>, Eigen::RowMajor, Eigen::DenseIndex>;

auto round (Tensor m) {
    return (m + 0.5f).cast<int>().cast<float>();
}

int main() {
  Tensor a;
  a.setValues({{1, 2}, {3, 4}});

  const Tensor t3 = round(a.log().exp());
  return t3(0, 0);
}

with eigen @ 8f858a4ea810, and llvm @ 102715a60ee4 with b7f491564484 reverted (i.e. this commit re-applied a second time).

I don't yet fully comprehend what's going wrong, and probably need to familiarize myself with the language rules around auto's type deduction.

This can probably be reduced further as well; I'll try to preprocess it and run it through cvise to see if we can get something even more concise.

I'll also check the codegen differences from removing the -fsanitize= option (and perhaps -march=haswell, though I'd guess eigen is doing something with that deep inside).

I don't yet fully comprehend what's going wrong, and probably need to familiarize myself with the language rules around auto's type deduction.

For reduction purposes, it might be useful to factor out the auto type deduction on the return. I think you can do that with the help of -Xclang -ast-dump. E.g., for this function:

auto f() { return 1; };

the AST dump tells me the return type is int:

|-FunctionDecl 0x1508e5698 <t.cpp:1:1, col:22> col:6 f 'int ()'
| `-CompoundStmt 0x1508e5928 <col:10, col:22>
|   `-ReturnStmt 0x1508e5918 <col:12, col:19>
|     `-IntegerLiteral 0x1508e5780 <col:19> 'int' 1

You could then replace the auto with the type given in the dump.

I don't yet fully comprehend what's going wrong, and probably need to familiarize myself with the language rules around auto's type deduction.

For reduction purposes, it might be useful to factor out the auto type deduction on the return. I think you can do that with the help of -Xclang -ast-dump.

Heh, yeah I tried that, and hit an assertion before we were able to dump the full AST.

| | |     | `-TypedefDecl 0x7f4a5cf03fb0 <line:72:3, line:73:106> col:106 referenced SegmentWrapper 'std::conditional_t<CanAlign, Ref<const Matrix<Scalar, Dynamic, 1, 0, blockSize, 1>, internal::evaluator<VectorTypeCopyClean>::Alignment>, typename VectorTypeCopyClean::ConstSegmentReturnType>':'clang-18: /android0/llvm-project/clang/lib/AST/NestedNameSpecifier.cpp:309: void clang::NestedNameSpecifier::print(raw_ostream &, const PrintingPolicy &, bool) const: Assertion `!isa<ElaboratedType>(T) && "Elaborated type in nested-name-specifier"' failed.

2.75 hours of cvise later: https://github.com/llvm/llvm-project/issues/43179#issuecomment-1719922669

Can probably either rebuild with assertions disabled, or put a clearly wrong type and have the compiler tell me what it was. Doing the latter:

const std::conditional_t<internal::is_same<float, CoeffReturnType>::value, TensorConversionOp<int, const TensorCwiseUnaryOp<bind2nd_op<scalar_sum_op<float, float>>, const TensorFixedSize<float, Sizes<2, 2>, 1, long>>>, TensorConversionOp<float, const TensorConversionOp<int, const TensorCwiseUnaryOp<bind2nd_op<scalar_sum_op<float, float>>, const TensorFixedSize<float, Sizes<2, 2>, 1, long>>>>>' (aka 'const Eigen::TensorConversionOp<float, const Eigen::TensorConversionOp<int, const Eigen::TensorCwiseUnaryOp<Eigen::internal::bind2nd_op<Eigen::internal::scalar_sum_op<float>>, const Eigen::TensorFixedSize<float, Eigen::Sizes<2, 2>, 1>>>>

I don't yet fully comprehend what's going wrong, and probably need to familiarize myself with the language rules around auto's type deduction.

For reduction purposes, it might be useful to factor out the auto type deduction on the return. I think you can do that with the help of -Xclang -ast-dump.

Heh, yeah I tried that, and hit an assertion before we were able to dump the full AST.

| | |     | `-TypedefDecl 0x7f4a5cf03fb0 <line:72:3, line:73:106> col:106 referenced SegmentWrapper 'std::conditional_t<CanAlign, Ref<const Matrix<Scalar, Dynamic, 1, 0, blockSize, 1>, internal::evaluator<VectorTypeCopyClean>::Alignment>, typename VectorTypeCopyClean::ConstSegmentReturnType>':'clang-18: /android0/llvm-project/clang/lib/AST/NestedNameSpecifier.cpp:309: void clang::NestedNameSpecifier::print(raw_ostream &, const PrintingPolicy &, bool) const: Assertion `!isa<ElaboratedType>(T) && "Elaborated type in nested-name-specifier"' failed.

That can be worked around by adding -Xclang -ast-dump-filter=<function>, in this case -Xclang -ast-dump -Xclang -ast-dump-filter=round:
https://gist.github.com/nickdesaulniers/4ea4b46f112f1b3bc440020f3a2a01e1

No real comment on the issue itself but on the example as a former Eigen maintainer (sorry for the noise if that's all obvious for you):

auto round (Tensor m) {
    return (m + 0.5f).cast<int>().cast<float>();
}

does not return a Tensor but an expression encoding the computation which is evaluated during the assignment const Tensor t3 = round(a.log().exp()); using an overloaded operator=. With "evaluation" I mean evaluation in the sense of performing the intended mathematical computation.

Many things can go wrong when using auto in such cases, of which two seem to be relevant here:

  1. Eigen can (this is an implementation detail(!)) decide to evaluate subexpressions into temporaries. The returned expression would then reference the result of such an evaluation beyond its lifetime.
  2. If 1. does not happen, the unevaluated expression would reference its arguments. The immediate 0.5f is copied since that's cheap, but the tensor would be referenced. Your example function round takes its parameter by-value and the returned expression would reference it. I'm unsure if the lifetime would be extended in this case (again, the exact details of C++ lifetime rules are not my area of expertise). I think if the lifetime would be extended, ASAN complaining after applying this patch is wrong, if the lifetime is not extended ASAN should complain and the example is wrong.

Btw, these issues are so common that Eigen documents the recommendation to avoid using auto with Eigen types altogether: https://eigen.tuxfamily.org/dox/TopicPitfalls.html#title3
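
A toy sketch of the second case (not Eigen's actual implementation): the returned expression object stores a reference to the by-value parameter, which is gone by the time the expression is evaluated:

#include <cstdio>

struct Tensor { float v[4]; };

// Toy "expression template": stores a reference to its operand rather than a copy.
struct AddExpr {
  const Tensor &lhs;
  float rhs;
  float coeff(int i) const { return lhs.v[i] + rhs; }
};

AddExpr round(Tensor m) {    // m is a by-value parameter
  return AddExpr{m, 0.5f};   // the returned expression references m...
}                            // ...whose storage is gone after round returns

int main() {
  Tensor a{{1, 2, 3, 4}};
  AddExpr e = round(a);              // e.lhs now dangles
  std::printf("%f\n", e.coeff(0));   // use-after-scope that ASan can flag
}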

No real comment on the issue itself but on the example as a former Eigen maintainer (sorry for the noise if that's all obvious for you):

auto round (Tensor m) {
    return (m + 0.5f).cast<int>().cast<float>();
}

does not return a Tensor but an expression encoding the computation which is evaluated during the assignment const Tensor t3 = round(a.log().exp()); using an overloaded operator=. With "evaluation" I mean evaluation in the sense of performing the intended mathematical computation.

Many things can go wrong when using auto in such cases, of which two seem to be relevant here:

  1. Eigen can (this is an implementation detail(!)) decide to evaluate subexpressions into temporaries. The returned expression would then reference the result of such an evaluation beyond its lifetime.
  2. If 1. does not happen, the unevaluated expression would reference its arguments. The immediate 0.5f is copied since that's cheap, but the tensor would be referenced. Your example function round takes its parameter by-value and the returned expression would reference it. I'm unsure if the lifetime would be extended in this case (again, the exact details of C++ lifetime rules are not my area of expertise). I think if the lifetime would be extended, ASAN complaining after applying this patch is wrong, if the lifetime is not extended ASAN should complain and the example is wrong.

Btw, these issues are so common that Eigen documents the recommendation to avoid using auto with Eigen types altogether: https://eigen.tuxfamily.org/dox/TopicPitfalls.html#title3

Okay, thanks, I can see how that works, and I definitely would never have figured that out from just looking at this code. The formal lifetime of a parameter of type Tensor is definitely in the land of implementation-defined behavior, so this code seems to be at best non-portable.

Nick, maybe we can take a new look at this patch from that perspective. You've been trying to use very tight lifetime bounds for these objects that only cover the duration of call, which essentially means you're defining the lifetime of parameter objects to just be the call rather than the full-expression (at least, when the parameter type doesn't have a destructor). In the abstract, that's probably the right rule for Clang to adopt, because I think it matches programmer intuition (it's always wrong to return the address of a local, and that includes by-value parameters), and it means we'd have a consistent rule for all types and targets. It's also a fairly aggressive rule that's likely to uncover a fair amount of code like this that's assuming longer lifetimes. I think it's reasonable to say that these examples are all program bugs that should be fixed, but it might be better to pursue that as a *refinement* on top of an initially more conservative rule. I suspect that that conservative rule will also give us a large fraction of the optimization benefit that we can expect from this change, because at least parameter objects from calls in different full-expressions will be able to share memory. So we can start by giving these objects full-expression lifetime, chase down any program bugs that that uncovers (which will all *unquestionably* be program bugs under the standard), and then gradually work on landing the more aggressive rule (perhaps even for non-trivial types).
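
To illustrate the kind of bug that would still be flagged (a minimal sketch of my own, not taken from the reports above): using a parameter's address after the enclosing full-expression is UB under either choice of lifetime:

struct S { int v; };

S *keep(S s) { return &s; }    // escapes the address of a by-value parameter

int use() {
  S *p = keep(S{1});           // the full-expression ends at the ';'
  return p->v;                 // UB under either rule: s's lifetime has ended
}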

So we can start by giving these objects full-expression lifetime, chase down any program bugs that that uncovers (which will all *unquestionably* be program bugs under the standard), and then gradually work on landing the more aggressive rule (perhaps even for non-trivial types).

maybe provide a new flag to control this behaviour?

So we can start by giving these objects full-expression lifetime, chase down any program bugs that that uncovers (which will all *unquestionably* be program bugs under the standard), and then gradually work on landing the more aggressive rule (perhaps even for non-trivial types).

maybe provide a new flag to control this behaviour?

You mean for when we start limiting lifetimes to the call rather than the full-expression? I don't know that we want such a flag in the long term, but having at least a -cc1 flag is usually helpful when we're bringing up new/stricter optimizations, yeah.

Yes, a cc1 flag would be useful.

(Sorry for the wall of text)

No real comment on the issue itself but on the example as a former Eigen maintainer (sorry for the noise if that's all obvious for you):

Woah! Deus Ex Machina?

auto round (Tensor m) {
    return (m + 0.5f).cast<int>().cast<float>();
}

does not return a Tensor but an expression encoding the computation which is evaluated during the assignment const Tensor t3 = round(a.log().exp()); using an overloaded operator=. With "evaluation" I mean evaluation in the sense of performing the intended mathematical computation.

Many things can go wrong when using auto in such cases, of which two seem to be relevant here:

  1. Eigen can (this is an implementation detail(!)) decide to evaluate subexpressions into temporaries. The returned expression would then reference the result of such an evaluation beyond its lifetime.
  2. If 1. does not happen, the unevaluated expression would reference its arguments. The immediate 0.5f is copied since that's cheap, but the tensor would be referenced. Your example function round takes its parameter by-value and the returned expression would reference it. I'm unsure if the lifetime would be extended in this case (again, the exact details of C++ lifetime rules are not my area of expertise). I think if the lifetime would be extended, ASAN complaining after applying this patch is wrong, if the lifetime is not extended ASAN should complain and the example is wrong.

Btw, these issues are so common that Eigen documents the recommendation to avoid using auto with Eigen types altogether: https://eigen.tuxfamily.org/dox/TopicPitfalls.html#title3

Thanks for the info, https://eigen.tuxfamily.org/dox/TopicPitfalls.html#title3 is quite helpful. That helps explain why the return value was deduced to a const Eigen::TensorConversionOp<..... Sounds like this is some kind of lazy evaluation?

Of note, I modified the reduced test case to use a static function with auto return type. But the code as written originally used a lambda!

auto round = [](Tensor m) {
  return (m + 0.5f).cast<int>().cast<float>();
};

So the original author was well-intentioned; generally you want to use auto rather than trying to spell out the full type of the lambda. But the author may not have been aware of the advice or hazards described by https://eigen.tuxfamily.org/dox/TopicPitfalls.html#title3. Our fix, which we've committed, is to use a trailing return type (->) for the lambda.

-auto round = [](Tensor m) {
+auto round = [](Tensor m) -> Tensor {
   return (m + 0.5f).cast<int>().cast<float>();
 };

Which I think is in line with the advice in https://eigen.tuxfamily.org/dox/TopicPitfalls.html#title3 (i.e. evaluate the expression so that we don't have some sort of partially evaluated calculation containing references to temporaries). So I think the code in question (that we reverted this over) was just wrong.

One thing that's problematic is that neither clang nor gcc does a great job with -Wreturn-stack-address (clang) or -Wreturn-local-addr (GCC): https://godbolt.org/z/6oq5nKnY7. Clang only spots obviously incorrect examples; GCC's behavior depends on the optimization level.

I wonder if there's anything more we can do to help spot these at compile time during semantic analysis. It seems that for more complex types it could even be very beneficial. The examples in Eigen come to mind (could we catch those, for example?).
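
For example (a contrived sketch), the direct case is diagnosed today, but presumably not the case where the reference escapes through another call, since the check is purely local:

struct foo { int x; };

foo &identity(foo &r) { return r; }

// Diagnosed by -Wreturn-stack-address: a reference to a by-value parameter
// is returned directly.
foo &direct(foo p) { return p; }

// Presumably not diagnosed: the parameter's address escapes through a call,
// which a purely local check can't see.
foo &indirect(foo p) { return identity(p); }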


Okay, thanks, I can see how that works, and I definitely would never have figured that out from just looking at this code. The formal lifetime of a parameter of type Tensor is definitely in the land of implementation-defined behavior, so this code seems to be at best non-portable.

Right, so that was my first question here; sounds like the optimization is correct wrt. the example that broke. Though I still could not tell you precisely why. Perhaps: "round returns an object, but one that contains references to temporary variables."

Nick, maybe we can take a new look at this patch from that perspective. You've been trying to use very tight lifetime bounds for these objects that only cover the duration of call, which essentially means you're defining the lifetime of parameter objects to just be the call rather than the full-expression (at least, when the parameter type doesn't have a destructor). In the abstract, that's probably the right rule for Clang to adopt, because I think it matches programmer intuition (it's always wrong to return the address of a local, and that includes by-value parameters), and it means we'd have a consistent rule for all types and targets. It's also a fairly aggressive rule that's likely to uncover a fair amount of code like this that's assuming longer lifetimes. I think it's reasonable to say that these examples are all program bugs that should be fixed, but it might be better to pursue that as a *refinement* on top of an initially more conservative rule. I suspect that that conservative rule will also give us a large fraction of the optimization benefit that we can expect from this change, because at least parameter objects from calls in different full-expressions will be able to share memory. So we can start by giving these objects full-expression lifetime, chase down any program bugs that that uncovers (which will all *unquestionably* be program bugs under the standard), and then gradually work on landing the more aggressive rule (perhaps even for non-trivial types).

Let me see if I can picture how the lifetimes are expected to work, for my own sake. Let's say we have functions foo, bar, and baz that all take a struct widget by value and return a struct widget. Then let's say we have code like:

struct widget { long x, y, z, w; };

widget foo(widget);
widget bar(widget);
widget baz(widget);

void quux () {
    widget w = foo(bar(baz({ .x = 42, })));
}

With b7f4915644844fb9f32e8763922a070f5fe4fd29 reverted, we'd get:

define dso_local void @_Z4quuxv() local_unnamed_addr #0 {
entry:
  %w = alloca %struct.widget, align 8
  %agg.tmp = alloca %struct.widget, align 8
  %agg.tmp1 = alloca %struct.widget, align 8
  %agg.tmp2 = alloca %struct.widget, align 8
  call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %w) #4
  call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp) #4
  call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp1) #4
  call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp2) #4
  call void @llvm.memset.p0.i64(ptr noundef nonnull align 8 dereferenceable(32) %agg.tmp2, i8 0, i64 32, i1 false)
  store i64 42, ptr %agg.tmp2, align 8, !tbaa !5
  call void @_Z3baz6widget(ptr nonnull sret(%struct.widget) align 8 %agg.tmp1, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp2)
  call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp2) #4
  call void @_Z3bar6widget(ptr nonnull sret(%struct.widget) align 8 %agg.tmp, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp1)
  call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp1) #4
  call void @_Z3foo6widget(ptr nonnull sret(%struct.widget) align 8 %w, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp)
  call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp) #4
  call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %w) #4
  ret void
}

which is very aggressive; each temporary is marked dead immediately after the call that consumes it. This looks like it would be optimal. IIUC, @rjmccall's proposal is instead to do something a la:

define dso_local void @_Z4quuxv() local_unnamed_addr #0 {
entry:
  %w = alloca %struct.widget, align 8
  %agg.tmp = alloca %struct.widget, align 8
  %agg.tmp1 = alloca %struct.widget, align 8
  %agg.tmp2 = alloca %struct.widget, align 8
  call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %w) #4
  call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp) #4
  call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp1) #4
  call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp2) #4
  call void @llvm.memset.p0.i64(ptr noundef nonnull align 8 dereferenceable(32) %agg.tmp2, i8 0, i64 32, i1 false)
  store i64 42, ptr %agg.tmp2, align 8, !tbaa !5
  call void @_Z3baz6widget(ptr nonnull sret(%struct.widget) align 8 %agg.tmp1, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp2)
  call void @_Z3bar6widget(ptr nonnull sret(%struct.widget) align 8 %agg.tmp, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp1)
  call void @_Z3foo6widget(ptr nonnull sret(%struct.widget) align 8 %w, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp)
  call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp2) #4
  call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp1) #4
  call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp) #4
  call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %w) #4
  ret void
}

i.e. to have a less aggressive optimization where the lifetimes extend to the full expression, and the temporaries would not be able to share stack slots. That's better than the status quo (no lifetime markers at all) but still suboptimal, and I don't think it will solve the particular case I care about (a particular Linux kernel driver written in C which is passing many aggregates by value).

Do I want to create a flag for that? Not particularly; it does mean supporting BOTH codegen patterns. I'd rather have a flag to control this optimization, or not do it at all, than support two different optimizations. I'm also curious whether you'd expect the default value of the flag to be on (more aggressive) or not (less aggressive)? I don't want to have to add -Xclang (or -mllvm) flags to my build system to opt into secret optimizations. If folks find that this optimization breaks their incorrect code, they can use the already existing `-Xclang -disable-lifetime-markers` to opt out (though that disables more optimizations than just those introduced by this change). I like that pattern better, since using those flags is itself a red flag to be scrutinized during code review. Then again, GCC has -fconserve-stack, which clang does not; we could put more aggressive stack-saving optimizations under that, though IME in GCC that flag limits inline substitution to a magic per-target-architecture number of bytes. Or hide it under -O3, perhaps?

At the least, a release note acknowledging that "you're not crazy; clang-18 is now more precise about minimizing the lifetime of temporaries (since the C++ spec says it's implementation-defined) in order to reduce stack usage. ASAN is your friend. If the code works under -Xclang -disable-lifetime-markers then this is likely the culprit." would be an improvement if we were to reland this.

Personally, I feel that "we found a buggy example of code we broke, we fixed it. Let's reland it and tell people to be careful when upgrading." But I haven't been around long enough to know what the precedent here is for "aggressive optimizations." Do you feel strongly that we should not just reland this, @rjmccall (or anyone else)?


Converting the above widget example to C, I noticed that we're also missing the lifetime markers on compound literals!

struct widget { long x, y, z, w; };

struct widget foo(struct widget);
struct widget bar(struct widget);
struct widget baz(struct widget);

void quux () {
    struct widget w = foo(bar(baz((struct widget){ .x = 42, })));
}

->

define dso_local void @quux() local_unnamed_addr #0 {
entry:
  %w = alloca %struct.widget, align 8
  %agg.tmp = alloca %struct.widget, align 8
  %agg.tmp1 = alloca %struct.widget, align 8
  %.compoundliteral = alloca %struct.widget, align 8    ; <--- NO LIFETIME MARKERS :(
  call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %w) #4
  call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp) #4
  call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %agg.tmp1) #4
  call void @llvm.memset.p0.i64(ptr noundef nonnull align 8 dereferenceable(32) %.compoundliteral, i8 0, i64 32, i1 false)
  store i64 42, ptr %.compoundliteral, align 8, !tbaa !5
  call void @baz(ptr nonnull sret(%struct.widget) align 8 %agg.tmp1, ptr noundef nonnull byval(%struct.widget) align 8 %.compoundliteral) #4
  call void @bar(ptr nonnull sret(%struct.widget) align 8 %agg.tmp, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp1) #4
  call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp1) #4
  call void @foo(ptr nonnull sret(%struct.widget) align 8 %w, ptr noundef nonnull byval(%struct.widget) align 8 %agg.tmp) #4
  call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %agg.tmp) #4
  call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %w) #4
  ret void
}

One thing I noticed while simply printing the expression when we were about to add the lifetime cleanup: some examples would trigger on CXXConstructExpr, CXXTemporaryObjectExpr, and CXXOperatorCallExpr. Those feel very C++ specific; I do wonder if we could do something differently for C, but I think we should not since, as @rjmccall said, "we'd have a consistent rule for all types and targets", which IMO is nicer+simpler.


I tried to write a verifier check to detect when a use was dominated by a call to the lifetime end intrinsic. I don't think that's what's going wrong here, but my initial hypothesis was that we were somehow getting this wrong in codegen, which would lead to bugs.

This failed on multiple examples in tree; llvm/test/CodeGen/AArch64/stack-tagging.ll has some examples that surprised me, such as values with lifetimes that restart after being ended:

call void @llvm.lifetime.start.p0(i64 4, ptr nonnull %x)
call void @use8(ptr %x) #3
call void @llvm.lifetime.end.p0(i64 4, ptr nonnull %x)

call void @llvm.lifetime.start.p0(i64 4, ptr nonnull %x)
call void @use8(ptr %x) #3
call void @llvm.lifetime.end.p0(i64 4, ptr nonnull %x)

which makes it so that we can't simply use a single dominance query to check for uses after the lifetime has ended (though I guess this is just a kind of bounds check; maybe if the use is then additionally dominated by a lifetime start, and that start is dominated by the end, then we're "resurrected"). Perhaps a separate discussion orthogonal to this bug, but after reading

another question unresolved in my mind is "why do we even need intrinsics for lifetime? why can't we use the initial use as the implicit lifetime start, and the final use as the implicit lifetime end?" I'm certain this is a naive question that probably depends on the semantics of a specific language frontend in particular (or even ABI; perhaps that's why the C++ spec mentions "implementation defined." I can't recall where I saw this, but for some reason I suspect the ABI used for Microsoft platforms differs here?). I also don't understand under what circumstances something that is marked dead can later be "brought back to life/resurrected." Why is that a thing?
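
For reference, a rough, untested sketch (not the check I actually wrote) of what a naive single-dominance-query version would look like; the resurrection pattern above is exactly what defeats it:

#include "llvm/IR/Dominators.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/InstIterator.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/IntrinsicInst.h"
#include "llvm/Support/raw_ostream.h"

using namespace llvm;

// Flag any use of an alloca that is dominated by a lifetime.end of that
// alloca. Deliberately naive: it does not account for a later lifetime.start
// "resurrecting" the object, which is why stack-tagging.ll trips it up.
static void reportUsesAfterLifetimeEnd(Function &F) {
  DominatorTree DT(F);
  for (Instruction &I : instructions(F)) {
    auto *End = dyn_cast<IntrinsicInst>(&I);
    if (!End || End->getIntrinsicID() != Intrinsic::lifetime_end)
      continue;
    // Operand 0 is the size; operand 1 is the pointer, usually the alloca
    // itself now that pointers are opaque.
    auto *AI =
        dyn_cast<AllocaInst>(End->getArgOperand(1)->stripPointerCasts());
    if (!AI)
      continue;
    for (User *U : AI->users()) {
      auto *UseI = dyn_cast<Instruction>(U);
      if (!UseI)
        continue;
      // Skip the lifetime markers themselves.
      if (auto *UseII = dyn_cast<IntrinsicInst>(UseI))
        if (UseII->getIntrinsicID() == Intrinsic::lifetime_start ||
            UseII->getIntrinsicID() == Intrinsic::lifetime_end)
          continue;
      if (DT.dominates(End, UseI))
        errs() << "use dominated by lifetime.end: " << *UseI << "\n";
    }
  }
}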

That's better than the status quo (no lifetime markers at all) but still suboptimal, and I don't think it will solve the particular case I care about (a particular Linux kernel driver written in C which is passing many aggregates by value).

Ah, all in the same full-expression? Then yeah, my idea wouldn't be good enough to optimize your situation.

Do I want to create a flag for that? Not particularly; it does mean supporting BOTH codegen patterns. I'd rather have a flag that controls this optimization, or not do it at all, than support two different optimizations.

The received wisdom here is that it's very helpful to have a flag because it lets users test the effect of different changes independently. Users mostly don't live on trunk, and LLVM releases are roughly every six months, so when users get a new compiler it's going to have at least six months of accumulated changes in it. If a project breaks with a new compiler, it's really great to be able to isolate which change is causing the problem without having to rebuild the compiler.

The fact that we have dynamic checking of this from ASan does help a lot, though.

Personally, I feel that "we found a buggy example of code we broke, we fixed it. Let's reland it and tell people to be careful when upgrading." But I haven't been around long enough to know what's the precedent here for "aggressive optimizations." Do you feel strongly that we should not just reland this, @rjmccall (or anyone else)?

That several things broke immediately after commit does suggest that there are quite a bit more project bugs out there waiting to blow up. I think having a dedicated flag for just this optimization is probably a good idea in the short term.

One thing I noticed while simply printing the expression when we were about to add the lifetime cleanup: some examples would trigger on CXXConstructExpr, CXXTemporaryObjectExpr, and CXXOperatorCallExpr. Those feel very C++-specific; I do wonder if we could do something differently for C, but I think we should not, since as @rjmccall said, "we'd have a consistent rule for all types and targets", which IMO is nicer and simpler.

Well, if the language rules are different, they're different. (More on that in a second.) As far as I know, the lifetime of a parameter variable in C is never specified as outliving the call, and the entire concept of a full-expression is a C++ism. We could do something more aggressive in C.

Converting the above widget example to C, I noticed that we're also missing the lifetime markers on compound literals!

Yeah, so this is probably because the lifetime of a compound literal is somewhat surprising in C: it's actually the containing block scope. We could emit lifetime markers, but we'd have to make sure we emitted them in the right place. In pure C, I think this is only detectable if the user actually takes the address of the literal; unlike C++, C doesn't implicitly expose the address of temporaries or have destructors that allow their lifetime to be directly tested. But in ObjC, you could have a compound literal with weak/strong references in it, which arguably are required to be destroyed at the close of the enclosing block.
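For concreteness, here is a small C sketch of that block-scope rule (my own illustration): the compound literal's address can legitimately be used after the full-expression that created it, so lifetime markers for it would have to end at the close of the block.

struct point { int x, y; };

int sum_later(int flag) {
  /* In C, the compound literal's storage persists until the end of the
   * enclosing block, not just the end of this full-expression. */
  struct point *p = &(struct point){ 1, 2 };
  if (flag)
    p->x += 10;          /* still valid */
  return p->x + p->y;    /* still valid: same enclosing block */
}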

That's better than the status quo (no lifetime markers at all) but still suboptimal, and I don't think it will solve the particular case I care about (a particular Linux kernel driver written in C which is passing many aggregates by value).

Ah, all in the same full-expression? Then yeah, my idea wouldn't be good enough to optimize your situation.

Yeah, here's an example:
https://github.com/torvalds/linux/blob/2cf0f715623872823a72e451243bbf555d10d032/drivers/gpu/drm/amd/display/dc/dml/calcs/dce_calcs.c#L564-L570
These nested function calls take a struct by value and return the same type. The struct contains a single unsigned long long. When targeting 32b ARM, we're kind of hamstrung (I think) by https://github.com/ARM-software/abi-aa/blob/main/aapcs32/aapcs32.rst#result-return; clang will "unwrap" this aggregate for other targets, but not 32b ARM, I think because of that final bullet point about composite types. Though, reading the next section, https://github.com/ARM-software/abi-aa/blob/main/aapcs32/aapcs32.rst#parameter-passing, I suspect clang may also be missing the logic referring to "NCRN". Perhaps LLVM performs that later, but by not unwrapping the parameters, the omission of lifetime markers is painful.
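The shape of that kernel code is roughly the following C sketch (the type and function names are illustrative stand-ins, not copied from the driver): every intermediate result is an aggregate temporary, and each by-value argument gets its own alloca, so without lifetime markers none of those stack slots can be merged.

struct q_fixed { unsigned long long value; };   /* single-member aggregate */

struct q_fixed fx_mul(struct q_fixed a, struct q_fixed b);
struct q_fixed fx_div(struct q_fixed a, struct q_fixed b);

struct q_fixed fx_scale(struct q_fixed n, struct q_fixed d, struct q_fixed k) {
  /* Nested by-value calls: per the ABI discussion above, on 32b ARM these
   * aggregates are not "unwrapped", so each temporary occupies its own
   * stack slot unless lifetime markers allow the slots to be coalesced. */
  return fx_mul(fx_div(n, d), k);
}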

I'm certain I could rewrite this code to not use a 1-element aggregate, but I think this is a common case where GCC does much better than clang in terms of stack usage. D74094 causes a 5x reduction in stack usage for this function in particular for 32b ARM targets, so I'd rather target the root issue here, which I think is in clang.

Do I want to create a flag for that? Not particularly; it does mean supporting BOTH codegen patterns. I'd rather have a flag that controls this optimization, or not do it at all, than support two different optimizations.

The received wisdom here is that it's very helpful to have a flag because it lets users test the effect of different changes independently. Users mostly don't live on trunk, and LLVM releases are roughly every six months, so when users get a new compiler it's going to have at least six months of accumulated changes in it. If a project breaks with a new compiler, it's really great to be able to isolate which change is causing the problem without having to rebuild the compiler.

The fact that we have dynamic checking of this from ASan does help a lot, though.

Ok, I will add that, along with a release note about it that also mentions that ASan can help detect this.

One thing I noticed while simply printing the expression when we were about to add the lifetime cleanup: some examples would trigger on CXXConstructExpr, CXXTemporaryObjectExpr, and CXXOperatorCallExpr. Those feel very C++-specific; I do wonder if we could do something differently for C, but I think we should not, since as @rjmccall said, "we'd have a consistent rule for all types and targets", which IMO is nicer and simpler.

Well, if the language rules are different, they're different. (More on that in a second.) As far as I know, the lifetime of a parameter variable in C is never specified as outliving the call, and the entire concept of a full-expression is a C++ism. We could do something more aggressive in C.

I could only find language in the latest draft about the lifetime of a parameter, and nothing about the lifetime of arguments. See footnote 206 in https://www.open-std.org/jtc1/sc22/wg14/www/docs/n3096.pdf, PDF page 178.

Converting the above widget example to C, I noticed that we're also missing the lifetime markers on compound literals!

Yeah, so this is probably because the lifetime of a compound literal is somewhat surprising in C: it's actually the containing block scope. We could emit lifetime markers, but we'd have to make sure we emitted them in the right place. In pure C, I think this is only detectable if the user actually takes the address of the literal; unlike C++, C doesn't implicitly expose the address of temporaries or have destructors that allow their lifetime to be directly tested. But in ObjC, you could have a compound literal with weak/strong references in it, which arguably are required to be destroyed at the close of the enclosing block.

Ack; will track that as a separate TODO.

  • undo ReturnsReference checks, add -Xclang -sloppy-temporary-lifetimes, add test for that, add release note
nickdesaulniers edited the summary of this revision. (Show Details)Sep 20 2023, 9:41 AM
  • re-add sentence I accidentally dropped from the release note
  • git clang-format HEAD~