This is an archive of the discontinued LLVM Phabricator instance.

Redefine deref(N) attribute family as point-in-time semantics (aka deref-at-point)
Abandoned · Public

Authored by reames on Sep 29 2021, 11:23 AM.

Details

Summary

This change shifts all of our mechanisms for specifying dereferenceability to be explicitly point-in-time. It covers the argument and return attribute forms of both dereferenceable(N) and dereferenceable_or_null(N), as well as the parallel family of metadata.
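For reference (names here are purely illustrative), the forms in question look roughly like this in IR:

  ; Argument attribute forms.
  define i32 @arg_form(i32* dereferenceable(4) %p,
                       i32* dereferenceable_or_null(8) %q) {
    %v = load i32, i32* %p
    ret i32 %v
  }

  ; Return attribute form.
  declare dereferenceable(8) i8* @ret_form()

  ; Metadata form, attached to loads of pointer values.
  define i32* @md_form(i32** %pp) {
    %p = load i32*, i32** %pp, !dereferenceable !0
    ret i32* %p
  }
  !0 = !{i64 8}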

The change in semantics is believed to be backwards compatible - that is, legacy bitcode that expected the previous semantics will not miscompile. There may be a non-trivial performance (i.e. optimization quality) impact; see the detailed discussion below.

A bit of history for context...

Our existing semantics (as implemented in the optimizer) were primarily scope-based. Once an argument was marked dereferenceable, the underlying memory was assumed to *remain* dereferenceable until the end of the dynamic scope. It has been pointed out a couple of times that this does not align with the way clang uses the attribute, and there have even been mentions of potential miscompiles.
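A minimal sketch of the difference (all names hypothetical; @unknown stands for an arbitrary call that might free memory):

  define i32 @example(i32* dereferenceable(4) %p, i1 %c) {
  entry:
    call void @unknown()      ; may free the memory %p points to
    br i1 %c, label %use, label %done
  use:
    %v = load i32, i32* %p
    ret i32 %v
  done:
    ret i32 0
  }
  declare void @unknown()

Under the old scoped semantics, %p was assumed to stay dereferenceable across @unknown, so the load in %use could be speculated up into %entry, past the call. Under point-in-time semantics, %p is only known dereferenceable on entry to @example, and that speculation is invalid unless @unknown can be shown not to free it.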

The tension arises from the fact that the legacy semantics happen to be really useful for other frontends, and we have been hesitant to give up the optimization power they imply. We don't really have a great mechanism for context-sensitive dereferenceability.

The last iteration of this - from 2019 - eventually evolved into a proposal to split the attributes into two variants: one scoped, one point-in-time. You can see the history and discussion on D61652, but the proposal never landed.

Earlier this year, I returned to the topic and worked with Johannes (the author of the last attempt) to come up with a proposal which allowed us - we thought - to use the point-in-time semantics while recovering the scoped semantics in all practical cases. Unfortunately, this attempt failed: it turns out we don't have broad agreement on what some of our function and argument attributes actually mean. See D99100 and linked reviews/RFC for the history.

While this latest effort didn't succeed for *all* cases where the scoped semantics might be useful, it did succeed for one in particular: the GC'd language use case which motivated my strong objections to Johannes' earlier proposals.

All of this is to say, we're going back to the original idea of simply redefining the attributes to the weaker semantics, and giving up some optimization potential in the process.

Expected code quality impact

There may be a negative code quality (performance) impact on some workloads. In particular, there is great uncertainty about the impact on Rust code compiled by rustc. On the C/C++/Clang side, no large impact has been identified to date, but we also have not run extensive benchmarks. (Reviewer help with that would be greatly appreciated!)

The risk of negative code quality impact is reduced for LTO and languages with larger modules. We have managed to strengthen the (on-by-default) inference of the nosync and nofree function attributes, and one of the few cases we got everyone on board with was that an argument to such a function couldn't be freed within its dynamic scope. As a result, attribute inference will tend to extend point-in-time semantics to scoped semantics in many common cases, provided we can see the whole call graph.
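A sketch of what that inference buys us (hypothetical function; assume whole-module visibility let us infer the function attributes shown):

  ; Inferred nofree nosync means nothing can free %p for the duration
  ; of the call, so %p is effectively dereferenceable throughout @g -
  ; the old scoped guarantee, recovered from point-in-time facts.
  define i32 @g(i32* dereferenceable(4) %p, i1 %c) nofree nosync {
    br i1 %c, label %use, label %done
  use:
    %v = load i32, i32* %p    ; speculatable anywhere in @g again
    ret i32 %v
  done:
    ret i32 0
  }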

However, once we inline that function out of existence, we will tend to lose the strong deref fact in the caller's scope. This means there's a pass-ordering interaction which is undesirable, but also hard to avoid, as we don't yet have a fully context-sensitive deref analysis. We may, in the future, need to consider implementing one. (To be clear, this is not a totally new problem. The old scoped semantics had the same issue for the argument attributes, which were the most heavily used.)
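Concretely (hypothetical names), the fact lives on the callee's signature and dissolves with it:

  define void @callee(i32* dereferenceable(4) %p) nofree nosync {
    %v = load i32, i32* %p    ; freely speculatable while @callee exists
    ret void
  }

  define void @caller(i32* %p) {
    call void @callee(i32* %p)
    ret void
  }
  ; After inlining, @callee's body lands in @caller with no record that
  ; %p was dereferenceable for the (now dissolved) call's entire scope.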

If you think you might have a code quality impact, you can test using the -use-dereferenceable-at-point-semantics=false flag. Once confirmed, help reproducing and reducing would be much appreciated.
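For example (assuming the flag is a standard cl::opt, so it reaches clang via -mllvm):

  # Directly on IR:
  opt -O2 -use-dereferenceable-at-point-semantics=false in.ll -S -o out.ll

  # Through the clang driver:
  clang -O2 -mllvm -use-dereferenceable-at-point-semantics=false foo.c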

On test changes

I had previously made an effort to go through and add "nofree nosync" to tests where doing so seemed to preserve the spirit of the test. What's left are cases where (a) the semantic change is meaningful, or (b) the test has been updated since my last sweep. Of these, the only ones I really haven't investigated are the AMDGPU changes.

Diff Detail

Event Timeline

reames created this revision.Sep 29 2021, 11:23 AM
reames requested review of this revision.Sep 29 2021, 11:23 AM
Herald added projects: Restricted Project, Restricted Project.Sep 29 2021, 11:23 AM
reames edited the summary of this revision.Sep 29 2021, 11:42 AM
nlopes accepted this revision.Oct 1 2021, 2:43 AM

LGTM. We have discussed this before. It's important to fix the miscompile with reference types in C++.

llvm/test/Analysis/BasicAA/dereferenceable.ll
66

This one is unfortunate. They can't alias, because objsize(%obj) < the dereferenceable size of %ret. This case should be trivial to catch.
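(A hypothetical reduction of the situation; function names are just for illustration:)

  define i1 @cmp() {
    ; %obj is a 4-byte object; %ret has 8 dereferenceable bytes at the
    ; point of the call, so %ret cannot point into %obj: no alias.
    %obj = alloca i32
    %ret = call dereferenceable(8) i8* @get()
    %obj.i8 = bitcast i32* %obj to i8*
    %eq = icmp eq i8* %obj.i8, %ret
    ret i1 %eq
  }
  declare dereferenceable(8) i8* @get()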

This revision is now accepted and ready to land.Oct 1 2021, 2:43 AM
nikic requested changes to this revision.Oct 1 2021, 3:26 AM

Sorry, but the fact that there is still no way to opt-in to the old behavior is still a blocker from my side. If we can't use dereferenceable + nofree arguments for that purpose, then we need to provide a different way to do that. Like dereferenceable + really_nofree. It looks like the current implementation doesn't even accept the dereferenceable + nofree + noalias case originally proposed (which is pretty bad from a design perspective, but would at least work fairly well for rustc in practice). I don't think that our current analysis capabilities are sufficient to land this change at this time.

This revision now requires changes to proceed.Oct 1 2021, 3:26 AM
reames added a comment.Oct 1 2021, 3:25 PM

Sorry, but the fact that there is still no way to opt-in to the old behavior is still a blocker from my side. If we can't use dereferenceable + nofree arguments for that purpose, then we need to provide a different way to do that. Like dereferenceable + really_nofree. It looks like the current implementation doesn't even accept the dereferenceable + nofree + noalias case originally proposed (which is pretty bad from a design perspective, but would at least work fairly well for rustc in practice). I don't think that our current analysis capabilities are sufficient to land this change at this time.

@nikic Do you have any specific examples of where this causes a workload to regress? At this point, I really need something specific as opposed to a general concern. We're at the point where perfection is very much the enemy of the good here. As noted, I've already spent a lot of time trying to minimize impact.

ychen added a subscriber: ychen.Oct 2 2021, 12:39 PM

This really needs to be properly benchmarked.

llvm/test/Transforms/InstCombine/AMDGPU/memcpy-from-constant.ll
9

No longer happens.

llvm/test/Transforms/LICM/hoist-deref-load.ll
419

Regression in that C code?

@nikic ping on previous question. It's been a month, and this has been LGTMed. Without response, I plan to land this.

This really needs to be properly benchmarked.

This has been benchmarked on every workload I care about, and shows no interesting regressions. Unfortunately, those are all non-public Java workloads.

On the C/C++ side, I don't have a ready environment in which to run anything representative. Given the semantic change, I wouldn't expect C++ to show much difference, and besides, this is fixing a long-standing, fairly major correctness issue. If you have particular suites you care about, please run them and share results.

At this point, I strongly lean towards committing and letting regressions be reported. We might revert, or we might simply fix forward, depending on what comes up.

nikic added a subscriber: aeubanks.Nov 13 2021, 8:55 AM

@nikic ping on previous question. It's been a month, and this has been LGTMed. Without response, I plan to land this.

Sorry, I did do some measurements but forgot to report back. The only run-time workload I can easily measure is rustc check builds, where I observed regressions ranging between -0.6% and +4.8% across different projects (the -0.6% "regressions" indicate an improvement, i.e. that some deref-based optimization happens to be non-profitable). I'm not sure that helps you much apart from saying that yes indeed, this does have some impact on Rust code. It's not catastrophic (though caveat emptor: the impact on "benchmarky" code may well be higher), but it's also avoidable.

This really needs to be properly benchmarked.

This has been benchmarked on every workload I care about, and shows no interesting regressions. Unfortunately, those are all non-public Java workloads.

I think something important to mention here is that the Java (i.e. GC'd language) case is also the only one where we don't really expect a regression, because it has good modeling under the proposed patch: GC'd pointers can't be freed during the optimization pipeline (until statepoint lowering), so they're simply not affected by this change. For that reason, I don't think the fact that Java workloads don't see regressions tells us anything about how this would behave for other frontends, which are mostly not GC'd.
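(A sketch, using the statepoint-example conventions: collected-heap pointers live in a distinct address space, and nothing in the IR can free them, so point-in-time and scoped semantics coincide until statepoint lowering:)

  define i32 @java_like(i32 addrspace(1)* dereferenceable(4) %obj)
      gc "statepoint-example" {
    %v = load i32, i32 addrspace(1)* %obj
    ret i32 %v
  }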

On the C/C++ side, I don't have a ready environment in which to run anything representative. Given the semantic change, I wouldn't expect C++ to show much difference, and besides, this is fixing a long-standing, fairly major correctness issue. If you have particular suites you care about, please run them and share results.

Maybe @asbirlea or @aeubanks can run some benchmarks? I would expect regressions for C++, because nofree inference is very limited (e.g. won't be inferred for pretty much any code using growing STL containers). Though at least in the C++ case regressions may be somewhat justified, in that this fixes a correctness issue in some cases.

At this point, I strongly lean towards committing and letting regressions be reported. We might revert, or we might simply fix forward, depending on what comes up.

I'm not sure what you're expecting from that. At least as far as Rust is concerned, the problem here seems pretty well understood to me: you are dropping (without replacement) the ability to specify that an argument is dereferenceable within a function. I'm perfectly happy with the change in default behavior; all I want is a way to get the old one back. I don't think that having an example of that in "real" code is going to add any useful information.
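(To illustrate the ask with existing attributes - roughly the shape rustc could emit for a shared-reference argument; the exact attribute set here is an assumption:)

  ; Parameter-level nofree + dereferenceable as the opt-in: "nobody may
  ; free this during the call, and it has 4 dereferenceable bytes".
  define i32 @use_ref(i32* noalias nocapture nofree dereferenceable(4) %x) {
    %v = load i32, i32* %x
    ret i32 %v
  }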

reames abandoned this revision.Dec 1 2021, 9:45 AM

I'm stopping work on this. It has already taken more work than is worthwhile for me, and it seems yet more is needed.

On @nikic's prompting, I finally went ahead, got the test-suite set up, and tested a clang version with and without the flag thrown. Unfortunately, the results were not pretty. We have significant regressions in some of the memset/memcpy tests. I did not dig into why in detail, but I suspect the lack of context sensitivity is biting us.

Between the measured regressions on both C++ and Rust, I don't think this can go in.

At this point, I've done everything I reasonably can to drive this to a conclusion. My actual motivation for this was a purely defensive effort to avoid regressing Java performance when this someday got fixed, and to make a good-faith effort to justify my objections to Johannes' original patches. That is complete.

Frankly, I think it's incredibly unfortunate that clang has an active miscompile and no one seems motivated to fix it after *years* of it being there. However, I have no commercial interest in clang, and the amount of work that seems to remain is well beyond anything I'm willing to do on a volunteer basis.

Let me summarize some ideas on future directions for the next poor person who stumbles into this rat's nest.

The approach taken in this round - inferring scoped dereferenceability from existing attributes, and strengthening that inference to cover practical cases - has been partially successful, but I no longer believe it can be pushed across the finish line. The problem here is not technical, but political. We appear to have unresolved disagreements about the semantics of attributes, and the review process towards resolving those disagreements touches many otherwise disjoint parts of the project. I would definitely not advise moving further in this direction unless you greatly enjoy herding cats.

We could implement a contextual dereferenceability analysis. This is useful to have no matter what, but requires extending the current must-execute logic and finding ways to make that information cheaply available through much of the pass pipeline. I have some ideas on that, and if someone wants to brainstorm, feel free to reach out. However, it needs to be said that it's unclear whether even a "perfect" version of this analysis is enough to recover the scoped facts in all cases. This is a fairly speculative approach, and it might not be enough.

The approach taken in D61652 - splitting the dereferenceability attribute into two - is a bit ugly. The objection to it in this round was mostly driven by the observation that the "alloc-size" attribute has the same semantic split over whether the implied dereferenceability is scoped. The good news is that the work done in this round was enough to cover the performance regressions from the "alloc-size" version, and at this point the only checked-in code for "alloc-size" uses the non-globally-dereferenceable semantics. (We had to, because it was actively miscompiling otherwise.)
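(For context, the situation referred to, sketched with the existing allocsize attribute; the callee name is hypothetical:)

  ; allocsize(0) says the return value points to as many dereferenceable
  ; bytes as the first argument specifies *at the point of return*;
  ; whether that fact persists for the rest of the scope is exactly the
  ; same scoped-vs-point question.
  declare i8* @my_alloc(i64) allocsize(0)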

Personally, if I were motivated to continue working on this, I'd probably resurrect D61652 and call it a day.

@nikic - Since you now have the sole remaining frontend for which dropping global deref is a performance regression without also being a correctness fix, any chance you're interested in driving this further?