This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Disable tail calls at -O0/-O1
Changes Planned · Public

Authored by aeubanks on Aug 24 2022, 4:41 PM.

Details

Summary

-O1 is designed to impact debuggability as little as possible and tail
calls hurt debuggability, so turn off tail calls at -O0/-O1.

Add a new pass that adds the "disable-tail-calls"="true" function
attribute, which disables generating (non-musttail) tail calls in that
function. Add this pass to the -O0/-O1 codegen pipeline.

Motivation: D130374 inferred more tail calls even in -O1, causing
various internal symbolizers to regress.
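For illustration, here is a minimal IR sketch (function names invented) of what a function looks like once it carries the attribute; "disable-tail-calls" is the existing string attribute that instruction selection already honors:

```llvm
; After the new pass, the function carries the attribute.
declare void @callee()

define void @caller() "disable-tail-calls"="true" {
  ; Even with the `tail` marker, ISel emits a regular call here, keeping
  ; @caller's frame on the stack (musttail would still be honored).
  tail call void @callee()
  ret void
}
```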

Diff Detail

Event Timeline

aeubanks created this revision. Aug 24 2022, 4:41 PM
Herald added a project: Restricted Project. Aug 24 2022, 4:41 PM
aeubanks requested review of this revision. Aug 24 2022, 4:41 PM

The goal clearly makes sense to me, and I think this is a reasonable way to do it. Another option is to implement this in the frontend the way that we implement optnone at -O0.

+@probinson @dblaikie, this is kind of -Og / -O1 related if you have thoughts on how that should be handled.

> The goal clearly makes sense to me, and I think this is a reasonable way to do it. Another option is to implement this in the frontend the way that we implement optnone at -O0.
>
> +@probinson @dblaikie, this is kind of -Og / -O1 related if you have thoughts on how that should be handled.

I think it'd be good to have some more words about how exactly this improves debuggability - because at least at a cursory reading I was/am confused - DWARFv5 has debug info to more accurately model tail calls so back traces can be provided. Does that not work in some cases? If this is about debuggability in the absence of debug info - I'm not sure I'm in support of this direction. Tail calls lose stack trace info in a similar way to inlining, and we don't disable inlining at -O1 - so is there some reasoning that differentiates the impact of these cases & justifies treating one differently than the other?

> causing various internal symbolizers to regress.

Regress in what way?

> DWARFv5 has debug info to more accurately model tail calls so back traces can be provided. Does that not work in some cases?

If you're three levels down in a tail-call chain, I can imagine it would take a lot of effort to reconstruct the call sequence, even with the tail-call info.

nikic added a subscriber: nikic. Aug 29 2022, 11:51 AM

Why does this need to use an attribute rather than querying optimization level in the backend?

> Why does this need to use an attribute rather than querying optimization level in the backend?

The attribute already exists and is used in several places. This patch just adds the attribute to more functions.

rnk added a comment. Aug 31 2022, 11:02 AM

> Why does this need to use an attribute rather than querying optimization level in the backend?

It's also generally consistent with how LLVM has handled O0, Os, and Oz to date, as attributes: optnone, minsize, optsize. That was in turn driven by a desire to allow mixing O[2zs0] objects during classic (fat?) LTO, and have things work as they did in traditional compilation.

> Why does this need to use an attribute rather than querying optimization level in the backend?
>
> The attribute already exists and is used in several places. This patch just adds the attribute to more functions.

Well, really, this is querying the optimization level in the backend; that's what the code in TargetPassConfig::addISelPrepare does. The reason it feels weird to me (and probably to @nikic) is that it's querying it in a slightly weird place. If we're going to check it in the backend anyway, we might as well just check it in the same place we'd check for the attribute, instead of adding the attribute, then checking for it a couple passes later.

rnk added a comment. Sep 6 2022, 12:51 PM

> Well, really, this is querying the optimization level in the backend; that's what the code in TargetPassConfig::addISelPrepare does. The reason it feels weird to me (and probably to @nikic) is that it's querying it in a slightly weird place. If we're going to check it in the backend anyway, we might as well just check it in the same place we'd check for the attribute, instead of adding the attribute, then checking for it a couple passes later.

I see, sorry, I didn't read the patch closely. I think we should set this in the frontend, or check the optimization level in the code generator. Using an extra pass seems heavyweight.

> Well, really, this is querying the optimization level in the backend; that's what the code in TargetPassConfig::addISelPrepare does. The reason it feels weird to me (and probably to @nikic) is that it's querying it in a slightly weird place. If we're going to check it in the backend anyway, we might as well just check it in the same place we'd check for the attribute, instead of adding the attribute, then checking for it a couple passes later.
>
> I see, sorry, I didn't read the patch closely. I think we should set this in the frontend, or check the optimization level in the code generator. Using an extra pass seems heavyweight.

I think we are better off not having two ways to spell "don't tail-call here" (attribute & opt-level) that have to be checked every place we might tail-call. So, if doing it in the pass is weird, it should be set in the frontend.

The contract we have with the -O1 pipeline is to attempt to generate debuggable code as much as possible. For that reason I don't think the frontend should be responsible for adding the attribute.

As for a separate pass adding the attribute vs checking the opt level in the various isels, I like a separate pass better because it's more explicit and less magical, plus I'd have to go and touch all the various isels. Touching all the isels isn't a huge deal but it makes it easier to regress with future changes ("forgot to check opt level on top of disable-tail-calls"), and I'd have to make sure to not miss anything in the first place. Having only one way for a backend to check for disabling tail calls seems nicer.
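As a rough illustration of the kind of pass being discussed, here is a sketch against the new pass manager. The class name is invented and this is not the patch's actual code (which targets the codegen pipeline); the "disable-tail-calls" string attribute is the existing one that ISel already checks:

```cpp
#include "llvm/IR/Function.h"
#include "llvm/IR/PassManager.h"

using namespace llvm;

// Sketch: mark a function so ISel will not emit (non-musttail) tail
// calls in it. Intended to be scheduled only at -O0/-O1.
struct DisableTailCallsSketchPass
    : PassInfoMixin<DisableTailCallsSketchPass> {
  PreservedAnalyses run(Function &F, FunctionAnalysisManager &) {
    // Don't clobber an explicit setting that is already present.
    if (!F.hasFnAttribute("disable-tail-calls"))
      F.addFnAttr("disable-tail-calls", "true");
    // Only an attribute is added; no analyses are invalidated.
    return PreservedAnalyses::all();
  }
};
```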

jrtc27 added a comment. Sep 7 2022, 2:02 PM

> The contract we have with the -O1 pipeline is to attempt to generate debuggable code as much as possible

No it's not. What we say is:

    :option:`-O0` Means "no optimization": this level compiles the fastest and
    generates the most debuggable code.

    :option:`-O1` Somewhere between :option:`-O0` and :option:`-O2`.

    :option:`-O2` Moderate level of optimization which enables most
    optimizations.

...


    :option:`-Og` Like :option:`-O1`. In future versions, this option might
    disable different optimizations in order to improve debuggability.

> The contract we have with the -O1 pipeline is to attempt to generate debuggable code as much as possible
>
> No it's not. What we say is:
>
>     :option:`-O0` Means "no optimization": this level compiles the fastest and
>     generates the most debuggable code.
>
>     :option:`-O1` Somewhere between :option:`-O0` and :option:`-O2`.
>
>     :option:`-O2` Moderate level of optimization which enables most
>     optimizations.
>
> ...
>
>     :option:`-Og` Like :option:`-O1`. In future versions, this option might
>     disable different optimizations in order to improve debuggability.

The -O1 pipeline has been tuned in the past to attempt to keep debuggability, and in OptimizationLevel.h we say

/// Optimize quickly without destroying debuggability.
///
/// This level is tuned to produce a result from the optimizer as quickly
/// as possible and to avoid destroying debuggability. This tends to result
/// in a very good development mode where the compiled code will be
/// immediately executed as part of testing. As a consequence, where
/// possible, we would like to produce efficient-to-execute code, but not
/// if it significantly slows down compilation or would prevent even basic
/// debugging of the resulting binary.
///
/// As an example, complex loop transformations such as versioning,
/// vectorization, or fusion don't make sense here due to the degree to
/// which the executed code differs from the source code, and the compile
/// time cost.
static const OptimizationLevel O1;

So I'd say that currently we do expect -O1 to generate debuggable code. If there's ever a push for a separate -Og pipeline that's fine, but we don't have that right now.

> For that reason I don't think the frontend should be responsible for adding the attribute.

Not sure I follow - the "optnone at -O0" is implemented in the frontend, seems suitable to implement this one the same way, especially if we already have the attribute and infrastructure for it?

> Why does this need to use an attribute rather than querying optimization level in the backend?
>
> The attribute already exists and is used in several places. This patch just adds the attribute to more functions.

This seems less magical and more explicit. I also didn't want to make sure I had found every place that could possibly check the attribute and have potential future changes miss disabling tail calls for -O1.

And +1 to rnk's comment, although we only LTO at an IR level, not at an MIR/object file level.

Still trying to understand exactly what's going on with the symbolizer (which is just llvm-symbolizer)

> For that reason I don't think the frontend should be responsible for adding the attribute.
>
> Not sure I follow - the "optnone at -O0" is implemented in the frontend, seems suitable to implement this one the same way, especially if we already have the attribute and infrastructure for it?

This actually came up in a talk with some people recently.
It would be interesting to have some sort of monolithic pipeline where we embed -O0/1/2/3/s/z attributes per-function in the frontend and don't have per-optimization-level pipelines. Currently we have function attributes for -O0/s/z but not the others. -O1 is fairly distinct from -O2/3 in that we want to preserve debuggability. However, we don't have an -O1 function attribute yet, so I don't think what you're proposing makes sense with the current state of things where we express our intent via the optimization level parameter to the optimization pipeline.

> It would be interesting to have some sort of monolithic pipeline where we embed -O0/1/2/3/s/z attributes per-function in the frontend and don't have per-optimization-level pipelines.

I would balk at this. A single generic pipeline at -O0 would be a compile-time performance hit, compared to the separate pipeline, and one of the explicit goals of -O0 is fast compilation. I'd expect a hit also at -O1 although probably less noticeable.
(You could experiment with this today, by compiling -emit-llvm -O0 to get IR with optnone everywhere, then compare compiling that IR at O0 vs O3.)
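That experiment could be sketched as something like the following (the exact commands are an assumption, not taken from the thread; note that clang's -O0 adds optnone to every function unless the cc1 flag -disable-O0-optnone is passed):

```shell
# Emit IR at -O0: every function gets the optnone attribute.
clang -O0 -emit-llvm -c input.c -o input.bc

# Compare the dedicated -O0 pipeline against the full -O3 pipeline
# running over the same optnone-everywhere IR.
time clang -O0 -c input.bc -o input-O0.o
time clang -O3 -c input.bc -o input-O3.o
```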

> For that reason I don't think the frontend should be responsible for adding the attribute.
>
> Not sure I follow - the "optnone at -O0" is implemented in the frontend, seems suitable to implement this one the same way, especially if we already have the attribute and infrastructure for it?
>
> This actually came up in a talk with some people recently.
> It would be interesting to have some sort of monolithic pipeline where we embed -O0/1/2/3/s/z attributes per-function in the frontend and don't have per-optimization-level pipelines. Currently we have function attributes for -O0/s/z but not the others. -O1 is fairly distinct from -O2/3 in that we want to preserve debuggability. However, we don't have an -O1 function attribute yet, so I don't think what you're proposing makes sense with the current state of things where we express our intent via the optimization level parameter to the optimization pipeline.

At least when I think optnone (or maybe optsize, I forget - I remember it was @chandlerc that was expressing these opinions most vocally) was proposed there was some distinction drawn between semantically meaningful differences (that "optimizing for size" and "not optimizing at all" were observable changes in behavior, in some sense - for debugging, for fitting in certain embedded devices, etc) and "varying shades of optimization" (essentially between -O2 and -O3) - and that the latter didn't seem suitable for per-function granularity expression (& so should be expressed only by the optimization pipeline, not embedded in the IR), but the former did.

As -O1 becomes/meanders near -Og, I guess we might end up with another of these "but this needs to be carried in the IR/is semantically meaningful to the end user" attributes ("optdebug"?) - but I don't know that we're there yet. I think if -O1/-Og's desire to be debuggable can be expressed by the frontend as much as possible with existing attributes in a similar way to -O0/optnone, that's a good thing - makes it more compatible with LTO, at least?

> It would be interesting to have some sort of monolithic pipeline where we embed -O0/1/2/3/s/z attributes per-function in the frontend and don't have per-optimization-level pipelines.
>
> I would balk at this. A single generic pipeline at -O0 would be a compile-time performance hit, compared to the separate pipeline, and one of the explicit goals of -O0 is fast compilation. I'd expect a hit also at -O1 although probably less noticeable.
> (You could experiment with this today, by compiling -emit-llvm -O0 to get IR with optnone everywhere, then compare compiling that IR at O0 vs O3.)

+1
I would be very much against a single pipeline that would run a bunch of passes that check an attribute to decide whether they should do anything. Sounds like it would also prevent building dynamic pass pipelines which ORC JIT currently allows the user to do IIRC.

aeubanks planned changes to this revision. Oct 3 2022, 10:09 PM

> At least when I think optnone (or maybe optsize, I forget - I remember it was @chandlerc that was expressing these opinions most vocally) was proposed there was some distinction drawn between semantically meaningful differences (that "optimizing for size" and "not optimizing at all" were observable changes in behavior, in some sense - for debugging, for fitting in certain embedded devices, etc) and "varying shades of optimization" (essentially between -O2 and -O3) - and that the latter didn't seem suitable for per-function granularity expression (& so should be expressed only by the optimization pipeline, not embedded in the IR), but the former did.

The talk I had was actually with Chandler, who liked the idea of a unified pipeline that looked at function attributes. (I remembered you bringing up before that Chandler was against this, and mentioned that to him, but he had no recollection of being against something like this.)

But anyway, we're getting off topic, I'm not actually proposing a unified optimization pipeline. I'll need to find time to understand the tail call vs inlining hurting stack traces issue.

> At least when I think optnone (or maybe optsize, I forget - I remember it was @chandlerc that was expressing these opinions most vocally) was proposed there was some distinction drawn between semantically meaningful differences (that "optimizing for size" and "not optimizing at all" were observable changes in behavior, in some sense - for debugging, for fitting in certain embedded devices, etc) and "varying shades of optimization" (essentially between -O2 and -O3) - and that the latter didn't seem suitable for per-function granularity expression (& so should be expressed only by the optimization pipeline, not embedded in the IR), but the former did.
>
> The talk I had was actually with Chandler who liked the idea of a unified pipeline that looked at function attributes. (I seemed to remember that you had brought up Chandler being against this before and mentioned it but he had no recollection of being against something like this).

Hmm - perhaps I'm misremembering and/or things just got jumbled up over the years and the people involved in different parts of the conversation, etc. :/

> But anyway, we're getting off topic, I'm not actually proposing a unified optimization pipeline. I'll need to find time to understand the tail call vs inlining hurting stack traces issue.

*nod* Sounds good - basically, I think the issue is that DWARFv5-or-gcc-extension call site descriptions could solve some amount of tail call lossage in backtraces, but wouldn't guarantee all tail calls could get reconstructed - but it'd be nice to see how much we could get covered with that debug info. Maybe it'd be "enough"?