This is an archive of the discontinued LLVM Phabricator instance.

[C++20] [Modules] Offer -fno-import-inter-module-function-defs to avoid duplicated compilation in modules
Changes PlannedPublic

Authored by ChuanqiXu on Feb 27 2023, 12:26 AM.

Details

Summary

Close https://github.com/llvm/llvm-project/issues/61015
Close https://github.com/llvm/llvm-project/issues/60996

Clang will import the function definitions from other module unit within
optimizations by default to get the best performance. But there are
cases users prefer faster compilation speed than the best performance.

Although we may not agree such ideas, we should offer an option for the
users to give them the right to choose.

Diff Detail

Event Timeline

ChuanqiXu created this revision.Feb 27 2023, 12:26 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 27 2023, 12:26 AM
ChuanqiXu requested review of this revision.Feb 27 2023, 12:26 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 27 2023, 12:26 AM

Seem to recall @iains and others were concerned about the number of modules flags - this one I'd be inclined to suggest we shouldn't add if possible. If one way or the other is generally better - we should, at least initially, dictate that behavior entirely until someone can demonstrate divergent use cases that seem reasonable to support but must have different behavior here.

The performance of cross-module inlining could be achieved with something like ThinLTO if we were to lean in favor of not allowing cross-module inlining by default, for instance.

But if there are particular idioms where cross-module inlining is disadvantageous, perhaps we can implement better ways for clang to detect them (or if they're undetectable, offer users a way to annotate their code, maybe).

Could we key off -Os or -Oz or some other existing optimization/compile time tradeoff flags instead of introducing a new flag?

Seem to recall @iains and others were concerned about the number of modules flags - this one I'd be inclined to suggest we shouldn't add if possible. If one way or the other is generally better - we should, at least initially, dictate that behavior entirely until someone can demonstrate divergent use cases that seem reasonable to support but must have different behavior here.

Agreed, [IMO, at least] we have a very large set of modules options, and expanding that should only be done if no sensible alternative can be found (we are already seeing users getting confused about which ones apply in each case).

A second point here - that (once we have the ability to generate object and PCM in the same compilation) that we will move to elide the function bodies for non-inline and non-dependent cases from the PCM, so that this problem will magically "go away" (to restore the current behaviour, we'd then be using some optimisation control to avoid the elision, I suppose).

The performance of cross-module inlining could be achieved with something like ThinLTO if we were to lean in favor of not allowing cross-module inlining by default, for instance.

+1 this seems exactly what LTO is intended for (also there are folks who seem to have an expectation that the BMI somehow magically contains an optimised representation of the source - which again is the province of LTO).

But if there are particular idioms where cross-module inlining is disadvantageous, perhaps we can implement better ways for clang to detect them (or if they're undetectable, offer users a way to annotate their code, maybe).

I'd be interested to see an example where cross-module function body imports somehow beats the equivalent LTO.

Could we key off -Os or -Oz or some other existing optimization/compile time tradeoff flags instead of introducing a new flag?

That's an interesting idea .. I'd suggest we should default the behaviour to "off" (so that compilation speed is prioritised for default options, as usual) and then maybe enable import of function bodie at O3 or so? [maybe even an optimisation representative of LTO .. so that when we slim the BMIs down we do not get complaints about regression in code performance].

Seem to recall @iains and others were concerned about the number of modules flags - this one I'd be inclined to suggest we shouldn't add if possible. If one way or the other is generally better - we should, at least initially, dictate that behavior entirely until someone can demonstrate divergent use cases that seem reasonable to support but must have different behavior here.

Agreed, [IMO, at least] we have a very large set of modules options, and expanding that should only be done if no sensible alternative can be found (we are already seeing users getting confused about which ones apply in each case).

A second point here - that (once we have the ability to generate object and PCM in the same compilation) that we will move to elide the function bodies for non-inline and non-dependent cases from the PCM, so that this problem will magically "go away" (to restore the current behaviour, we'd then be using some optimisation control to avoid the elision, I suppose).

Yeah, this seems like the simplest concrete point to suggest we just change the defaults here - that's the plan moving forward, and I don't think it's worth trying to maintain two codepaths/supports here - people can write the functions as explicitly inline when they want cross-module inlining & we can let non-inline functions be the same as they were before modules.

If, years from now, someone wants to prototype some stronger optimization opportunities in modules - let's do that later/then?

The performance of cross-module inlining could be achieved with something like ThinLTO if we were to lean in favor of not allowing cross-module inlining by default, for instance.

+1 this seems exactly what LTO is intended for (also there are folks who seem to have an expectation that the BMI somehow magically contains an optimised representation of the source - which again is the province of LTO).

There's some advantages (build system complexity, time, etc) to doing it at compile-time, rather than link-time, but I don't think we have enough data to be worth the complexity for now - I'd vote for letting the feature go/removing it for now, and folks can try to bring it back with data/measurements later.

ChuanqiXu abandoned this revision.Mar 14 2023, 7:51 PM

Got your points. Let's postpone this one.

But I want to emphasize that this patch (and the thin PCM) will decrease the performance. While LTO can save the regression, LTO is not widely used. (ThinLTO can only mitigate this.) I mean the default behavior shouldn't cause performance regression. We can offer a new alternative for the users but we can't enable that by default simply now.

ChuanqiXu reclaimed this revision.Mar 14 2023, 8:08 PM
ChuanqiXu planned changes to this revision.

It should be "Plan Changed" instead of "Abandoned".

Got your points. Let's postpone this one.

But I want to emphasize that this patch (and the thin PCM) will decrease the performance. While LTO can save the regression, LTO is not widely used. (ThinLTO can only mitigate this.) I mean the default behavior shouldn't cause performance regression. We can offer a new alternative for the users but we can't enable that by default simply now.

Agree that we (ideally, at least) should not decrease performance with a new feature.
However, "performance" also includes compilation speed in the 'no optimisation, debug' case - that is also considered very important. So, perhaps, the short-term approach should be (as @dblaikie suggested) to include the bodies for -O >= 3?

However, "performance" also includes compilation speed in the 'no optimisation, debug' case - that is also considered very important. So, perhaps, the short-term approach should be (as @dblaikie suggested) to include the bodies for -O >= 3?

I don't think so. I think "performance" refers to the runtime performance generally. I don't believe the normal users will be happy if modules will decrease the performance of their program in any means. So I think we should include the bodies by default.

I don't think of this as a performance regression for users though - this functionality's never really "shipped" so we get to choose what the baseline is.

And I think a reasonable baseline to compare to isn't this implementation we don't think is ideal (because of the build invalidation issues, if nothing else, caused by having thick PCMs) - I think the baseline is what a users non-modular code is. And in non-modular code these non-inline functions would be in the implementation files, not able to cross-TU inline without LTO.

I think not providing definitions of non-inline functions for cross-TU optimizations is not a regression, but exactly in-line with existing non-modular behavior, which is totally fine.

iains added a comment.EditedMar 15 2023, 2:15 PM

However, "performance" also includes compilation speed in the 'no optimisation, debug' case - that is also considered very important. So, perhaps, the short-term approach should be (as @dblaikie suggested) to include the bodies for -O >= 3?

I don't think so. I think "performance" refers to the runtime performance generally. I don't believe the normal users will be happy if modules will decrease the performance of their program in any means. So I think we should include the bodies by default.

I think I must be misunderstanding something here.

The default for clang is to compile without optimisation - this benefits the compile-edit-debug cycle, by providing output that is closest to the original source, and quickest to compile.

The user should not be expecting any optimisations to be applied unless they supply -ON (n fact, they might well complain if we optimise something that makes debugging harder).

So, we should try to ensure that adding modules supports that model - and provides the quickest and closest to the original sources for the default options. If the user wants better optimisation (at the expense of longer compile times), then they provide -ON, right?

I don't think of this as a performance regression for users though - this functionality's never really "shipped" so we get to choose what the baseline is.

And I think a reasonable baseline to compare to isn't this implementation we don't think is ideal (because of the build invalidation issues, if nothing else, caused by having thick PCMs) - I think the baseline is what a users non-modular code is. And in non-modular code these non-inline functions would be in the implementation files, not able to cross-TU inline without LTO.

I think not providing definitions of non-inline functions for cross-TU optimizations is not a regression, but exactly in-line with existing non-modular behavior, which is totally fine.

Yeah, I refer "regression" to compare the performance of modular codes with the non modular codes. Since no users will be happy if they found the performance goes down after they convert their existing library to modules.

Then the problem is that the implicitly inline functions in headers may be non-inline functions in modules. And we made some simple experiments for the patch, we observed performance regression indeed.

However, "performance" also includes compilation speed in the 'no optimisation, debug' case - that is also considered very important. So, perhaps, the short-term approach should be (as @dblaikie suggested) to include the bodies for -O >= 3?

I don't think so. I think "performance" refers to the runtime performance generally. I don't believe the normal users will be happy if modules will decrease the performance of their program in any means. So I think we should include the bodies by default.

I think I must be misunderstanding something here.

The default for clang is to compile without optimisation - this benefits the compile-edit-debug cycle, by providing output that is closest to the original source, and quickest to compile.

The user should not be expecting any optimisations to be applied unless they supply -ON (n fact, they might well complain if we optimise something that makes debugging harder).

So, we should try to ensure that adding modules supports that model - and provides the quickest and closest to the original sources for the default options. If the user wants better optimisation (at the expense of longer compile times), then they provide -ON, right?

Oh, we are talking the case that the users will use -ON as default. Since every C++ programs will use -ON in practice.

And I am still wondering how thin the thin BMI will be by removing the non-inline function bodies. For example, we'll still need to contain the function bodies for templates, right? So I guess if we want a thin BMI, we need more refactoring to the current pcm format, which requires more working.