This is an archive of the discontinued LLVM Phabricator instance.

[Support][CachePruning] Fix regression that prevents disabling of pruning
ClosedPublic

Authored by bd1976llvm on Dec 14 2017, 5:21 AM.

Details

Summary

Regression introduced by: rL284966 (see: https://reviews.llvm.org/D25730).

Previously, Interval was unsigned (see: CachePruning.h). Replacing the type with std::chrono::seconds (which is signed) caused a behavioural regression, because the C API intends negative values to translate to large positive intervals, *effectively* disabling pruning (see the comments on setCachePruningInterval()).
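For illustration, a minimal standalone sketch (not the patch itself; the variable names are hypothetical) of the behavioural difference:

#include <chrono>
#include <iostream>

int main() {
  int Requested = -1; // the C API documents negative values as "maximum interval"

  // Old behaviour: Interval was unsigned, so -1 wrapped around to
  // 4294967295 seconds (~136 years), effectively disabling pruning.
  unsigned OldInterval = static_cast<unsigned>(Requested);

  // After the switch to the signed std::chrono::seconds, the negative
  // value survives as-is, so the "has the interval elapsed?" check
  // passes every time instead of essentially never.
  std::chrono::seconds NewInterval(Requested);

  std::cout << "old: " << OldInterval << "s, new: " << NewInterval.count() << "s\n";
}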

Diff Detail

Event Timeline

bd1976llvm created this revision.Dec 14 2017, 5:21 AM

The max() value will overflow if you convert it to a more precise type (e.g., milliseconds). This can easily happen without you realising it (I haven't checked whether that happens here, but even if it does not, it could be introduced in the future).
I'd consider having a more explicit way of specifying that pruning is disabled (for example, in lldb we use Optional<Duration> for various timeouts, where llvm::None means an infinite timeout).
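To make the overflow concern concrete, a small illustrative sketch (the comments assume a typical 64-bit duration representation):

#include <chrono>
#include <iostream>

int main() {
  using namespace std::chrono;
  seconds Max = seconds::max(); // 2^63 - 1 ticks with a 64-bit rep

  // seconds converts to milliseconds implicitly (no precision is lost),
  // but the conversion multiplies the tick count by 1000, which for
  // seconds::max() overflows the signed 64-bit representation
  // (formally undefined behaviour, in practice a garbage value).
  milliseconds AsMs = Max;
  std::cout << Max.count() << " s -> " << AsMs.count() << " ms\n";
}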

I like your proposal.

However, as the cache code is shared between multiple APIs and only the C API has this disabling behaviour for negative values, a fix here is more appropriate.

I have updated the patch to be more defensive; hopefully this addresses your concerns.

I removed the comment change.

I noticed that a number of comments needed updating and have moved all the comment changes here: https://reviews.llvm.org/D41279.

This patch just fixes the regression.

I like your proposal.

However, as the cache code is shared between multiple APIs and only the C API has this disabling behaviour for negative values, a fix here is more appropriate.

I don't quite agree with that conclusion -- the fact that we have one upper layer that supports feature X means that the lower layer needs to support it as well (the other front-ends can just choose not to use it). If one front-end relies on an undocumented quirk of a back-end to implement its feature (which the implicit signed-to-unsigned conversion was, and which is probably the reason I did not notice this when doing the conversion), then you get problems.

That said, I don't think it's my place to ask you to do this extra work, and I agree that we should fix the regression. I just felt I should point out a potential issue with this approach, but I am leaving the final decision in the hands of the respective owner(s).

I have updated the patch to be more defensive; hopefully this addresses your concerns.

Somewhat, but the conversion here was not my primary concern. I am more worried about some code down the line writing:
auto deadline = system_clock::now() /*nanoseconds*/ + CacheOptions.Policy.Interval /*implicitly converted to nanoseconds*/;
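A short sketch of why that line is dangerous, assuming an implementation where system_clock::duration is nanoseconds (as in libstdc++):

#include <chrono>
using namespace std::chrono;

int main() {
  seconds Interval = seconds::max(); // the "effectively disabled" interval

  // The sum's type is the finer of the two durations (nanoseconds here),
  // so Interval's tick count is multiplied by 10^9 before the addition,
  // overflowing long before a meaningful deadline is computed.
  auto Deadline = system_clock::now() + Interval;
  (void)Deadline;
}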

I like your proposal.

However, as the cache code is shared between multiple APIs and only the C API has this disabling behaviour for negative values, a fix here is more appropriate.

I don't quite agree with that conclusion -- the fact that we have one upper layer that supports feature X means that the lower layer needs to support it as well (the other front-ends can just choose not to use it).

Sorry, I shouldn't have used the word "disabling"; it has caused confusion. The intent of negative values in the C API was always simply to set the interval to the maximum, not to disable pruning (although practically there is no difference).
That is why I don't want to change the backend. None of the APIs, including the C API, support disabling, so we don't need to implement it.

I don't quite agree with that conclusion -- the fact that we have one upper layer that supports feature X means that the lower layer needs to support it as well (the other front-ends can just choose not to use it). If one front-end relies on an undocumented quirk of a back-end to implement its feature (which the implicit signed-to-unsigned conversion was, and which is probably the reason I did not notice this when doing the conversion), then you get problems.

The reliance on signed-to-unsigned conversion is an implementation detail used to achieve the documented effects.
I think that the reason you didn't spot this is that this area of code is untested. We have exercised it as part of our release process, which is why I am only flagging these problems now.

That said, I don't think it's my place to ask you to do this extra work, and I agree that we should fix the regression. I just felt I should point out a potential issue with this approach, but I am leaving the final decision in the hands of the respective owner(s).

Would you be happy to accept this patch to fix the regression? I will then author a further patch to address your concerns in the backend.

I have updated the patch to be more defensive; hopefully this addresses your concerns.

Somewhat, but the conversion here was not my primary concern. I am more worried about some code down the line writing:
auto deadline = system_clock::now() /*nanoseconds*/ + CacheOptions.Policy.Interval /*implicitly converted to nanoseconds*/;

I agree that this is a concern - thanks for pointing it out - however, your suggested fix does not prevent the problem here, as the APIs all allow setting values for Interval that will overflow.
Overflow was also a concern before the change to use std::chrono.
I think that any additions to the backend should be made safer. However, I would like to do this in another change, as it is not directly related to this regression.

Add a test? You can use llvm-lto to trigger this legacy interface - although it looks like no support was added to llvm-lto to invoke this interface. It should be straightforward to add that (see where it calls setCacheDir).

Add a test? You can use llvm-lto to trigger this legacy interface - although it looks like no support was added to llvm-lto to invoke this interface. It should be straightforward to add that (see where it calls setCacheDir).

I will attempt to do this.

Are you aware of any users of the legacy C++ API - i.e. the one exposed by llvm-lto?

Add a test? You can use llvm-lto to trigger this legacy interface - although it looks like no support was added to llvm-lto to invoke this interface. It should be straightforward to add that (see where it calls setCacheDir).

I will attempt to do this.

Are you aware of any users of the legacy C++ API - i.e. the one exposed by llvm-lto?

Yes: ld64, as well as some proprietary linkers I have heard about anecdotally.

Thanks for the great review comments.

@tejohnson:
I didn't realize that it was easy to write a test for this via llvm-lto (i.e. I hadn't noticed that the C/C++ legacy APIs share a backend!); otherwise I would have added a test without prompting.

@labath:
Absolutely correct about the danger of overflow with chrono duration types. In fact, on Windows we do actually get this overflow, because there the modification time for files is in nanoseconds, as opposed to seconds on e.g. Linux (see: getLastModificationTime). I have attempted to make the code safer at the cost of half a second of precision.
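A sketch of the defensive pattern being described - truncating every time value to whole seconds before doing arithmetic, so that a seconds::max() interval is never widened to a finer unit (names hypothetical, not the actual CachePruning.cpp code):

#include <chrono>
using namespace std::chrono;

// Truncate a (possibly nanosecond-resolution) time_point to whole seconds.
template <typename Clock, typename Dur>
static seconds toSeconds(time_point<Clock, Dur> TP) {
  return duration_cast<seconds>(TP.time_since_epoch());
}

static bool intervalExpired(system_clock::time_point Now,
                            system_clock::time_point LastPrune,
                            seconds Interval) {
  // Both operands are already whole seconds, so Interval is never
  // converted to nanoseconds and seconds::max() cannot overflow here.
  return toSeconds(Now) - toSeconds(LastPrune) >= Interval;
}

int main() {
  auto Now = system_clock::now();
  return intervalExpired(Now, Now, seconds::max()) ? 1 : 0;
}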

Explicitly cast system_clock::now type as well.

Just a couple of minor comments - thanks for adding the test. With those fixed it looks good from my perspective, but please wait for labath to sign off.

include/llvm/LTO/legacy/ThinLTOCodeGenerator.h
159

The indentation looks a little funky to me here - can you run it through clang-format?

tools/lto/lto.cpp
576 ↗(On Diff #127377)

extra whitespace

I think that at this point you are already "implementing" disabling of pruning, only it's done via ugly decltype tricks. Have you looked at how hard it would be to just make the interval an llvm::Optional<Duration>? It would make things much cleaner and self-documenting.

If you don't want to embark on that, then maybe you could just set the interval to std::chrono::*nanoseconds*::max(). That way you won't get an overflow unless someone tries to convert this to picoseconds or something, and you will still have a pruning interval of several hundred years.
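For reference, a sketch of the Optional-based design, using std::optional as a self-contained stand-in for llvm::Optional and a simplified stand-in for the real policy struct:

#include <chrono>
#include <optional>

struct PruningPolicy {
  // An empty optional means "interval-based pruning disabled" - no
  // sentinel values and no overflow-prone max() arithmetic needed.
  std::optional<std::chrono::seconds> Interval = std::chrono::seconds(1200);
};

static bool shouldPrune(const PruningPolicy &Policy,
                        std::chrono::seconds SinceLastPrune) {
  if (!Policy.Interval)
    return false; // pruning disabled
  return SinceLastPrune >= *Policy.Interval;
}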

labath accepted this revision.Dec 18 2017, 9:57 AM

If Teresa is fine with this, then I don't want to hold this up. However, I would recommend looking at other ways to implement this behavior in the long term.

This revision is now accepted and ready to land.Dec 18 2017, 9:57 AM
kromanova added inline comments.Dec 19 2017, 3:14 AM
include/llvm/LTO/legacy/ThinLTOCodeGenerator.h
154–157

Ben, how does this assert avoid the risk of overflow? I thought this check is only there to make sure that CacheOptions.Policy.Interval has type std::chrono::seconds.

"Interval", which is a member of struct CachePruningPolicy, has type std::chrono::seconds. If we write something to Interval, the value will be implicitly converted to std::chrono::seconds, so the assert will never be triggered. Am I wrong?

Or are you adding this assert simply to prevent someone from changing the type of Interval in CachePruningPolicy without paying attention to this function? In that case, I am still confused about the "risk of overflow" comment.

lib/Support/CachePruning.cpp
158 ↗(On Diff #127377)

Ben,
In this conversion (and in the one below) you explicitly convert std::chrono::nanoseconds (the type of system_clock::now()) to std::chrono::seconds (the type of Policy.Interval), and as a result you have CurrentTime in seconds. You do a similar explicit conversion to get TimeStampModTime (line 172) and TimeStampAge (line 174) in seconds, in order to avoid the overflow on line 175, where Policy.Interval could overflow when it holds the max value, since "max" seconds would be converted to nanoseconds for the comparison. Is my understanding of your "overflow prevention" right?

I personally think that these conversions make the code harder to read, especially since a few lines later there is a similar time value which, because there is no danger of overflow there, you didn't convert:
const auto FileAccessTime = StatusOrErr->getLastAccessedTime();
These inconsistencies and tricks with conversions to avoid overflow make the code harder to understand.

Have you considered Pavel's suggestion about setting the interval to std::chrono::*nanoseconds*::max()? It's plenty sufficient, and we won't need explicit conversions.

kromanova added inline comments.Dec 19 2017, 3:46 AM
lib/Support/CachePruning.cpp
158 ↗(On Diff #127377)

Or maybe, instead of using max(), you could define your own constant (in seconds) that is large enough to ensure that pruning won't happen in the next century, but won't overflow when converted to nanoseconds? (See the sketch after this comment.)

Or (I know already that I will be hanged for this suggestion) add a Boolean member to the CachePruningPolicy struct that is set to false if we don't ever want to run the pruner.
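A sketch of the first alternative - a constant large enough that pruning never fires, yet small enough that converting it to nanoseconds stays within a 64-bit tick count (the exact value is hypothetical):

#include <chrono>

// ~100 years expressed in seconds: about 3.2e9 s, i.e. about 3.2e18 ns,
// comfortably below the ~9.2e18 limit of a signed 64-bit tick count.
constexpr std::chrono::seconds EffectivelyNever =
    std::chrono::hours(24 * 365 * 100);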

test/ThinLTO/X86/cache.ll
38 ↗(On Diff #127377)

This is not directly related to Ben's review; I was just wondering why, for llvm-lto2, we don't test here that the pruner only removes files matching the pattern "llvmcache-*", as we do in the check just above with llvm-lto?

This revision was automatically updated to reflect the committed changes.

If Teresa is fine with this, then I don't want to hold this up. However, I would recommend looking at other ways to implement this behavior in the long term.

I have had another look at this and the approach you suggested is definitely nicer: https://reviews.llvm.org/D41497

Many thanks for the review - sorry it took a bit of time to rework this.

bd1976llvm added inline comments.Dec 21 2017, 10:35 AM
include/llvm/LTO/legacy/ThinLTOCodeGenerator.h
154–157

The risk of overflow comes later on, if someone does something like:

auto deadline = system_clock::now() /*nanoseconds*/ + CacheOptions.Policy.Interval /*implicitly converted to nanoseconds*/;

By asserting that the type is seconds here and using decltype elsewhere, I was trying to force every chrono type into seconds to avoid the risk of overflow.
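A hypothetical sketch of that pattern (simplified, not the exact code from the patch):

#include <chrono>
#include <type_traits>

struct CachePruningPolicy {
  std::chrono::seconds Interval{1200};
};

// Fail loudly if anyone later changes Interval to a finer-grained type;
// arithmetic elsewhere is written against decltype(Interval), so every
// computation stays in seconds and max() is never widened.
static_assert(std::is_same<decltype(CachePruningPolicy::Interval),
                           std::chrono::seconds>::value,
              "Interval must stay in seconds to avoid overflow");

int main() { return 0; }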

lib/Support/CachePruning.cpp
158 ↗(On Diff #127377)

Accepted. I have now reworked the patch.

test/ThinLTO/X86/cache.ll
38 ↗(On Diff #127377)

llvm-lto2 does not expose the pruning API. Instead, this is checked in the linker-specific (gold, lld) system tests for ThinLTO.