This is an archive of the discontinued LLVM Phabricator instance.

LTO: Fix a potential race condition in the caching API.
ClosedPublic

Authored by pcc on Mar 16 2017, 5:28 PM.

Details

Summary

After the call to sys::fs::exists succeeds, indicating a cache hit, we call
AddFile and the client will open the file using the supplied path. If the
client is using cache pruning, there is a potential race between the pruner
and the client. To avoid this, change the caching API so that it provides
a MemoryBuffer to the client, and have clients use that MemoryBuffer where
possible.

This scheme won't work with the gold plugin because the plugin API expects a
file path. So we have the gold plugin use the buffer identifier as a path and
live with the race for now. (Note that the gold plugin isn't actually affected
by the problem at the moment because it doesn't support cache pruning.)

This effectively reverts r279883 modulo the change to use the existing path
in the gold plugin.

Event Timeline

pcc created this revision.Mar 16 2017, 5:28 PM
mehdi_amini accepted this revision.Mar 16 2017, 5:37 PM

LGTM. Thanks!

This revision is now accepted and ready to land.Mar 16 2017, 5:37 PM
mehdi_amini added inline comments.Mar 16 2017, 5:39 PM
llvm/tools/gold/gold-plugin.cpp
835

I may comment what's going-on here though, there is an implicit expectation that the API is always backed up by a file.

This revision was automatically updated to reflect the committed changes.
pcc marked an inline comment as done.

It looks like this commit broke the SystemZ build bot:

http://lab.llvm.org:8011/builders/clang-s390x-linux/builds/6102

  • TEST 'LLVM :: ThinLTO/X86/cache-config.ll' FAILED ****

LLVM ERROR: Failed to open cache file /home/uweigand/sandbox/buildbot/clang-s390x-linux/stage1/test/ThinLTO/X86/Output/cache-config.ll.tmp.cache/88CBD6AEFE21EF8594951733D6A16FDEBB8AE7E5: No such file or directory

Any suggestions what might go wrong here? Using strace it appears that the file indeed does not exist:
[pid 54322] open("/home/uweigand/sandbox/buildbot/clang-s390x-linux/stage1/test/ThinLTO/X86/Output/cache-config.ll.tmp.cache/88CBD6AEFE21EF8594951733D6A16FDEBB8AE7E5", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)

Not sure what the expected behavior should have been here ...

pcc added a comment.Mar 17 2017, 10:56 AM

Is there a rename syscall immediately before the open syscall? If that is the case, the fatal error is probably coming from Caching.cpp:73, which is unexpected because we just put a file in place with the correct name, so maybe something is going on with the filesystem on that machine?

The other alternative is that the error is coming from Caching.cpp:41, and in that case the file not existing is expected. If the file does not exist we are supposed to fail the if condition on line 40 and create a new cache entry.

The only explanation I can think of is that there is some sort of bug in your stdlib that causes the if condition to succeed. What happens if you apply this patch?

diff --git a/llvm/lib/LTO/Caching.cpp b/llvm/lib/LTO/Caching.cpp
index d8b91c48ee3..c24540e1860 100644
--- a/llvm/lib/LTO/Caching.cpp
+++ b/llvm/lib/LTO/Caching.cpp
@@ -37,7 +37,8 @@ Expected<NativeObjectCache> lto::localCache(StringRef CacheDirectoryPath,
       return AddStreamFn();
     }
 
-    if (MBOrErr.getError() != std::errc::no_such_file_or_directory)
+    if (MBOrErr.getError() != std::errc::no_such_file_or_directory &&
+        MBOrErr.getError().message() != "No such file or directory")
       report_fatal_error(Twine("Failed to open cache file ") + EntryPath +
                          ": " + MBOrErr.getError().message() + "\n");
In D31063#704025, @pcc wrote:

The only explanation I can think of is that there is some sort of bug in your stdlib that causes the if condition to succeed.

Indeed, this does appear to be the case. I can reproduce the problem with this simple program:

#include <iostream>
#include <system_error>

int main() {
    std::error_code code = std::error_code(ENOENT, std::generic_category());
    if (code != std::errc::no_such_file_or_directory)
      std::cout << "comparison failed" << std::endl;
}

This fails on the build bot machine, but succeeds on my other development machines. It appears the problem is related to the fact on this machine (running SLES12), the system compiler is a gcc 4.8, but the system libstdc++ library is one from gcc 6. This of course is supposed to be a compatible combination, but apparently there is a problem when comparing error codes. The value is the same, but the category class pointer doesn't match, apparently because in the gcc 6 libstdc++ there are two flavours of those classes, and when mixing gcc 4.8 and gcc 6 code, sometimes wrong versions get compared ...

What happens if you apply this patch?

diff --git a/llvm/lib/LTO/Caching.cpp b/llvm/lib/LTO/Caching.cpp
index d8b91c48ee3..c24540e1860 100644
--- a/llvm/lib/LTO/Caching.cpp
+++ b/llvm/lib/LTO/Caching.cpp
@@ -37,7 +37,8 @@ Expected<NativeObjectCache> lto::localCache(StringRef CacheDirectoryPath,
       return AddStreamFn();
     }
 
-    if (MBOrErr.getError() != std::errc::no_such_file_or_directory)
+    if (MBOrErr.getError() != std::errc::no_such_file_or_directory &&
+        MBOrErr.getError().message() != "No such file or directory")
       report_fatal_error(Twine("Failed to open cache file ") + EntryPath +
                          ": " + MBOrErr.getError().message() + "\n");

This does fix the problem for me. Likewise the (maybe more straightforward) fix

if ((std::errc)MBOrErr.getError().value() != std::errc::no_such_file_or_directory)
  ...

I'll work with the distro to hopefully get this fixed, but maybe we can get a workaround along those lines installed so the build bot can continue to run in the meantime?

pcc added a comment.Mar 17 2017, 3:01 PM

I'll work with the distro to hopefully get this fixed, but maybe we can get a workaround along those lines installed so the build bot can continue to run in the meantime?

I committed your proposed workaround in r298127.

I'm quite amazed how you guys narrowed this down so quickly! Well done...

This fails on the build bot machine, but succeeds on my other development machines. It appears the problem is related to the fact on this machine (running SLES12), the system compiler is a gcc 4.8, but the system libstdc++ library is one from gcc 6. This of course is supposed to be a compatible combination, but apparently there is a problem when comparing error codes. The value is the same, but the category class pointer doesn't match, apparently because in the gcc 6 libstdc++ there are two flavours of those classes, and when mixing gcc 4.8 and gcc 6 code, sometimes wrong versions get compared ...

I'll work with the distro to hopefully get this fixed, but maybe we can get a workaround along those lines installed so the build bot can continue to run in the meantime?

Unfortunately it looks like this is a known issue; there are some C++11 related incompatibilities between the gcc 4.8 and gcc 6 versions of libstdc++. Since C++11 support in GCC 4.8 was considered experimental to begin with, there are no plans for fixing this at this point.

I guess that means we're stuck with the workaround (the latest iteration using LLVM's errc works fine).