This is an archive of the discontinued LLVM Phabricator instance.

Use file locks to synchronize access to ModuleCache.
ClosedPublic

Authored by ovyalov on Apr 16 2015, 11:46 AM.

Details

Reviewers
zturner
clayborg
Summary

Concurrent remote test runs may attempt to read and write from module cache simultaneously.
This CL intends to fix such races by using file-based locks - module id is used as a lock file name providing synchronization on per-module basis.

Diff Detail

Event Timeline

ovyalov updated this revision to Diff 23869.Apr 16 2015, 11:46 AM
ovyalov retitled this revision from to Use file-based locks to synchronize access to ModuleCache..
ovyalov updated this object.
ovyalov edited the test plan for this revision. (Show Details)
ovyalov added reviewers: clayborg, zturner.
ovyalov added a subscriber: Unknown Object (MLST).
zturner edited edge metadata.Apr 16 2015, 11:56 AM

Would a mutex not work? At least on Windows we can have mutexes that can synchronize among different processes. LockFile will lock a file for exclusive access by a process, but if another process fails to acquire the lock, we won't be able to distinguish the failure from an a different kind of error, like an I/O error.

Also I've never really liked the concept of lock files to begin with, because it requires a cleanup step which won't happen if the program crashes. If there's a way to make this work with a true synchronization primitive I think that would be better. On Windows you do this by just calling CreateMutex and passing a non-null value for the name. Not sure on other platforms.

Would a mutex not work? At least on Windows we can have mutexes that can synchronize among different processes. LockFile will lock a file for exclusive access by a process, but if another process fails to acquire the lock, we won't be able to distinguish the failure from an a different kind of error, like an I/O error.

Also I've never really liked the concept of lock files to begin with, because it requires a cleanup step which won't happen if the program crashes. If there's a way to make this work with a true synchronization primitive I think that would be better. On Windows you do this by just calling CreateMutex and passing a non-null value for the name. Not sure on other platforms.

Thanks for suggestion - mutex sounds like a good alternative. On the other hand, AFAIK file locks are capable to handle program crash properly and to revoke ownership from terminated process.
There are POSIX named semaphores which might be used as system-wide mutexes but I don't know for sure how widely they are supported on all platforms - file-based locks seems to be pretty widely adopted.

If you feel strongly against using file locks on Windows I can rewrite CL with mutexes - otherwise, we can do it iteratively:

  1. Make initial version of this CL based on file locks to enable concurrent test runs.
  2. I will migrate solution to use mutex soon (within a month) .

If it's able to handle process crashing without leaving lock files around, maybe it's fine as is. I've added some additional suggestions. I think there's still one more issue that stands out, which is that it's probably better to use the existing file hadnle that is already contained in the Module rather than calling CreateFile(..OPEN_EXISTING..) but I think that will be difficult to fix, as it's not easy to access the HANDLE being used internally by the Module. I don't think it will be broken as is, so I'm mostly just adding this as a note. Eventually one of us on the Windows side can fix this when it becomes an issue (although you're more than welcome to try if you want). I think it's the DataBuffer in the ObjectFile that has a file handle open already, because it's mmap'ed part of the object file. So the best fix would be to somehow get that HANDLE that was used to create the memory map, and use that for the call to LockFileEx.

source/Host/windows/LockFileWindows.cpp
28–30

You can get by without creating an event handle at all if you use GetOverlappedResult() instead of WaitForSingleObject. I slightly prefer that to using the event handle, especially if this is going to be happening very frequently. If you decide to do that, just leave the hEvent initialized to 0 and change the WaitForSingleObject to GetOverlappedResult.

54

Remove this call to _close, as as m_file should refer to the original module file, not a separate "lock file" (see later where I elaborate on this).

92–94

Same as before, we can remove the event handle and just use GetOverlappedResult

source/Utility/ModuleCache.cpp
151–154

Per the previous comments, can move all of this code into LockFilePosix.

source/Utility/ModuleCache.h
76–77

So the way locking a file works on windows, is that if you want to lock "foo.exe", you get a handle to foo.exe and you lock that. You don't make ".lock/<lock file name>" and then lock that, which is what it looks like you're doing. Because if you do that, you've got a totally new file, and then the ranges that you're passing to LockFileEx and UnlockFileEx don't make sense in the context of this new file.

Does this work on Linux? How does it know that even though your fd refers to .lock/<lock file name>, the range you're specifying refers to the original file?

In any case, maybe the best thing to do is have all this logic about building up the lock file name be internal to the LockFilePosix / LockFileWindows implementation. Somehow you will have to pass a HANDLE to the actual module file. If there's already a HANDLE deep down inside of the ModuleSP somewhere, best to use that. Otherwise you can probably just call CreateFile(... OPEN_EXISTING ...) on the module spec. But all of this can be internal to the FileLock class. Just have a constructor that takes a FileSpec to the thing you want to lock. On the linux side do the subdirectory thing, and on the Windows side just call CreateFile() on that path directly.

zturner added inline comments.Apr 16 2015, 1:27 PM
source/Host/windows/LockFileWindows.cpp
54

Sorry, a little clarification on this. If you call CreateFile(...OPEN_EXISTING...) then change this to CloseHandle(). If you end up figuring out a way to get the existing handle that the Module is using, then don't call CloseHandle when you release the lock.

clayborg requested changes to this revision.Apr 16 2015, 3:30 PM
clayborg edited edge metadata.

This doesn't seem to take care of the case:

process 1: calls ModuleCache::GetAndPut() on line 134 and then gets a path to the file lock file and creates and locks it
process 1: Starts getting the file and writing it to disk by creating it and writes a half of the file's bytes to the cache location "/tmp/usr/lib/libfoo.so"
process 2: calls ModuleCache::GetAndPut() on line 134 and notices there is a "/tmp/usr/lib/libfoo.so" in the cache and it returns the incomplete copy of the file

Don't we need to take the lock before the first call to "auto error = Get (root_dir_spec" on line 134?

If this is an issue we could read to a temp file first ("/tmp/usr/lib/libfoo.so.<UUID>") and then move it back to "/tmp/usr/lib/libfoo.so" atomically after we are done writing the file contents, and then unlock the file lock. This would prevent people from getting an incomplete copy in process 2?

This revision now requires changes to proceed.Apr 16 2015, 3:30 PM

This doesn't seem to take care of the case:

process 1: calls ModuleCache::GetAndPut() on line 134 and then gets a path to the file lock file and creates and locks it
process 1: Starts getting the file and writing it to disk by creating it and writes a half of the file's bytes to the cache location "/tmp/usr/lib/libfoo.so"
process 2: calls ModuleCache::GetAndPut() on line 134 and notices there is a "/tmp/usr/lib/libfoo.so" in the cache and it returns the incomplete copy of the file

Don't we need to take the lock before the first call to "auto error = Get (root_dir_spec" on line 134?

If this is an issue we could read to a temp file first ("/tmp/usr/lib/libfoo.so.<UUID>") and then move it back to "/tmp/usr/lib/libfoo.so" atomically after we are done writing the file contents, and then unlock the file lock. This would prevent people from getting an incomplete copy in process 2?

Thanks for catching this - I missed some race conditions:(
Since there is additional race maybe introduced by lock file removal (when one process is about to delete lock file and another process is going to open it) I decided to follow Zach advise and use mutexes.

ovyalov updated this revision to Diff 23940.Apr 17 2015, 9:53 AM
ovyalov retitled this revision from Use file-based locks to synchronize access to ModuleCache. to Use named mutexes to synchronize access to ModuleCache..
ovyalov edited edge metadata.
  1. Converted solution to use named mutexes instead of file locks.
  2. Fixed race condition in GetAndPut by moving lock up to the beginning of the method.
zturner added inline comments.Apr 17 2015, 10:10 AM
include/lldb/Host/Mutex.h
36 ↗(On Diff #23940)

I think we should change from returning int to either returning or taking by reference an lldb_private::Error. The same applies for the rest of the functions.

The same applies for all the rest of the functions. I know you are just copying the old API format, but maybe if it's not too much additional churn this would be a good time to convert this API to use the Error class.

50 ↗(On Diff #23940)

If we change from int to lldb_private::Error, then this failure_message can go away. It's kind of a bad idea to have default parameters on virtual methods anwyay, because depending on who derives from it, you might get different default values depending on what type of pointer you call through, which is confusing.

260–261 ↗(On Diff #23940)

Just curious, what is the purpose of this Global prefix? Windows doesn't require it, does posix require this "/" at the beginning? It would be nice if the names were at least somewhat consistent. Also it would be good if the mutexes could be "namespaced" so to speak, so that if you run some shell command or tool to display all the mutexes on the system, we can easily identify an LLDB mutex from some other mutex. With that in mind, what about a template like "lldb-$pid-$mutex_name

source/Host/windows/Mutex.cpp
118–121 ↗(On Diff #23940)

HANDLE is already a typedef of void*, is this necessary?

137 ↗(On Diff #23940)

Might as well use nullptr instead of NULL for consistency.

ovyalov added inline comments.Apr 17 2015, 2:53 PM
include/lldb/Host/Mutex.h
36 ↗(On Diff #23940)

I support this idea of using Error instead of int but it seems too many things will be involved in change - ModuleList, Debugger, ConnectionFileDescriptor, GDBRemote* classes,...
Let's make it as a separate redesign CL.

260–261 ↗(On Diff #23940)

Global means that mutex will be visible within all user's Terminal sessions - it's rather optional thing.

I agree that templated name convention makes sense but I'm not sure about $pid - mutex/semaphore ownership is transferred from process to process and with such naming scheme $pid will reflect the pid of creator which might be already dead.
With $pid as a part of mutex template you can use this mutex only to synchronize threads within the same process but not different processes.

source/Host/windows/Mutex.cpp
118–121 ↗(On Diff #23940)

Removed

137 ↗(On Diff #23940)

Done

ovyalov updated this revision to Diff 23968.Apr 17 2015, 2:56 PM
ovyalov edited edge metadata.
  • Added mutex name lldb prefix
  • Minor fixes in windows::NamedMutex

Windows stuff looks good, only minor comments about the namespace thing.

include/lldb/Host/Mutex.h
260–261 ↗(On Diff #23940)

Makes sense about the pid. I didn't know about this Global\ Local\ thing before, but I looked it up and seems you're right. If I understand this correctly, then suppose 2 users are connected via Remote Desktop to a Windows machine, and you run LLDB. If you put the mutex in the global namespace, then both users will see the same mutex. If you just use the default though, each will have their own mutex even though the names are the same.

It seems like the latter is what we want. You're trying to solve the race condition in the case of all these different processes running in the same terminal session, so I don't see a need for cross-session synchronization. If I have it correct, then it seems we should remove the Global\.

Ahh, actually maybe we actually do want global in this case. I was thinking that different users would be trying to synchronize among different files, but it would probably be the same file.

Ahh, actually maybe we actually do want global in this case. I was thinking that different users would be trying to synchronize among different files, but it would probably be the same file.

I think we need to have Global prefix because if an user is logged in simultaneously via RDP and in interactive session he/she uses the same profile directory - so, in case of LLDB temp directory will be %HOMEPATH%\AppData\Temp\.... To make sure that both of them are synchronized on the same mutex to access temp directory Global prefix is required.

Let me put on hold this CL since there are some serious issues regarding using POSIX semaphores since they don't recover after process crash - will think more about it and will send a patch later.

ovyalov updated this revision to Diff 25088.May 6 2015, 2:33 PM
ovyalov retitled this revision from Use named mutexes to synchronize access to ModuleCache. to Use file locks to synchronize access to ModuleCache..
ovyalov edited the test plan for this revision. (Show Details)

Decided to stick to file locks since they support process crash recovery both in Posix and Windows implementations.
Main changes:

  • Use %UUID%/.lock file for synchronization purposes.
  • Download a module to %UUID%/.temp file so iut can be moved to a final file in the same directory without need to copy it.
  • Obtain a file lock as a very first step in GetAndPut - just as a safety precaution if rename operation isn't atomic.
clayborg accepted this revision.May 6 2015, 2:54 PM
clayborg edited edge metadata.

Looks good.

This revision is now accepted and ready to land.May 6 2015, 2:54 PM
ovyalov closed this revision.May 7 2015, 8:33 AM

AFFECTED FILES

/lldb/trunk/include/lldb/Host/LockFile.h
/lldb/trunk/include/lldb/Host/LockFileBase.h
/lldb/trunk/include/lldb/Host/posix/LockFilePosix.h
/lldb/trunk/include/lldb/Host/windows/LockFileWindows.h
/lldb/trunk/lldb.xcodeproj/project.pbxproj
/lldb/trunk/source/Host/CMakeLists.txt
/lldb/trunk/source/Host/common/LockFileBase.cpp
/lldb/trunk/source/Host/posix/LockFilePosix.cpp
/lldb/trunk/source/Host/windows/LockFileWindows.cpp
/lldb/trunk/source/Target/Platform.cpp
/lldb/trunk/source/Utility/ModuleCache.cpp
/lldb/trunk/source/Utility/ModuleCache.h

USERS

ovyalov (Author)

http://reviews.llvm.org/rL236736