Concurrent remote test runs may attempt to read and write from module cache simultaneously.
This CL intends to fix such races by using file-based locks - module id is used as a lock file name providing synchronization on per-module basis.
Details
Diff Detail
Event Timeline
Would a mutex not work? At least on Windows we can have mutexes that can synchronize among different processes. LockFile will lock a file for exclusive access by a process, but if another process fails to acquire the lock, we won't be able to distinguish the failure from an a different kind of error, like an I/O error.
Also I've never really liked the concept of lock files to begin with, because it requires a cleanup step which won't happen if the program crashes. If there's a way to make this work with a true synchronization primitive I think that would be better. On Windows you do this by just calling CreateMutex and passing a non-null value for the name. Not sure on other platforms.
Thanks for suggestion - mutex sounds like a good alternative. On the other hand, AFAIK file locks are capable to handle program crash properly and to revoke ownership from terminated process.
There are POSIX named semaphores which might be used as system-wide mutexes but I don't know for sure how widely they are supported on all platforms - file-based locks seems to be pretty widely adopted.
If you feel strongly against using file locks on Windows I can rewrite CL with mutexes - otherwise, we can do it iteratively:
- Make initial version of this CL based on file locks to enable concurrent test runs.
- I will migrate solution to use mutex soon (within a month) .
If it's able to handle process crashing without leaving lock files around, maybe it's fine as is. I've added some additional suggestions. I think there's still one more issue that stands out, which is that it's probably better to use the existing file hadnle that is already contained in the Module rather than calling CreateFile(..OPEN_EXISTING..) but I think that will be difficult to fix, as it's not easy to access the HANDLE being used internally by the Module. I don't think it will be broken as is, so I'm mostly just adding this as a note. Eventually one of us on the Windows side can fix this when it becomes an issue (although you're more than welcome to try if you want). I think it's the DataBuffer in the ObjectFile that has a file handle open already, because it's mmap'ed part of the object file. So the best fix would be to somehow get that HANDLE that was used to create the memory map, and use that for the call to LockFileEx.
source/Host/windows/LockFileWindows.cpp | ||
---|---|---|
27–29 ↗ | (On Diff #23869) | You can get by without creating an event handle at all if you use GetOverlappedResult() instead of WaitForSingleObject. I slightly prefer that to using the event handle, especially if this is going to be happening very frequently. If you decide to do that, just leave the hEvent initialized to 0 and change the WaitForSingleObject to GetOverlappedResult. |
53 ↗ | (On Diff #23869) | Remove this call to _close, as as m_file should refer to the original module file, not a separate "lock file" (see later where I elaborate on this). |
91–93 ↗ | (On Diff #23869) | Same as before, we can remove the event handle and just use GetOverlappedResult |
source/Utility/ModuleCache.cpp | ||
150–153 | Per the previous comments, can move all of this code into LockFilePosix. | |
source/Utility/ModuleCache.h | ||
75–76 ↗ | (On Diff #23869) | So the way locking a file works on windows, is that if you want to lock "foo.exe", you get a handle to foo.exe and you lock that. You don't make ".lock/<lock file name>" and then lock that, which is what it looks like you're doing. Because if you do that, you've got a totally new file, and then the ranges that you're passing to LockFileEx and UnlockFileEx don't make sense in the context of this new file. Does this work on Linux? How does it know that even though your fd refers to .lock/<lock file name>, the range you're specifying refers to the original file? In any case, maybe the best thing to do is have all this logic about building up the lock file name be internal to the LockFilePosix / LockFileWindows implementation. Somehow you will have to pass a HANDLE to the actual module file. If there's already a HANDLE deep down inside of the ModuleSP somewhere, best to use that. Otherwise you can probably just call CreateFile(... OPEN_EXISTING ...) on the module spec. But all of this can be internal to the FileLock class. Just have a constructor that takes a FileSpec to the thing you want to lock. On the linux side do the subdirectory thing, and on the Windows side just call CreateFile() on that path directly. |
source/Host/windows/LockFileWindows.cpp | ||
---|---|---|
53 ↗ | (On Diff #23869) | Sorry, a little clarification on this. If you call CreateFile(...OPEN_EXISTING...) then change this to CloseHandle(). If you end up figuring out a way to get the existing handle that the Module is using, then don't call CloseHandle when you release the lock. |
This doesn't seem to take care of the case:
process 1: calls ModuleCache::GetAndPut() on line 134 and then gets a path to the file lock file and creates and locks it
process 1: Starts getting the file and writing it to disk by creating it and writes a half of the file's bytes to the cache location "/tmp/usr/lib/libfoo.so"
process 2: calls ModuleCache::GetAndPut() on line 134 and notices there is a "/tmp/usr/lib/libfoo.so" in the cache and it returns the incomplete copy of the file
Don't we need to take the lock before the first call to "auto error = Get (root_dir_spec" on line 134?
If this is an issue we could read to a temp file first ("/tmp/usr/lib/libfoo.so.<UUID>") and then move it back to "/tmp/usr/lib/libfoo.so" atomically after we are done writing the file contents, and then unlock the file lock. This would prevent people from getting an incomplete copy in process 2?
Thanks for catching this - I missed some race conditions:(
Since there is additional race maybe introduced by lock file removal (when one process is about to delete lock file and another process is going to open it) I decided to follow Zach advise and use mutexes.
- Converted solution to use named mutexes instead of file locks.
- Fixed race condition in GetAndPut by moving lock up to the beginning of the method.
include/lldb/Host/Mutex.h | ||
---|---|---|
36 | I think we should change from returning int to either returning or taking by reference an lldb_private::Error. The same applies for the rest of the functions. The same applies for all the rest of the functions. I know you are just copying the old API format, but maybe if it's not too much additional churn this would be a good time to convert this API to use the Error class. | |
50 | If we change from int to lldb_private::Error, then this failure_message can go away. It's kind of a bad idea to have default parameters on virtual methods anwyay, because depending on who derives from it, you might get different default values depending on what type of pointer you call through, which is confusing. | |
260–261 | Just curious, what is the purpose of this Global prefix? Windows doesn't require it, does posix require this "/" at the beginning? It would be nice if the names were at least somewhat consistent. Also it would be good if the mutexes could be "namespaced" so to speak, so that if you run some shell command or tool to display all the mutexes on the system, we can easily identify an LLDB mutex from some other mutex. With that in mind, what about a template like "lldb-$pid-$mutex_name | |
source/Host/windows/Mutex.cpp | ||
118–121 | HANDLE is already a typedef of void*, is this necessary? | |
137 | Might as well use nullptr instead of NULL for consistency. |
include/lldb/Host/Mutex.h | ||
---|---|---|
36 | I support this idea of using Error instead of int but it seems too many things will be involved in change - ModuleList, Debugger, ConnectionFileDescriptor, GDBRemote* classes,... | |
260–261 | Global means that mutex will be visible within all user's Terminal sessions - it's rather optional thing. I agree that templated name convention makes sense but I'm not sure about $pid - mutex/semaphore ownership is transferred from process to process and with such naming scheme $pid will reflect the pid of creator which might be already dead. | |
source/Host/windows/Mutex.cpp | ||
118–121 | Removed | |
137 | Done |
Windows stuff looks good, only minor comments about the namespace thing.
include/lldb/Host/Mutex.h | ||
---|---|---|
260–261 | Makes sense about the pid. I didn't know about this Global\ Local\ thing before, but I looked it up and seems you're right. If I understand this correctly, then suppose 2 users are connected via Remote Desktop to a Windows machine, and you run LLDB. If you put the mutex in the global namespace, then both users will see the same mutex. If you just use the default though, each will have their own mutex even though the names are the same. It seems like the latter is what we want. You're trying to solve the race condition in the case of all these different processes running in the same terminal session, so I don't see a need for cross-session synchronization. If I have it correct, then it seems we should remove the Global\. |
Ahh, actually maybe we actually do want global in this case. I was thinking that different users would be trying to synchronize among different files, but it would probably be the same file.
I think we need to have Global prefix because if an user is logged in simultaneously via RDP and in interactive session he/she uses the same profile directory - so, in case of LLDB temp directory will be %HOMEPATH%\AppData\Temp\.... To make sure that both of them are synchronized on the same mutex to access temp directory Global prefix is required.
Let me put on hold this CL since there are some serious issues regarding using POSIX semaphores since they don't recover after process crash - will think more about it and will send a patch later.
Decided to stick to file locks since they support process crash recovery both in Posix and Windows implementations.
Main changes:
- Use %UUID%/.lock file for synchronization purposes.
- Download a module to %UUID%/.temp file so iut can be moved to a final file in the same directory without need to copy it.
- Obtain a file lock as a very first step in GetAndPut - just as a safety precaution if rename operation isn't atomic.
AFFECTED FILES
/lldb/trunk/include/lldb/Host/LockFile.h /lldb/trunk/include/lldb/Host/LockFileBase.h /lldb/trunk/include/lldb/Host/posix/LockFilePosix.h /lldb/trunk/include/lldb/Host/windows/LockFileWindows.h /lldb/trunk/lldb.xcodeproj/project.pbxproj /lldb/trunk/source/Host/CMakeLists.txt /lldb/trunk/source/Host/common/LockFileBase.cpp /lldb/trunk/source/Host/posix/LockFilePosix.cpp /lldb/trunk/source/Host/windows/LockFileWindows.cpp /lldb/trunk/source/Target/Platform.cpp /lldb/trunk/source/Utility/ModuleCache.cpp /lldb/trunk/source/Utility/ModuleCache.h
USERS
ovyalov (Author)
I think we should change from returning int to either returning or taking by reference an lldb_private::Error. The same applies for the rest of the functions.
The same applies for all the rest of the functions. I know you are just copying the old API format, but maybe if it's not too much additional churn this would be a good time to convert this API to use the Error class.