This is an archive of the discontinued LLVM Phabricator instance.

[libclang] Add API to set temporary directory location
AbandonedPublic

Authored by vedgy on Dec 10 2022, 12:04 PM.

Details

Reviewers
aaron.ballman
Summary

The added function clang_setTemporaryDirectory() works as expected in
KDevelop: https://invent.kde.org/kdevelop/kdevelop/-/merge_requests/283

Fixes: https://github.com/llvm/llvm-project/issues/51847

Diff Detail

Event Timeline

vedgy created this revision.Dec 10 2022, 12:04 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 10 2022, 12:04 PM
vedgy requested review of this revision.Dec 10 2022, 12:04 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptDec 10 2022, 12:04 PM
vedgy updated this revision to Diff 481913.Dec 11 2022, 8:38 AM
vedgy edited the summary of this revision. (Show Details)

Extract identical code from the two Path.inc files into Path.cpp

One of the Path.inc files is #include-d into this Path.cpp file and nowhere else.

I'm sympathetic to the problem you're trying to solve (not having to set environment variable for the temp directory), but I'm also not convinced this is the correct way to approach it. This feels like a configuration property of the libclang execution -- so it'd be nice to require it to be set up from clang_createIndex() (IIRC that's the entrypoint for CIndex functionality, but I'm not 100% sure), rather than an API that the user can call repeatedly. Did you consider a design like that?

clang/docs/ReleaseNotes.rst
863
clang/include/clang-c/Index.h
78–79

Should we mention anything about relative vs absolute path requirements? Separators?

llvm/include/llvm/Support/Path.h
423

Err, I'm not super excited about this new API. For starters, it's not setting the system temp directory at all (it doesn't make any modifications to the host system); instead it overrides the system temp directory. But also, this is pretty fragile due to allowing the user to override the temp directory after the compiler has already queried for the system temp directory, so now you get files in two different places. Further, it's fragile because the caller is responsible for keeping that pointer valid for the lifetime of the program. Finally, we don't allow you to override any other system directory that you can query (like the home directory).

vedgy added a comment.Jan 4 2023, 2:03 AM

This feels like a configuration property of the libclang execution -- so it'd be nice to require it to be set up from clang_createIndex() (IIRC that's the entrypoint for CIndex functionality, but I'm not 100% sure), rather than an API that the user can call repeatedly.

In order to keep backward compatibility, this would require introducing another createIndex function, e.g. clang_createIndexWithTempDir(). clang_disposeIndex() would have to un-override the temporary directory then. I'll have to check whether this un-overriding happens too early (before all preamble-*.pch files are removed). But notice that separate setup functions already exist: clang_CXIndex_setGlobalOptions(), clang_CXIndex_setInvocationEmissionPathOption(). The documentation for clang_setTemporaryDirectory() can recommend calling it before clang_createIndex().
Either alternative works for KDevelop, because it calls clang_createIndex() once.

clang/include/clang-c/Index.h
78–79

On Windows separators are converted automatically. I suppose we don't need to warn users not to pass Windows-style separators on Unix.

On Windows this function handles both relative and absolute paths. On Unix the existing implementation doesn't check whether the path is absolute or relative. Perhaps it assumes that the path in the environment variable is always absolute. Or absolute vs relative doesn't matter. I'll check what happens if a relative path is passed.

llvm/include/llvm/Support/Path.h
423

it's not setting the system temp directory at all (it doesn't make any modifications to the host system); instead it overrides the system temp directory.

Rename to override_system_temp_directory_erased_on_reboot()?

this is pretty fragile due to allowing the user to override the temp directory after the compiler has already queried for the system temp directory, so now you get files in two different places.

Unfortunately, I don't know how to prevent this. In practice calling clang_setTemporaryDirectory() before clang_createIndex() works perfectly well in KDevelop: the preamble-*.pch files are created later and never end up in the default temporary directory /tmp. The same issue applies to the current querying of the environment variable values repeatedly: the user can set environment variables at any time.

it's fragile because the caller is responsible for keeping that pointer valid for the lifetime of the program.

I'd love to duplicate the temporary directory path buffer in Path.cpp. The API would become much easier to use. But then the buffer would have to be destroyed when libclang is unloaded or on program shutdown, which is forbidden by LLVM Coding Standards: Do not use Static Constructors. That is, I think the following code in Path.cpp is not acceptable:

static SmallVectorImpl<char> &tempDirErasedOnRebootUtf8()
{
  static SmallVector<char> value;
  return value;
}

we don't allow you to override any other system directory that you can query (like the home directory).

Well, one has to start somewhere :) Seriously, overriding directories should be allowed only if it has a practical use case. Not just because overriding others is allowed.

vedgy added a comment.Jan 4 2023, 9:55 AM

clang_disposeIndex() would have to un-override the temporary directory then. I'll have to check whether this un-overriding happens too early (before all preamble-*.pch files are removed).

Checked by calling clang_setTemporaryDirectory(nullptr); right after clang_disposeIndex(m_index);. Works correctly in KDevelop: all preamble-*.pch files are removed from the overriding temporary directory on exit. So this alternative API is viable.

clang/include/clang-c/Index.h
78–79

Tested passing a relative path on GNU/Linux: works as expected. libclang stores and removes the preamble-*.pch in the same absolute path as QDir::absolutePath() returns in KDevelop. Since passing any reasonably formatted path to this function works, I suppose mentioning the path requirements in the documentation is unnecessary.

While testing, I noticed another requirement that should be mentioned in this documentation, as well as set_system_temp_directory_erased_on_reboot's documentation: the temporary directory must exist, because Clang doesn't create it. I couldn't find where Clang writes the preamble files when the temporary directory does not exist. Perhaps it doesn't write them to disk at all.

vedgy marked an inline comment as done and 2 inline comments as not done.Jan 6 2023, 1:23 AM
vedgy added inline comments.
llvm/include/llvm/Support/Path.h
423

A copy of user-provided temporary directory path buffer can be stored in class CIndexer. But this API in llvm/Support/Path.h would stay fragile.

Given that we already have other setters to set global state, I'm less concerned about adding another one. I hadn't realized we already introduced the dangers here. But we should document the expectation that the call be made before creating the index.

In terms of the C API, I think it'd make more sense to name in terms of "override" rather than "set", but I don't feel as strongly about it given the other setters. In terms of the C++ file system API, I think "override" makes the most sense though (we don't have setters to follow the naming convention for) because some systems do allow you to set the system directory.

In terms of memory ownership, WDYT of requiring the caller to handle this? e.g., calling set_system_temp_directory_erased_on_reboot will strdup a nonnull pointer and free the stored pointer when given nullptr.

vedgy marked an inline comment as not done.Jan 9 2023, 9:13 AM

In terms of the C API, I think it'd make more sense to name in terms of "override" rather than "set", but I don't feel as strongly about it given the other setters. In terms of the C++ file system API, I think "override" makes the most sense though (we don't have setters to follow the naming convention for) because some systems do allow you to set the system directory.

Let's keep the naming in C and C++ APIs consistent: clang_overrideTemporaryDirectory() and override_system_temp_directory_erased_on_reboot().

In terms of memory ownership, WDYT of requiring the caller to handle this? e.g., calling set_system_temp_directory_erased_on_reboot will strdup a nonnull pointer and free the stored pointer when given nullptr.

I like this idea. libclang-user code would become easier to use than it is now (though less easy compared to libclang managing memory itself). The libclang API documentation can require overriding the temp directory before creating an index and un-overriding it with nullptr after calling clang_disposeIndex().
Now in order to make this libclang API harder to misuse, I lean towards passing the temporary directory in clang_createIndexWithTempDir() and letting clang_disposeIndex() handle the un-overriding (call override_system_temp_directory_erased_on_reboot(nullptr)) automatically. Makes sense? I feel that the clang_createIndexWithTempDir() name could be improved, but don't know how...

What memory [de]allocation method should override_system_temp_directory_erased_on_reboot() use? new[] and delete[]? Or should strdup() from POSIX be used because it is defined as _strdup on Windows? (along with standard free())

vedgy added a comment.Jan 10 2023, 1:29 AM

Given that we already have other setters to set global state, I'm less concerned about adding another one. I hadn't realized we already introduced the dangers here. But we should document the expectation that the call be made before creating the index.

There is a difference: clang_CXIndex_setGlobalOptions() and clang_CXIndex_setInvocationEmissionPathOption() take a CXIndex argument and thus can only be called after creating an index. So requiring to call clang_overrideTemporaryDirectory() before creating an index, plus one more time with nullptr argument after disposing of the index, wouldn't be particularly consistent with other setters.

Given that we already have other setters to set global state, I'm less concerned about adding another one. I hadn't realized we already introduced the dangers here. But we should document the expectation that the call be made before creating the index.

There is a difference: clang_CXIndex_setGlobalOptions() and clang_CXIndex_setInvocationEmissionPathOption() take a CXIndex argument and thus can only be called after creating an index. So requiring to call clang_overrideTemporaryDirectory() before creating an index, plus one more time with nullptr argument after disposing of the index, wouldn't be particularly consistent with other setters.

Oh that is a good point! Apologies for not noticing that earlier -- that makes me wonder if a better approach here is to add a std::string to the CIndexer class to represent the temp path to use. I started investigating that idea and then I realized I have no idea what is calling system_temp_directory() in the first place. It doesn't seem to be anything from the C indexing API but it's possible this is coming from one of the other library components. Can you help me track down the call stack where we start creating the temporary files so I can better understand? My hope is that we can thread the information through rather than using a global setter, if possible.

vedgy added a comment.EditedJan 10 2023, 8:31 AM

Oh that is a good point! Apologies for not noticing that earlier -- that makes me wonder if a better approach here is to add a std::string to the CIndexer class to represent the temp path to use.

I have suggested the possibility in this review:

A copy of user-provided temporary directory path buffer can be stored in class CIndexer. But this API in llvm/Support/Path.h would stay fragile.

So the question is whether the LLVM API or CIndexer should store the buffer.

The only possible caller of system_temp_directory() used by libclang is llvm::sys::fs::createUniquePath(). This helper function is called by createUniqueEntity(), which in turn is called by several other helper functions, all in llvm/lib/Support/Path.cpp. Here is a backtrace from KDevelop built against LLVM at this review's branch:

#0 llvm::sys::path::system_temp_directory(bool, llvm::SmallVectorImpl<char>&) (ErasedOnReboot=true, Result=...) at llvm-project/llvm/lib/Support/Unix/Path.inc:1451
#1 0x00007fb372a55d60 in llvm::sys::fs::createUniquePath(llvm::Twine const&, llvm::SmallVectorImpl<char>&, bool) (Model=<optimized out>, ResultPath=..., MakeAbsolute=<optimized out>) at llvm-project/llvm/lib/Support/Path.cpp:811
#2 0x00007fb372a55eb7 in createUniqueEntity(llvm::Twine const&, int&, llvm::SmallVectorImpl<char>&, bool, FSEntity, llvm::sys::fs::OpenFlags, unsigned int) (Model=..., ResultFD=@0x7fb36effc2d0: 0, ResultPath=..., MakeAbsolute=<optimized out>, Type=FS_File, Flags=llvm::sys::fs::OF_None, Mode=384) at llvm-project/llvm/lib/Support/Path.cpp:181
#3 0x00007fb372a5623b in llvm::sys::fs::createTemporaryFile (Flags=llvm::sys::fs::OF_None, Type=FS_File, ResultPath=..., ResultFD=@0x7fb36effc2d0: 0, Model=...) at llvm-project/llvm/include/llvm/ADT/Twine.h:233
#4 llvm::sys::fs::createTemporaryFile(llvm::Twine const&, llvm::StringRef, int&, llvm::SmallVectorImpl<char>&, FSEntity, llvm::sys::fs::OpenFlags) (Prefix=<optimized out>, Suffix=..., ResultFD=@0x7fb36effc2d0: 0, ResultPath=..., Type=Type@entry=FS_File, Flags=llvm::sys::fs::OF_None) at llvm-project/llvm/lib/Support/Path.cpp:865
#5 0x00007fb372a56451 in llvm::sys::fs::createTemporaryFile(llvm::Twine const&, llvm::StringRef, int&, llvm::SmallVectorImpl<char>&, llvm::sys::fs::OpenFlags) (Prefix=<optimized out>, Suffix=..., ResultFD=<optimized out>, ResultPath=<optimized out>, Flags=<optimized out>) at llvm-project/llvm/lib/Support/Path.cpp:873
#6 0x00007fb3717cc1b7 in (anonymous namespace)::TempPCHFile::create () at llvm-project/llvm/include/llvm/ADT/Twine.h:233
#7 clang::PrecompiledPreamble::Build(clang::CompilerInvocation const&, llvm::MemoryBuffer const*, clang::PreambleBounds, clang::DiagnosticsEngine&, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::shared_ptr<clang::PCHContainerOperations>, bool, clang::PreambleCallbacks&) (Invocation=..., MainFileBuffer=0x7fb35000a360, Bounds=Bounds@entry=..., Diagnostics=..., VFS=..., PCHContainerOps=std::shared_ptr<clang::PCHContainerOperations> (use count 5, weak count 0) = {...}, StoreInMemory=false, Callbacks=...) at llvm-project/clang/lib/Frontend/PrecompiledPreamble.cpp:421
#8 0x00007fb3717234a8 in clang::ASTUnit::getMainBufferWithPrecompiledPreamble(std::shared_ptr<clang::PCHContainerOperations>, clang::CompilerInvocation&, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, bool, unsigned int) (this=0x7fb3505c44d0, PCHContainerOps=std::shared_ptr<clang::PCHContainerOperations> (use count 5, weak count 0) = {...}, PreambleInvocationIn=..., VFS=..., AllowRebuild=<optimized out>, MaxLines=0) at /usr/include/c++/12.2.0/bits/unique_ptr.h:191
#9 0x00007fb371729bd6 in clang::ASTUnit::LoadFromCompilerInvocation(std::shared_ptr<clang::PCHContainerOperations>, unsigned int, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>) (this=0x7fb3505c44d0, PCHContainerOps=std::shared_ptr<clang::PCHContainerOperations> (use count 5, weak count 0) = {...}, PrecompilePreambleAfterNParses=<optimized out>, VFS=...) at llvm-project/clang/lib/Frontend/ASTUnit.cpp:1685
#10 0x00007fb37172e359 in clang::ASTUnit::LoadFromCommandLine(char const**, char const**, std::shared_ptr<clang::PCHContainerOperations>, llvm::IntrusiveRefCntPtr<clang::DiagnosticsEngine>, llvm::StringRef, bool, clang::CaptureDiagsKind, llvm::ArrayRef<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::MemoryBuffer*> >, bool, unsigned int, clang::TranslationUnitKind, bool, bool, bool, clang::SkipFunctionBodiesScope, bool, bool, bool, bool, llvm::Optional<llvm::StringRef>, std::unique_ptr<clang::ASTUnit, std::default_delete<clang::ASTUnit> >*, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>) (ArgBegin=<optimized out>, ArgEnd=<optimized out>, PCHContainerOps=std::shared_ptr<clang::PCHContainerOperations> (empty) = {...}, Diags=..., ResourceFilesPath=..., OnlyLocalDecls=false, CaptureDiagnostics=clang::CaptureDiagsKind::All, RemappedFiles=..., RemappedFilesKeepOriginalName=true, PrecompilePreambleAfterNParses=1, TUKind=clang::TU_Complete, CacheCodeCompletionResults=true, IncludeBriefCommentsInCodeCompletion=false, AllowPCHWithCompilerErrors=true, SkipFunctionBodies=clang::SkipFunctionBodiesScope::None, SingleFileParse=false, UserFilesAreVolatile=true, ForSerialization=false, RetainExcludedConditionalBlocks=false, ModuleFormat=..., ErrAST=0x7fb36effd2d8, VFS=...) at llvm-project/clang/lib/Frontend/ASTUnit.cpp:1822
#11 0x00007fb37104e462 in clang_parseTranslationUnit_Impl(CXIndex, char const*, char const* const*, int, llvm::ArrayRef<CXUnsavedFile>, unsigned int, CXTranslationUnit*) (CIdx=0x55dbe16c9360, source_filename=<optimized out>, command_line_args=<optimized out>, num_command_line_args=<optimized out>, unsaved_files=..., options=781, out_TU=0x7fb35ca7f630) at /usr/include/c++/12.2.0/bits/stl_vector.h:987
#12 0x00007fb37104f2b4 in operator() (__closure=0x7fb37a7b6f40) at llvm-project/clang/tools/libclang/CIndex.cpp:3976
#13 llvm::function_ref<void()>::callback_fn<clang_parseTranslationUnit2FullArgv(CXIndex, char const*, char const* const*, int, CXUnsavedFile*, unsigned int, unsigned int, CXTranslationUnitImpl**)::<lambda()> >(intptr_t) (callable=callable@entry=140408830783296) at llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:45
#14 0x00007fb3729cc6e0 in llvm::function_ref<void ()>::operator()() const (this=<synthetic pointer>) at llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:68
#15 llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) (this=<optimized out>, Fn=...) at llvm-project/llvm/lib/Support/CrashRecoveryContext.cpp:433
#16 0x00007fb3729cc724 in RunSafelyOnThread_Dispatch(void*) (UserData=0x7fb37a7b6e80) at llvm-project/llvm/lib/Support/CrashRecoveryContext.cpp:514
#17 0x00007fb3729cc19a in llvm::thread::Apply<void (*)(void*), (anonymous namespace)::RunSafelyOnThreadInfo*, 0> (Callee=std::tuple containing = {...}) at llvm-project/llvm/include/llvm/Support/thread.h:42
#18 llvm::thread::GenericThreadProxy<std::tuple<void (*)(void*), (anonymous namespace)::RunSafelyOnThreadInfo*> > (Ptr=0x55dbe3921080) at llvm-project/llvm/include/llvm/Support/thread.h:50
#19 llvm::thread::ThreadProxy<std::tuple<void (*)(void*), (anonymous namespace)::RunSafelyOnThreadInfo*> >(void*) (Ptr=0x55dbe3921080) at llvm-project/llvm/include/llvm/Support/thread.h:60
#20 0x00007fb3e909f8fd in () at /usr/lib/libc.so.6
#21 0x00007fb3e9121a60 in () at /usr/lib/libc.so.6

Oh that is a good point! Apologies for not noticing that earlier -- that makes me wonder if a better approach here is to add a std::string to the CIndexer class to represent the temp path to use.

I have suggested the possibility in this review:

A copy of user-provided temporary directory path buffer can be stored in class CIndexer. But this API in llvm/Support/Path.h would stay fragile.

So the question is whether the LLVM API or CIndexer should store the buffer.

My goal is to not modify the LLVM path APIs at all.

The only possible caller of system_temp_directory() used by libclang is llvm::sys::fs::createUniquePath(). This helper function is called by createUniqueEntity(), which in turn is called by several other helper functions, all in llvm/lib/Support/Path.cpp. Here is a backtrace from KDevelop built against LLVM at this review's branch:

....
#6 0x00007fb3717cc1b7 in (anonymous namespace)::TempPCHFile::create () at llvm-project/llvm/include/llvm/ADT/Twine.h:233
#7 clang::PrecompiledPreamble::Build(clang::CompilerInvocation const&, llvm::MemoryBuffer const*, clang::PreambleBounds, clang::DiagnosticsEngine&, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, std::shared_ptr<clang::PCHContainerOperations>, bool, clang::PreambleCallbacks&) (Invocation=..., MainFileBuffer=0x7fb35000a360, Bounds=Bounds@entry=..., Diagnostics=..., VFS=..., PCHContainerOps=std::shared_ptr<clang::PCHContainerOperations> (use count 5, weak count 0) = {...}, StoreInMemory=false, Callbacks=...) at llvm-project/clang/lib/Frontend/PrecompiledPreamble.cpp:421
#8 0x00007fb3717234a8 in clang::ASTUnit::getMainBufferWithPrecompiledPreamble(std::shared_ptr<clang::PCHContainerOperations>, clang::CompilerInvocation&, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>, bool, unsigned int) (this=0x7fb3505c44d0, PCHContainerOps=std::shared_ptr<clang::PCHContainerOperations> (use count 5, weak count 0) = {...}, PreambleInvocationIn=..., VFS=..., AllowRebuild=<optimized out>, MaxLines=0) at /usr/include/c++/12.2.0/bits/unique_ptr.h:191
#9 0x00007fb371729bd6 in clang::ASTUnit::LoadFromCompilerInvocation(std::shared_ptr<clang::PCHContainerOperations>, unsigned int, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>) (this=0x7fb3505c44d0, PCHContainerOps=std::shared_ptr<clang::PCHContainerOperations> (use count 5, weak count 0) = {...}, PrecompilePreambleAfterNParses=<optimized out>, VFS=...) at llvm-project/clang/lib/Frontend/ASTUnit.cpp:1685
#10 0x00007fb37172e359 in clang::ASTUnit::LoadFromCommandLine(char const**, char const**, std::shared_ptr<clang::PCHContainerOperations>, llvm::IntrusiveRefCntPtr<clang::DiagnosticsEngine>, llvm::StringRef, bool, clang::CaptureDiagsKind, llvm::ArrayRef<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, llvm::MemoryBuffer*> >, bool, unsigned int, clang::TranslationUnitKind, bool, bool, bool, clang::SkipFunctionBodiesScope, bool, bool, bool, bool, llvm::Optional<llvm::StringRef>, std::unique_ptr<clang::ASTUnit, std::default_delete<clang::ASTUnit> >*, llvm::IntrusiveRefCntPtr<llvm::vfs::FileSystem>) (ArgBegin=<optimized out>, ArgEnd=<optimized out>, PCHContainerOps=std::shared_ptr<clang::PCHContainerOperations> (empty) = {...}, Diags=..., ResourceFilesPath=..., OnlyLocalDecls=false, CaptureDiagnostics=clang::CaptureDiagsKind::All, RemappedFiles=..., RemappedFilesKeepOriginalName=true, PrecompilePreambleAfterNParses=1, TUKind=clang::TU_Complete, CacheCodeCompletionResults=true, IncludeBriefCommentsInCodeCompletion=false, AllowPCHWithCompilerErrors=true, SkipFunctionBodies=clang::SkipFunctionBodiesScope::None, SingleFileParse=false, UserFilesAreVolatile=true, ForSerialization=false, RetainExcludedConditionalBlocks=false, ModuleFormat=..., ErrAST=0x7fb36effd2d8, VFS=...) at llvm-project/clang/lib/Frontend/ASTUnit.cpp:1822
#11 0x00007fb37104e462 in clang_parseTranslationUnit_Impl(CXIndex, char const*, char const* const*, int, llvm::ArrayRef<CXUnsavedFile>, unsigned int, CXTranslationUnit*) (CIdx=0x55dbe16c9360, source_filename=<optimized out>, command_line_args=<optimized out>, num_command_line_args=<optimized out>, unsaved_files=..., options=781, out_TU=0x7fb35ca7f630) at /usr/include/c++/12.2.0/bits/stl_vector.h:987
#12 0x00007fb37104f2b4 in operator() (__closure=0x7fb37a7b6f40) at llvm-project/clang/tools/libclang/CIndex.cpp:3976
#13 llvm::function_ref<void()>::callback_fn<clang_parseTranslationUnit2FullArgv(CXIndex, char const*, char const* const*, int, CXUnsavedFile*, unsigned int, unsigned int, CXTranslationUnitImpl**)::<lambda()> >(intptr_t) (callable=callable@entry=140408830783296) at llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:45
#14 0x00007fb3729cc6e0 in llvm::function_ref<void ()>::operator()() const (this=<synthetic pointer>) at llvm-project/llvm/include/llvm/ADT/STLFunctionalExtras.h:68
#15 llvm::CrashRecoveryContext::RunSafely(llvm::function_ref<void ()>) (this=<optimized out>, Fn=...) at llvm-project/llvm/lib/Support/CrashRecoveryContext.cpp:433
#16 0x00007fb3729cc724 in RunSafelyOnThread_Dispatch(void*) (UserData=0x7fb37a7b6e80) at llvm-project/llvm/lib/Support/CrashRecoveryContext.cpp:514
#17 0x00007fb3729cc19a in llvm::thread::Apply<void (*)(void*), (anonymous namespace)::RunSafelyOnThreadInfo*, 0> (Callee=std::tuple containing = {...}) at llvm-project/llvm/include/llvm/Support/thread.h:42
#18 llvm::thread::GenericThreadProxy<std::tuple<void (*)(void*), (anonymous namespace)::RunSafelyOnThreadInfo*> > (Ptr=0x55dbe3921080) at llvm-project/llvm/include/llvm/Support/thread.h:50
#19 llvm::thread::ThreadProxy<std::tuple<void (*)(void*), (anonymous namespace)::RunSafelyOnThreadInfo*> >(void*) (Ptr=0x55dbe3921080) at llvm-project/llvm/include/llvm/Support/thread.h:60
#20 0x00007fb3e909f8fd in () at /usr/lib/libc.so.6
#21 0x00007fb3e9121a60 in () at /usr/lib/libc.so.6

Thank you for this trace, that helps identify where I think this functionality should live, which is CIndexer

At a point where we have a CIndexer object, we eventually call ASTUnit::LoadFromCommandLine() and ASTUnit has a member FileSystemOptions FileSystemOpts, and FileSystemOptions has a std::string for the working directory to use. I think we should store the temp directory in here as well, and when ASTUnit::getMainBufferWithPrecompiledPreamble() build the preamble, it can pass that information along to TempPCHFile to avoid needing to ask the system for the temp directory. This does mean we store two copies of the string (one in CIndexer and one in FileSystemOptions, but I think the improved ownership semantics for the C API makes that duplication somewhat reasonable.

Does this idea seem plausible to you?

At a point where we have a CIndexer object, we eventually call ASTUnit::LoadFromCommandLine() and ASTUnit has a member FileSystemOptions FileSystemOpts, and FileSystemOptions has a std::string for the working directory to use. I think we should store the temp directory in here as well, and when ASTUnit::getMainBufferWithPrecompiledPreamble() build the preamble, it can pass that information along to TempPCHFile to avoid needing to ask the system for the temp directory.

The path would have to be passed into several functions. TempPCHFile::create() would have to use createUniqueFile() instead of createTemporaryFile(). My greatest concern with this improved design is that libclang may use the temporary directory for other purposes - if not yet, then in the future. Another issue is with the FileSystemOptions part: this class is widely used in Clang. So adding a member to it could adversely affect performance of unrelated code.

This does mean we store two copies of the string (one in CIndexer and one in FileSystemOptions, but I think the improved ownership semantics for the C API makes that duplication somewhat reasonable.

If LLVM does not have its own shared buffer type, a std::shared_ptr<const char[]> could be used instead of std::string. Or SmallString (not shared, but avoids allocations given a not too long overriding path).

At a point where we have a CIndexer object, we eventually call ASTUnit::LoadFromCommandLine() and ASTUnit has a member FileSystemOptions FileSystemOpts, and FileSystemOptions has a std::string for the working directory to use. I think we should store the temp directory in here as well, and when ASTUnit::getMainBufferWithPrecompiledPreamble() build the preamble, it can pass that information along to TempPCHFile to avoid needing to ask the system for the temp directory.

The path would have to be passed into several functions.

Understood, and hopefully that doesn't make this a viral slog.

TempPCHFile::create() would have to use createUniqueFile() instead of createTemporaryFile(). My greatest concern with this improved design is that libclang may use the temporary directory for other purposes - if not yet, then in the future.

Is your concern essentially that someone will add a new use to put files in a temp directory and not have access to this information from ASTUnit? Or do you have other concerns in mind?

We should certainly investigate whether there are other uses of temp files from libclang as part of these changes. It's possible we'll find a situation that makes my suggestion untenable because we don't have easy access to the information we need.

Another issue is with the FileSystemOptions part: this class is widely used in Clang. So adding a member to it could adversely affect performance of unrelated code.

It's good to keep an eye on that, but I'm not too worried about the overhead from it (I don't see uses in AST nodes). We can keep an eye on https://llvm-compile-time-tracker.com/ to see if there is a surprising compile time increase though.

vedgy added a comment.Jan 11 2023, 8:16 AM

Is your concern essentially that someone will add a new use to put files in a temp directory and not have access to this information from ASTUnit? Or do you have other concerns in mind?

We should certainly investigate whether there are other uses of temp files from libclang as part of these changes. It's possible we'll find a situation that makes my suggestion untenable because we don't have easy access to the information we need.

The temporary directory could be used, now or in future, by some support code, which is used indirectly by libclang. I found the following uses potentially relevant to libclang:

  • Driver::GetTemporaryDirectory(StringRef Prefix) calls llvm::sys::fs::createUniqueDirectory(Prefix, Path);
  • StandardInstrumentations => DotCfgChangeReporter calls sys::fs::createUniquePath("cfgdot-%%%%%%.dot", SV, true);
  • There are plenty of createTemporaryFile() uses in llvm and clang. Some of them are likely used by libclang.

I don't know how to efficiently check whether or not each of these indirect system_temp_directory() uses is in turn used by libclang. Even if they aren't know, they might be later, when libclang API grows.

On a different note, do you think overriding temporary directory path is useful only to libclang and not likely to be useful to any other LLVM API users?

Another issue is with the FileSystemOptions part: this class is widely used in Clang. So adding a member to it could adversely affect performance of unrelated code.

It's good to keep an eye on that, but I'm not too worried about the overhead from it (I don't see uses in AST nodes). We can keep an eye on https://llvm-compile-time-tracker.com/ to see if there is a surprising compile time increase though.

[in case we pursue this design approach, which I currently don't feel is right] Why not add a temporary path data member into class ASTUnit under the FileSystemOptions FileSystemOpts member, not inside it? So that other code is not burdened with unused struct member, and to be able to use a more suitable type, like SmallString for the temporary path buffer.

Is your concern essentially that someone will add a new use to put files in a temp directory and not have access to this information from ASTUnit? Or do you have other concerns in mind?

We should certainly investigate whether there are other uses of temp files from libclang as part of these changes. It's possible we'll find a situation that makes my suggestion untenable because we don't have easy access to the information we need.

The temporary directory could be used, now or in future, by some support code, which is used indirectly by libclang. I found the following uses potentially relevant to libclang:

  • Driver::GetTemporaryDirectory(StringRef Prefix) calls llvm::sys::fs::createUniqueDirectory(Prefix, Path);
  • StandardInstrumentations => DotCfgChangeReporter calls sys::fs::createUniquePath("cfgdot-%%%%%%.dot", SV, true);
  • There are plenty of createTemporaryFile() uses in llvm and clang. Some of them are likely used by libclang.

I don't know how to efficiently check whether or not each of these indirect system_temp_directory() uses is in turn used by libclang. Even if they aren't know, they might be later, when libclang API grows.

Oof, that is more exposure than I was thinking we had...

On a different note, do you think overriding temporary directory path is useful only to libclang and not likely to be useful to any other LLVM API users?

I don't think there are any in-tree needs for that functionality, and the solution is fragile enough that I'd like to avoid it if possible (for example, it also makes the API much harder to use across threads because now it's accessing global state). That said, I think the libclang use case you have is compelling and worth solving in-tree if we can (so others don't have to find similar workarounds).

Another issue is with the FileSystemOptions part: this class is widely used in Clang. So adding a member to it could adversely affect performance of unrelated code.

It's good to keep an eye on that, but I'm not too worried about the overhead from it (I don't see uses in AST nodes). We can keep an eye on https://llvm-compile-time-tracker.com/ to see if there is a surprising compile time increase though.

[in case we pursue this design approach, which I currently don't feel is right] Why not add a temporary path data member into class ASTUnit under the FileSystemOptions FileSystemOpts member, not inside it? So that other code is not burdened with unused struct member, and to be able to use a more suitable type, like SmallString for the temporary path buffer.

That's certainly an option, but the design of that would be a bit strange in that we have some file system options in one place and other file system options in another place. I think ultimately, we're going to want them all to live on FileSystemOptions. That said, I'm going to rope in some more folks for a wider selection of opinions on how/if to proceed. CC @dblaikie @MaskRay @d0k

Is your concern essentially that someone will add a new use to put files in a temp directory and not have access to this information from ASTUnit? Or do you have other concerns in mind?

We should certainly investigate whether there are other uses of temp files from libclang as part of these changes. It's possible we'll find a situation that makes my suggestion untenable because we don't have easy access to the information we need.

The temporary directory could be used, now or in future, by some support code, which is used indirectly by libclang. I found the following uses potentially relevant to libclang:

  • Driver::GetTemporaryDirectory(StringRef Prefix) calls llvm::sys::fs::createUniqueDirectory(Prefix, Path);
  • StandardInstrumentations => DotCfgChangeReporter calls sys::fs::createUniquePath("cfgdot-%%%%%%.dot", SV, true);
  • There are plenty of createTemporaryFile() uses in llvm and clang. Some of them are likely used by libclang.

I don't know how to efficiently check whether or not each of these indirect system_temp_directory() uses is in turn used by libclang. Even if they aren't know, they might be later, when libclang API grows.

Oof, that is more exposure than I was thinking we had...

On a different note, do you think overriding temporary directory path is useful only to libclang and not likely to be useful to any other LLVM API users?

I don't think there are any in-tree needs for that functionality, and the solution is fragile enough that I'd like to avoid it if possible (for example, it also makes the API much harder to use across threads because now it's accessing global state). That said, I think the libclang use case you have is compelling and worth solving in-tree if we can (so others don't have to find similar workarounds).

Another issue is with the FileSystemOptions part: this class is widely used in Clang. So adding a member to it could adversely affect performance of unrelated code.

It's good to keep an eye on that, but I'm not too worried about the overhead from it (I don't see uses in AST nodes). We can keep an eye on https://llvm-compile-time-tracker.com/ to see if there is a surprising compile time increase though.

[in case we pursue this design approach, which I currently don't feel is right] Why not add a temporary path data member into class ASTUnit under the FileSystemOptions FileSystemOpts member, not inside it? So that other code is not burdened with unused struct member, and to be able to use a more suitable type, like SmallString for the temporary path buffer.

That's certainly an option, but the design of that would be a bit strange in that we have some file system options in one place and other file system options in another place. I think ultimately, we're going to want them all to live on FileSystemOptions. That said, I'm going to rope in some more folks for a wider selection of opinions on how/if to proceed. CC @dblaikie @MaskRay @d0k

Yeah, I think, at a quick glance, I'd lean towards FilesystemOptions too - it's mostly used as a member of FileManager, which already has a lot of members/fairly large footprint, so I wouldn't worry too much about this having a huge impact on the total memory usage/perf/etc.

I'm not sure how I feel about the general premise (though I don't have enough context/ownership to want to veto any direction/progress here, just thoughts in case they resonate):

  1. Should clang be doing something better with these temp files anyway, no matter the directory they go in? (setting them for cleanup at process exit or the like - I think we have such a filesystem abstraction, maybe that only works on Windows? I'm not sure)
  2. If there isn't a better general way to avoid creating temp trash that's a problem, I'd have thought that KDevelop/other tools would have issues with other temp files too, and want a more general solution (I'd have thought changing the process's temp directory, and changing it back for user process launches, would be sufficient - so it can cleanup after itself for all files, not just these particular clang-created ones)
MaskRay added a comment.EditedJan 11 2023, 3:45 PM

libclang functions like clang_reparseTranslationUnit_Impl call clang/lib/Frontend/ASTUnit.cpp:1397 and build the preamble with /*StoreInMemory=*/false.
If StoreInMemory is configurable (true), then you can avoid the temporary file preamble-*.pch.

clang/lib/Frontend/ASTUnit.cpp is used by clang/lib/Tooling/Tooling.cpp and libclang. Personally I think an unconditional /*StoreInMemory=*/true is fine. (The concern is slightly memory usage increase).

vedgy added a subscriber: milianw.Jan 12 2023, 12:51 AM
  1. Should clang be doing something better with these temp files anyway, no matter the directory they go in? (setting them for cleanup at process exit or the like - I think we have such a filesystem abstraction, maybe that only works on Windows? I'm not sure)

That'd be great, but appears unfeasible on GNU/Linux in case of a crash: https://stackoverflow.com/questions/471344/guaranteed-file-deletion-upon-program-termination-c-c

  1. If there isn't a better general way to avoid creating temp trash that's a problem, I'd have thought that KDevelop/other tools would have issues with other temp files too, and want a more general solution (I'd have thought changing the process's temp directory, and changing it back for user process launches, would be sufficient - so it can cleanup after itself for all files, not just these particular clang-created ones)

I do plan to put other temporary files KDevelop generates in the same session-specific temporary directory and clean it on start. But modifying KDevelop's temporary directory environment variable has been rejected during code review for good reason. As the bug this review aims to fix puts it:

setting the environment variables is problematic, because they are inherited by the IDE's code and all child processes it spawns (compiler, build system and user-provided executables). The IDE must then remove the temporary directory environment variable from each child process where it can cause undesirable behavior.

libclang functions like clang_reparseTranslationUnit_Impl call clang/lib/Frontend/ASTUnit.cpp:1397 and build the preamble with /*StoreInMemory=*/false.
If StoreInMemory is configurable (true), then you can avoid the temporary file preamble-*.pch.

clang/lib/Frontend/ASTUnit.cpp is used by clang/lib/Tooling/Tooling.cpp and libclang. Personally I think an unconditional /*StoreInMemory=*/true is fine. (The concern is slightly memory usage increase).

This is a simple and promising solution. But maybe it should be configurable. The StoreInMemory option was introduced for the benefit of clangd, and clangd still stores the preambles on disk by default. See D39843. I'm pinging another KDevelop developer who may know better how this should work in KDevelop: @milianw

  1. Should clang be doing something better with these temp files anyway, no matter the directory they go in? (setting them for cleanup at process exit or the like - I think we have such a filesystem abstraction, maybe that only works on Windows? I'm not sure)

That'd be great, but appears unfeasible on GNU/Linux in case of a crash: https://stackoverflow.com/questions/471344/guaranteed-file-deletion-upon-program-termination-c-c

  1. If there isn't a better general way to avoid creating temp trash that's a problem, I'd have thought that KDevelop/other tools would have issues with other temp files too, and want a more general solution (I'd have thought changing the process's temp directory, and changing it back for user process launches, would be sufficient - so it can cleanup after itself for all files, not just these particular clang-created ones)

I do plan to put other temporary files KDevelop generates in the same session-specific temporary directory and clean it on start. But modifying KDevelop's temporary directory environment variable has been rejected during code review for good reason. As the bug this review aims to fix puts it:

setting the environment variables is problematic, because they are inherited by the IDE's code and all child processes it spawns (compiler, build system and user-provided executables). The IDE must then remove the temporary directory environment variable from each child process where it can cause undesirable behavior.

Yeah, sorry, I didn't entirely follow/understand the discussion on the linked bug - part of the reason I brought it up here.

It seemed like the places where kdevelop would want to suppress the temp dir cleanup would be enumerable/more within kdevelop's control than instances of libraries kdevelop is using wanting to create their own unexposed temp files. But, yeah, would be a bunch of work to go and touch all those places. Ah well *shrug*

vedgy added a comment.Jan 12 2023, 8:52 PM

It seemed like the places where kdevelop would want to suppress the temp dir cleanup would be enumerable/more within kdevelop's control than instances of libraries kdevelop is using wanting to create their own unexposed temp files. But, yeah, would be a bunch of work to go and touch all those places. Ah well *shrug*

That may be true. But a few small-size generated files slipping into the system temporary directory is not much of a problem. /tmp is cleared on shutdown after all. Causing bugs in a user executable run from KDevelop is much worse.

After a discussion under the corresponding KDevelop merge request, I can see 4-6 alternative ways to address the temporary directory issue:

  1. Add an option to store the preamble-*.pch files in RAM instead of /tmp and add a corresponding option in KDevelop configuration UI. This would work perfectly for me, provided I don't change my mind and decide to turn this option off, in which case I'll be back to square one.
  2. Add an option to store the preamble-*.pch files in a custom directory instead of a general temporary directory. The option could be named generally (2a: overrideTemporaryDirectory) or specially (2b: setPreambleStoragePath). If the option is named generally, other temporary files created by libclang could be identified in the future and placed in the same directory without changing the API.
  3. 1 and 2 - the options can be propagated between identical end points together, so this combination is natural.
  4. The current patch. This is the easiest (already implemented) and most reliable (the temporary directory is definitely and completely overridden) approach. But there are thread safety concerns due to the introduction of global state.

If the 4th option is unacceptable, I lean towards the option 3a (1 and 2a), because the amount of implementation work should not be much greater than 1 alone or 2a alone. If the 4th option is unacceptable, I suppose a separate review request based on the current LLVM main branch should be created and this one closed, right?

After a discussion under the corresponding KDevelop merge request, I can see 4-6 alternative ways to address the temporary directory issue:

  1. Add an option to store the preamble-*.pch files in RAM instead of /tmp and add a corresponding option in KDevelop configuration UI. This would work perfectly for me, provided I don't change my mind and decide to turn this option off, in which case I'll be back to square one.
  2. Add an option to store the preamble-*.pch files in a custom directory instead of a general temporary directory. The option could be named generally (2a: overrideTemporaryDirectory) or specially (2b: setPreambleStoragePath). If the option is named generally, other temporary files created by libclang could be identified in the future and placed in the same directory without changing the API.
  3. 1 and 2 - the options can be propagated between identical end points together, so this combination is natural.
  4. The current patch. This is the easiest (already implemented) and most reliable (the temporary directory is definitely and completely overridden) approach. But there are thread safety concerns due to the introduction of global state.

If the 4th option is unacceptable, I lean towards the option 3a (1 and 2a), because the amount of implementation work should not be much greater than 1 alone or 2a alone. If the 4th option is unacceptable, I suppose a separate review request based on the current LLVM main branch should be created and this one closed, right?

From a design perspective, my preference is to not add new APIs at the file system support layer in LLVM to override the temporary directory but instead allow individual components to override the decision where to put files. Overriding a system directory at the LLVM support level feels unclean to me because 1) threading concerns (mostly related to lock performance), 2) consistency concerns (put files in temp directory, override temp directory, put additional files in the new directory), 3) principle of least surprise. To me, the LLVM layer is responsible for interfacing with the system and asking for the system temp directory should get you the same answer as querying for that from the OS directly.

So my preference is for components to have an option to pick a different location as part of their API layer, rather than at the lower level. Based on that, I'm definitely in support of #1, and I'm in support of one of the options for #2. I lean towards #2b over #2a due to wanting individual overrides rather than a blanket override (e.g., we should be able to put preamble files in a different location than we put, say, crash logs).

@dblaikie does that sound reasonable to you?

vedgy added a comment.EditedJan 19 2023, 8:49 AM

I lean towards #2b over #2a due to wanting individual overrides rather than a blanket override (e.g., we should be able to put preamble files in a different location than we put, say, crash logs).

Crash logs don't really belong to a common temporary directory. Are they actually placed into a temporary directory erased on reboot? I'd prefer coarser categories, like temp dir erased-on-reboot and not-erased-on-reboot, similar to void system_temp_directory(bool erasedOnReboot in Path.h. This definitely makes more sense to the use case of removing these temporary files on next application start. Why would a libclang-using program care about various purposes of its internally-used temporary files?

A practical disadvantage of individual overrides: each new override requires implementing new API; every libclang user interested in overriding the temp dir has to adapt and use the new API.

After a discussion under the corresponding KDevelop merge request, I can see 4-6 alternative ways to address the temporary directory issue:

  1. Add an option to store the preamble-*.pch files in RAM instead of /tmp and add a corresponding option in KDevelop configuration UI. This would work perfectly for me, provided I don't change my mind and decide to turn this option off, in which case I'll be back to square one.
  2. Add an option to store the preamble-*.pch files in a custom directory instead of a general temporary directory. The option could be named generally (2a: overrideTemporaryDirectory) or specially (2b: setPreambleStoragePath). If the option is named generally, other temporary files created by libclang could be identified in the future and placed in the same directory without changing the API.
  3. 1 and 2 - the options can be propagated between identical end points together, so this combination is natural.
  4. The current patch. This is the easiest (already implemented) and most reliable (the temporary directory is definitely and completely overridden) approach. But there are thread safety concerns due to the introduction of global state.

If the 4th option is unacceptable, I lean towards the option 3a (1 and 2a), because the amount of implementation work should not be much greater than 1 alone or 2a alone. If the 4th option is unacceptable, I suppose a separate review request based on the current LLVM main branch should be created and this one closed, right?

From a design perspective, my preference is to not add new APIs at the file system support layer in LLVM to override the temporary directory but instead allow individual components to override the decision where to put files. Overriding a system directory at the LLVM support level feels unclean to me because 1) threading concerns (mostly related to lock performance), 2) consistency concerns (put files in temp directory, override temp directory, put additional files in the new directory), 3) principle of least surprise. To me, the LLVM layer is responsible for interfacing with the system and asking for the system temp directory should get you the same answer as querying for that from the OS directly.

I've mixed feelings about this, but yeah, I guess the issues with global state are pretty undesirable. I don't worry too much about (2) - doesn't feel too problematic to me (for any individual file, presumably the implementation recorded the name of the file if they were going to access it again - so still be using the old path if necessary).

But, yeah, lack of any "system" abstraction that could be passed into all the various LLVM/MC/AST contexts to swap out the implementation limits the options a bit to more ad-hoc/case-by-case solutions. I feel like that's not a great solution to KDevelop's general problem, though - they're dealing with one particularly big file, but it doesn't feel very principled to me to fix that one compared to all the other (albeit smaller) temp files & maybe at some point in the future some bigger temp files are created elsewhere...

So my preference is for components to have an option to pick a different location as part of their API layer, rather than at the lower level. Based on that, I'm definitely in support of #1, and I'm in support of one of the options for #2. I lean towards #2b over #2a due to wanting individual overrides rather than a blanket override (e.g., we should be able to put preamble files in a different location than we put, say, crash logs).

@dblaikie does that sound reasonable to you?

(1) seems OK-ish, I guess there's some question as to what the tradeoff is for that option - does that blow out memory usage of the client/kdevelop? But I guess it's probably fine.

vedgy added a comment.EditedJan 19 2023, 10:58 AM

(1) seems OK-ish, I guess there's some question as to what the tradeoff is for that option - does that blow out memory usage of the client/kdevelop? But I guess it's probably fine.

At least one KDevelop user who has limited RAM and keeps /tmp on disk (not tmpfs) protested vehemently against unconditional storing of the temporary files in RAM. Several gigabytes of RAM can be valuable on some systems. Storing huge files in a temporary directory is definitely more flexible. But to users who keep /tmp on tmpfs, storing the preamble files in RAM is preferable, because memory is always freed right after a KDevelop crash, not on next start of the same KDevelop session (which can occur much later).

That is why I plan to implement the option in libclang and in KDevelop UI. I also thought about auto-detecting whether /tmp resides in tmpfs, but that's an application's, not libclang's concern. Making this behavior optional in libclang would be consistent with clangd.

I've mixed feelings about this, but yeah, I guess the issues with global state are pretty undesirable. I don't worry too much about (2) - doesn't feel too problematic to me (for any individual file, presumably the implementation recorded the name of the file if they were going to access it again - so still be using the old path if necessary).

FWIW, #2 relates to #3 for me -- it's not that I worry the compiler will guess the wrong path, it's more that I worry users will get confused because some temp files go in one place while other temp files go in another place, and they file issues we track down to timing issues.

But, yeah, lack of any "system" abstraction that could be passed into all the various LLVM/MC/AST contexts to swap out the implementation limits the options a bit to more ad-hoc/case-by-case solutions. I feel like that's not a great solution to KDevelop's general problem, though - they're dealing with one particularly big file, but it doesn't feel very principled to me to fix that one compared to all the other (albeit smaller) temp files & maybe at some point in the future some bigger temp files are created elsewhere...

So my preference is for components to have an option to pick a different location as part of their API layer, rather than at the lower level. Based on that, I'm definitely in support of #1, and I'm in support of one of the options for #2. I lean towards #2b over #2a due to wanting individual overrides rather than a blanket override (e.g., we should be able to put preamble files in a different location than we put, say, crash logs).

@dblaikie does that sound reasonable to you?

(1) seems OK-ish, I guess there's some question as to what the tradeoff is for that option - does that blow out memory usage of the client/kdevelop? But I guess it's probably fine.

Do you think we should do one of the options for (2) or do you think (1) should be sufficient?

I've mixed feelings about this, but yeah, I guess the issues with global state are pretty undesirable. I don't worry too much about (2) - doesn't feel too problematic to me (for any individual file, presumably the implementation recorded the name of the file if they were going to access it again - so still be using the old path if necessary).

FWIW, #2 relates to #3 for me -- it's not that I worry the compiler will guess the wrong path, it's more that I worry users will get confused because some temp files go in one place while other temp files go in another place, and they file issues we track down to timing issues.

I'd be willing to roll the dice on that - my guess is it wouldn't be too bad - but just guessing. (as all good/difficult design discussions are - debating predictions of the future based on our respective experience)

But, yeah, lack of any "system" abstraction that could be passed into all the various LLVM/MC/AST contexts to swap out the implementation limits the options a bit to more ad-hoc/case-by-case solutions. I feel like that's not a great solution to KDevelop's general problem, though - they're dealing with one particularly big file, but it doesn't feel very principled to me to fix that one compared to all the other (albeit smaller) temp files & maybe at some point in the future some bigger temp files are created elsewhere...

So my preference is for components to have an option to pick a different location as part of their API layer, rather than at the lower level. Based on that, I'm definitely in support of #1, and I'm in support of one of the options for #2. I lean towards #2b over #2a due to wanting individual overrides rather than a blanket override (e.g., we should be able to put preamble files in a different location than we put, say, crash logs).

@dblaikie does that sound reasonable to you?

(1) seems OK-ish, I guess there's some question as to what the tradeoff is for that option - does that blow out memory usage of the client/kdevelop? But I guess it's probably fine.

Do you think we should do one of the options for (2) or do you think (1) should be sufficient?

If (1) is sufficient for KDevelop's needs and already implemented in some form for clangd, if I understand correctly, that seems the cheapest/least involved?

(1) seems OK-ish, I guess there's some question as to what the tradeoff is for that option - does that blow out memory usage of the client/kdevelop? But I guess it's probably fine.

Do you think we should do one of the options for (2) or do you think (1) should be sufficient?

If (1) is sufficient for KDevelop's needs and already implemented in some form for clangd, if I understand correctly, that seems the cheapest/least involved?

Not sufficient for all KDevelop users. Namely it doesn't work for low-RAM systems where /tmp is on disk to save RAM.

The bool (1) and the path (2) options can be passed through API layers together in a struct. They can both be named generally (preferStoringTempFilesInMemory, setTemporaryDirectory) or specifically (storePreamblesInMemory, setPreambleStoragePath).

Don't let me hold this up - I think it all feels a bit too ad-hoc for my own preferences (feels like there should be fairly general solutions to this - rather than playing whack-a-mole on only the biggest temporary files - both in terms of the options KDevelop developers have considered/accepted/declined, and in terms of what's being proposed to be added to Clang), but I don't feel strongly enough about any of that to veto/push very hard here.

vedgy added a comment.Jan 20 2023, 1:52 AM

One idea discussed in comments to the KDevelop merge request, which I haven't mentioned here is this: remove the preamble files immediately after creating and opening them. This is safe on Unix-like OSes, because every file that is still open will not actually be deleted but its directory entry cleared. It's an old Unix trick to mark (open) files as volatile. You could do this with any opened file (that you don't want to be able to close and then reopen) immediately after creating it in which case it'll get cleaned up even in case of a hard crash. However, my analysis of clang/lib/Frontend/PrecompiledPreamble.cpp shows that such unlinking isn't straightforward for the preamble files, e.g. because of PrecompiledPreamble::getSize(), which checks the file size at the preamble file path.

Don't let me hold this up - I think it all feels a bit too ad-hoc for my own preferences (feels like there should be fairly general solutions to this - rather than playing whack-a-mole on only the biggest temporary files - both in terms of the options KDevelop developers have considered/accepted/declined, and in terms of what's being proposed to be added to Clang), but I don't feel strongly enough about any of that to veto/push very hard here.

I think the general solution would be what's proposed here -- allow overriding the system temp directory at the LLVM filesystem API level. But that continues to feel wrong to me -- asking for the system temp directory should get you the system temp directory. For example, nothing says that libclang's need for the temp directory is the same as someone else using the LLVM filesystem APIs within the same library but for totally unrelated purposes. It seems like the wrong layer at which to add this logic -- the base API should get you the system directory, and the caller should be responsible for deciding if they wanted something different than that or not. I realize that turns into a bit of a game of whack-a-mole to find all of the places where we make such decisions, but that seems like the more correct approach to me. If we want this to be controlled by the programmer (and it sounds like KDevelop has very good reasons to want this to be controlled), I think it should become a part of the API interface rather than a hidden one-time setter that impacts the entire process.

erichkeane added inline comments.
llvm/lib/Support/Unix/Path.inc
1452

So I was asked to take a look at this, and I believe that changing this function is absolutely the wrong approach. The purpose of this API is to be used to get what the system/terminal owner has set as the temporary directory. Allowing a user of the API to change the meaning of this is improper.

If we want to change where certain files are stored, the logic for that needs to happen at a higher level than this.

Don't let me hold this up - I think it all feels a bit too ad-hoc for my own preferences (feels like there should be fairly general solutions to this - rather than playing whack-a-mole on only the biggest temporary files - both in terms of the options KDevelop developers have considered/accepted/declined, and in terms of what's being proposed to be added to Clang), but I don't feel strongly enough about any of that to veto/push very hard here.

I think the general solution would be what's proposed here -- allow overriding the system temp directory at the LLVM filesystem API level. But that continues to feel wrong to me -- asking for the system temp directory should get you the system temp directory. For example, nothing says that libclang's need for the temp directory is the same as someone else using the LLVM filesystem APIs within the same library but for totally unrelated purposes. It seems like the wrong layer at which to add this logic -- the base API should get you the system directory, and the caller should be responsible for deciding if they wanted something different than that or not. I realize that turns into a bit of a game of whack-a-mole to find all of the places where we make such decisions, but that seems like the more correct approach to me. If we want this to be controlled by the programmer (and it sounds like KDevelop has very good reasons to want this to be controlled), I think it should become a part of the API interface rather than a hidden one-time setter that impacts the entire process.

I think the only problem I have with changing the lower layer is the race condition/making decisions for totally unrelated clients. If there was some "LLVMSystemContext" that was passed into every LLVM API & one client could create/configure that with one temp directory and use it for LLVM, Clang, etc, and another client in the same process could use another, that'd be fine. I guess that would allow the independent configuration, but also the singular configuration when that's what the client desires. But that's impractical/would be too much engineering for this issue.

*shrug* Ah well.

vedgy added a comment.Jan 25 2023, 1:03 AM

Now that a consensus has been reached that this revision is to be discarded, can we please resume the discussion of which alternative should be implemented in the replacement revision?

Now that a consensus has been reached that this revision is to be discarded, can we please resume the discussion of which alternative should be implemented in the replacement revision?

Based on the discussion here, I think the preference is to modify FileSystemOptions to store an override for the temp directory, and updating APIs so that the temp directory can be specified by a new CIndex constructor function. (If anyone disagrees with that as the approach we'd like to take, please speak up!)

vedgy added a comment.EditedJan 25 2023, 10:51 PM

3 out of 4 alternatives remain:

  1. Add an option to store the preamble-*.pch files in RAM instead of /tmp and add a corresponding option in KDevelop configuration UI. This would work perfectly for me, provided I don't change my mind and decide to turn this option off, in which case I'll be back to square one.
  2. Add an option to store the preamble-*.pch files in a custom directory instead of a general temporary directory. The option could be named generally (2a: overrideTemporaryDirectory) or specially (2b: setPreambleStoragePath). If the option is named generally, other temporary files created by libclang could be identified in the future and placed in the same directory without changing the API.
  3. 1 and 2 - the options can be propagated between identical end points together, so this combination is natural.

I think #3 is the way to go. @aaron.ballman has agreed to this.

#3a vs #3b hasn't been decided upon yet:

The bool (1) and the path (2) options can be passed through API layers together in a struct. They can both be named generally (preferStoringTempFilesInMemory, setTemporaryDirectory) or specifically (storePreamblesInMemory, setPreambleStoragePath).

@aaron.ballman has put forth arguments in favor of specific names and separate API for different temporary directory uses. I have replied with counterarguments in favor of general temporary storage API.

modify FileSystemOptions to store an override for the temp directory

I have mentioned that FileSystemOptions is widely used and only class ASTUnit would need the temp directory (and possibly a flag whether to prefer RAM storage). The wide usage in itself is not considered a reason not to add data members.

ASTUnit::FileSystemOpts is used in several places in ASTUnit.cpp:

FileSystemOpts = Clang->getFileSystemOpts();
AST->FileSystemOpts = CI->getFileSystemOpts();
AST->FileMgr = new FileManager(AST->FileSystemOpts, VFS);
AST->FileSystemOpts = FileMgr->getFileSystemOpts();

I don't know whether propagating the new options to and from CompilerInstance, CompilerInvocation and FileManager is a good idea. As of now, only ASTUnit::getMainBufferWithPrecompiledPreamble() is going to use the new options.

and updating APIs so that the temp directory can be specified by a new CIndex constructor function.

Now that the temporary directory options are going to be propagated explicitly, I am no longer sure another constructor function is preferable to separate option setter(s). I don't see any use for temporary directory location in the definition of clang_createIndex(). And modifying the temporary directory location multiple times during program execution is no longer a problem: an updated location will be used for new preambles. The store-in-memory bool option in particular should be possible to modify at any time, because it can be an option in an IDE's UI.

3 out of 4 alternatives remain:

  1. Add an option to store the preamble-*.pch files in RAM instead of /tmp and add a corresponding option in KDevelop configuration UI. This would work perfectly for me, provided I don't change my mind and decide to turn this option off, in which case I'll be back to square one.
  2. Add an option to store the preamble-*.pch files in a custom directory instead of a general temporary directory. The option could be named generally (2a: overrideTemporaryDirectory) or specially (2b: setPreambleStoragePath). If the option is named generally, other temporary files created by libclang could be identified in the future and placed in the same directory without changing the API.
  3. 1 and 2 - the options can be propagated between identical end points together, so this combination is natural.

I think #3 is the way to go. @aaron.ballman has agreed to this.

I can live with it.

#3a vs #3b hasn't been decided upon yet:

The bool (1) and the path (2) options can be passed through API layers together in a struct. They can both be named generally (preferStoringTempFilesInMemory, setTemporaryDirectory) or specifically (storePreamblesInMemory, setPreambleStoragePath).

@aaron.ballman has put forth arguments in favor of specific names and separate API for different temporary directory uses. I have replied with counterarguments in favor of general temporary storage API.

My preference is still for specific API names. If we want to do something general to all temp files, I think FileSystemOptions is the way to go.

modify FileSystemOptions to store an override for the temp directory

I have mentioned that FileSystemOptions is widely used and only class ASTUnit would need the temp directory (and possibly a flag whether to prefer RAM storage). The wide usage in itself is not considered a reason not to add data members.

ASTUnit::FileSystemOpts is used in several places in ASTUnit.cpp:

FileSystemOpts = Clang->getFileSystemOpts();
AST->FileSystemOpts = CI->getFileSystemOpts();
AST->FileMgr = new FileManager(AST->FileSystemOpts, VFS);
AST->FileSystemOpts = FileMgr->getFileSystemOpts();

I don't know whether propagating the new options to and from CompilerInstance, CompilerInvocation and FileManager is a good idea. As of now, only ASTUnit::getMainBufferWithPrecompiledPreamble() is going to use the new options.

and updating APIs so that the temp directory can be specified by a new CIndex constructor function.

Now that the temporary directory options are going to be propagated explicitly, I am no longer sure another constructor function is preferable to separate option setter(s). I don't see any use for temporary directory location in the definition of clang_createIndex(). And modifying the temporary directory location multiple times during program execution is no longer a problem: an updated location will be used for new preambles. The store-in-memory bool option in particular should be possible to modify at any time, because it can be an option in an IDE's UI.

My concern with not using a constructor is that, because this is the C API, we will be supporting the functionality for a long time even if we do switch to using a global override in FileSystemOptions where you would need to set the information up before executing the compiler. To me, these paths are something that should not be possible to change once the compiler has started executing. (That also solves the multithreading problem where multiple threads are fighting over what the temp directory should be and stepping on each other's toes.)

vedgy added a comment.Jan 30 2023, 9:59 PM

My preference is still for specific API names. If we want to do something general to all temp files, I think FileSystemOptions is the way to go.

My concern with not using a constructor is that, because this is the C API, we will be supporting the functionality for a long time even if we do switch to using a global override in FileSystemOptions where you would need to set the information up before executing the compiler. To me, these paths are something that should not be possible to change once the compiler has started executing. (That also solves the multithreading problem where multiple threads are fighting over what the temp directory should be and stepping on each other's toes.)

As I understand, this leaves two choices:

  1. Specific preamble API names, two separate non-constructor setters; the option values are stored in a separate struct (or even as two separate data members), not inside FileSystemOptions.
  2. General temporary-storage arguments (possibly combined in a struct) to a new overloaded constructor function; the option values are stored inside the FileSystemOptions struct.

The second alternative is likely more difficult to implement, more risky and less convenient to use (the store-in-memory bool option cannot be modified at any time). Perhaps it should be delayed until (and if) we learn of other temporary files libclang creates? A downside of implementing the first option now is that the specific API would have to be supported for a long time, even after the general temporary-file API is implemented.

My preference is still for specific API names. If we want to do something general to all temp files, I think FileSystemOptions is the way to go.

My concern with not using a constructor is that, because this is the C API, we will be supporting the functionality for a long time even if we do switch to using a global override in FileSystemOptions where you would need to set the information up before executing the compiler. To me, these paths are something that should not be possible to change once the compiler has started executing. (That also solves the multithreading problem where multiple threads are fighting over what the temp directory should be and stepping on each other's toes.)

As I understand, this leaves two choices:

  1. Specific preamble API names, two separate non-constructor setters; the option values are stored in a separate struct (or even as two separate data members), not inside FileSystemOptions.
  2. General temporary-storage arguments (possibly combined in a struct) to a new overloaded constructor function; the option values are stored inside the FileSystemOptions struct.

The second alternative is likely more difficult to implement, more risky and less convenient to use (the store-in-memory bool option cannot be modified at any time). Perhaps it should be delayed until (and if) we learn of other temporary files libclang creates? A downside of implementing the first option now is that the specific API would have to be supported for a long time, even after the general temporary-file API is implemented.

I still think the general solution is where we ultimately want to get to and that makes me hesitant to go with specific preamble API names despite that being the direction you prefer. If this wasn't the C APIs, I'd be less concerned because we make no stability guarantees about our C++ interfaces. But we try really hard not to break the C APIs, so adding the specific names is basically committing to not only the APIs but their semantics. I think that makes implementing the general solution slightly more awkward because these are weird special cases that barely get tested in-tree, so they'd be very easy to overlook and accidentally break.

Is there a middle ground where, instead of #2 for general temporary storage, we went with #2 but with compiler-specific directories instead of system directories. e.g., don't let the caller set the temp directory, but do let the caller set the preamble directory (which defaults to the temp directory) so long as it's set before invoking the compiler? This still won't let you change options mid-run, but it also seems like it should have less risk of affecting other components while still solving the thread safety issues. I'm not certain if it's any easier to implement, but I think it does save you from modifying FileSystemOptions. As a separate item, we could then consider adding a new C API to let you toggle the in-memory vs on-disk functionality after exploring that it won't cause other problems because nobody considered the possibility that it's not a stable value for the compiler invocation.

vedgy added a comment.Jan 31 2023, 8:43 PM

Is there a middle ground where, instead of #2 for general temporary storage, we went with #2 but with compiler-specific directories instead of system directories. e.g., don't let the caller set the temp directory, but do let the caller set the preamble directory (which defaults to the temp directory) so long as it's set before invoking the compiler? This still won't let you change options mid-run, but it also seems like it should have less risk of affecting other components while still solving the thread safety issues. I'm not certain if it's any easier to implement, but I think it does save you from modifying FileSystemOptions. As a separate item, we could then consider adding a new C API to let you toggle the in-memory vs on-disk functionality after exploring that it won't cause other problems because nobody considered the possibility that it's not a stable value for the compiler invocation.

OK, so I'm going to implement overriding the preamble directory in clang_createIndexWithPreambleStoragePath (or clang_createIndex2 or clang_createIndexExt or?); try to keep it simple and not modify FileSystemOptions; deal with the in-memory option separately later. Abandon this revision now and create another one once ready?

Is there a middle ground where, instead of #2 for general temporary storage, we went with #2 but with compiler-specific directories instead of system directories. e.g., don't let the caller set the temp directory, but do let the caller set the preamble directory (which defaults to the temp directory) so long as it's set before invoking the compiler? This still won't let you change options mid-run, but it also seems like it should have less risk of affecting other components while still solving the thread safety issues. I'm not certain if it's any easier to implement, but I think it does save you from modifying FileSystemOptions. As a separate item, we could then consider adding a new C API to let you toggle the in-memory vs on-disk functionality after exploring that it won't cause other problems because nobody considered the possibility that it's not a stable value for the compiler invocation.

OK, so I'm going to implement overriding the preamble directory in clang_createIndexWithPreambleStoragePath (or clang_createIndex2 or clang_createIndexExt or?); try to keep it simple and not modify FileSystemOptions; deal with the in-memory option separately later. Abandon this revision now and create another one once ready?

That sounds like a good plan to me. I wonder if we want to name it something like clang_createIndexWithOptions (or something generic like that), and give it a versioned structure of options along the lines of:

struct CIndexOptions {
  uint32_t Size; // sizeof(struct CIndexOptions), used for option versioning
  const char *PreambleStoragePath; // This pointer can be freed after creating the index
};

and define the function to return an error if Size < sizeof(struct CIndexOptions). This should allow us to add additional options to the structure without having to introduce a new constructor API each time.

vedgy added a comment.Feb 1 2023, 7:25 AM

That sounds like a good plan to me. I wonder if we want to name it something like clang_createIndexWithOptions (or something generic like that), and give it a versioned structure of options along the lines of:

struct CIndexOptions {
  uint32_t Size; // sizeof(struct CIndexOptions), used for option versioning
  const char *PreambleStoragePath; // This pointer can be freed after creating the index
};

and define the function to return an error if Size < sizeof(struct CIndexOptions). This should allow us to add additional options to the structure without having to introduce a new constructor API each time.

Would this be the recommended usage of such an API?

  1. Call clang_createIndexWithOptions.
  2. If it returns nullptr index, report an error in the application (e.g. print a warning or show an error in the UI) and fall back to code paths that support older Clang versions, beginning with calling the older constructor clang_createIndex.

Is assigning sizeof(CIndexOptions) to Size the API user's responsibility or should libclang define an inline function clang_initializeCIndexOptions?

That sounds like a good plan to me. I wonder if we want to name it something like clang_createIndexWithOptions (or something generic like that), and give it a versioned structure of options along the lines of:

struct CIndexOptions {
  uint32_t Size; // sizeof(struct CIndexOptions), used for option versioning
  const char *PreambleStoragePath; // This pointer can be freed after creating the index
};

and define the function to return an error if Size < sizeof(struct CIndexOptions). This should allow us to add additional options to the structure without having to introduce a new constructor API each time.

Would this be the recommended usage of such an API?

  1. Call clang_createIndexWithOptions.
  2. If it returns nullptr index, report an error in the application (e.g. print a warning or show an error in the UI) and fall back to code paths that support older Clang versions, beginning with calling the older constructor clang_createIndex.

Yeah, I would imagine robust code to look like:

CIndexOptions Opts;
Opts.Size = sizeof(Opts);
Opts.PreambleStoragePath = somePathPtr;
CIndex Idx = clang_createIndexWithOptions(/*excludeDeclsFromPCH=*/1, /*displayDiagnostics=*/1, &Opts);
if (!Idx) {
  Idx = clang_createIndex(/*excludeDeclsFromPCH=*/1, /*displayDiagnostics=*/1);
  if (!Idx)
    handleError();
}

Is assigning sizeof(CIndexOptions) to Size the API user's responsibility or should libclang define an inline function clang_initializeCIndexOptions?

The user's responsibility. There's three scenarios when a field is added to the structure: 1) library and app are matching versions, 2) library is newer than app, 3) app is newer than library. #1 is the happy path most folks will hopefully be on. For #2 and #3, the app will either be sending more or less data than the library expects, but the library will know how much of the structure is valid based on the size field. If the size is too small and the library can't get data it needs, it can catch that and report an error instead of reading past a buffer. And if the size is too large, the library can ignore anything it doesn't know about or it can give an error due to not knowing how to support the semantics of the newer interface.

vedgy added a comment.Feb 1 2023, 7:55 AM

There's three scenarios when a field is added to the structure: 1) library and app are matching versions, 2) library is newer than app, 3) app is newer than library. #1 is the happy path most folks will hopefully be on. For #2 and #3, the app will either be sending more or less data than the library expects, but the library will know how much of the structure is valid based on the size field. If the size is too small and the library can't get data it needs, it can catch that and report an error instead of reading past a buffer. And if the size is too large, the library can ignore anything it doesn't know about or it can give an error due to not knowing how to support the semantics of the newer interface.

In the second case, the library ideally should assume that the missing struct members are not specified and behave as the corresponding older version. In the third case, the app can support older libclang versions by passing successively smaller structs until libclang returns a valid CIndex.

vedgy added a comment.Feb 1 2023, 8:01 AM

Perhaps the struct should contain the value of CINDEX_VERSION_MINOR instead of the sizeof? This way, libclang can change the meaning of a struct member without changing the size of the struct. For example, replace PreambleStoragePath with TemporaryDirectoryPath. Such a change could even be quiet, because setting the more general temporary directory path would likely be welcome or at least acceptable to the API users.

There's three scenarios when a field is added to the structure: 1) library and app are matching versions, 2) library is newer than app, 3) app is newer than library. #1 is the happy path most folks will hopefully be on. For #2 and #3, the app will either be sending more or less data than the library expects, but the library will know how much of the structure is valid based on the size field. If the size is too small and the library can't get data it needs, it can catch that and report an error instead of reading past a buffer. And if the size is too large, the library can ignore anything it doesn't know about or it can give an error due to not knowing how to support the semantics of the newer interface.

In the second case, the library ideally should assume that the missing struct members are not specified and behave as the corresponding older version.

I think agree.

In the third case, the app can support older libclang versions by passing successively smaller structs until libclang returns a valid CIndex.

That's how it tends to work when I've used APIs like this before (this is a pretty common pattern in the Win32 C APIs).

Perhaps the struct should contain the value of CINDEX_VERSION_MINOR instead of the sizeof? This way, libclang can change the meaning of a struct member without changing the size of the struct. For example, replace PreambleStoragePath with TemporaryDirectoryPath. Such a change could even be quiet, because setting the more general temporary directory path would likely be welcome or at least acceptable to the API users.

We try to keep libclang APIs source and ABI stable; changing the name or meaning of the structure member would defeat source stability. We could still use the minor version regardless, but then the minor version must continue to be monotonic even if we bump the major version (which we said we expect not to do, but "expect not to" != "will never").

vedgy added a comment.EditedFeb 2 2023, 12:11 AM

Another disadvantage of using CINDEX_VERSION_MINOR instead of the sizeof is that an older libclang version won't know about compatibility of newer versions with itself. But this reasoning brought me to a different thought: when a program is compiled against a libclang version X, it should not be expected to be runtime-compatible with a libclang version older than X. For example, KDevelop uses CINDEX_VERSION_MINOR to decide whether or not certain libclang API is available.

I suspect modifying the exposed struct's size will violate ODR in C++ programs that use libclang. Quote from the cppreference ODR page:

Note: in C, there is no program-wide ODR for types, and even extern declarations of the same variable in different translation units may have different types as long as they are compatible. In C++, the source-code tokens used in declarations of the same type must be the same as described above: if one .cpp file defines struct S { int x; }; and the other .cpp file defines struct S { int y; };, the behavior of the program that links them together is undefined. This is usually resolved with unnamed namespaces.

I think the answer to this StackOverflow question confirms my suspicion.

In order to avoid the undefined behavior, libclang can use the same approach as for other structs/classes: make the struct opaque and provide functions clang_createIndexOptions(), clang_disposeIndexOptions(), clang_setTempDirPath(CXIndexOptions options, const char *path).

int excludeDeclarationsFromPCH and int displayDiagnostics currently passed to clang_createIndex() can also be included in the struct and acquire their own setter functions. Then only a single argument will be passed to clang_createIndexWithOptions(): CXIndexOptions.


My preference is still for specific API names. If we want to do something general to all temp files, I think FileSystemOptions is the way to go.

My concern with not using a constructor is that, because this is the C API, we will be supporting the functionality for a long time even if we do switch to using a global override in FileSystemOptions where you would need to set the information up before executing the compiler. To me, these paths are something that should not be possible to change once the compiler has started executing. (That also solves the multithreading problem where multiple threads are fighting over what the temp directory should be and stepping on each other's toes.)

As I understand, this leaves two choices:

  1. Specific preamble API names, two separate non-constructor setters; the option values are stored in a separate struct (or even as two separate data members), not inside FileSystemOptions.
  2. General temporary-storage arguments (possibly combined in a struct) to a new overloaded constructor function; the option values are stored inside the FileSystemOptions struct.

The second alternative is likely more difficult to implement, more risky and less convenient to use (the store-in-memory bool option cannot be modified at any time). Perhaps it should be delayed until (and if) we learn of other temporary files libclang creates? A downside of implementing the first option now is that the specific API would have to be supported for a long time, even after the general temporary-file API is implemented.

I still think the general solution is where we ultimately want to get to and that makes me hesitant to go with specific preamble API names despite that being the direction you prefer. If this wasn't the C APIs, I'd be less concerned because we make no stability guarantees about our C++ interfaces. But we try really hard not to break the C APIs, so adding the specific names is basically committing to not only the APIs but their semantics. I think that makes implementing the general solution slightly more awkward because these are weird special cases that barely get tested in-tree, so they'd be very easy to overlook and accidentally break.

Is there a middle ground where, instead of #2 for general temporary storage, we went with #2 but with compiler-specific directories instead of system directories. e.g., don't let the caller set the temp directory, but do let the caller set the preamble directory (which defaults to the temp directory) so long as it's set before invoking the compiler? This still won't let you change options mid-run, but it also seems like it should have less risk of affecting other components while still solving the thread safety issues. I'm not certain if it's any easier to implement, but I think it does save you from modifying FileSystemOptions. As a separate item, we could then consider adding a new C API to let you toggle the in-memory vs on-disk functionality after exploring that it won't cause other problems because nobody considered the possibility that it's not a stable value for the compiler invocation.

After thinking some more, I'd like to revisit this decision. You agree that the general solution setTempDirPath() is an ultimate goal. But the rejection of tampering with the return value of system_temp_directory() means that overriding the temporary path will likely never be perfectly reliable. So how about a more sincere general solution: setPreferredTempDirPath()? The documentation would say that libclang tries its best to place temporary files into the specified directory, but some might end up in the system temporary directory instead.

This non-strict Preferred contract opens a path to design simplification. [correct me if I am wrong] clang_createIndex() does not need and is not likely to need a temporary directory path itself. So the documentation can recommend calling a non-constructor clang_CXIndex_setPreferredTempDirPath(CXIndex, const char *path) function right after the call to clang_createIndex(). And possibly caution that calling clang_CXIndex_setPreferredTempDirPath() later can result in fewer temporary files ending up in the overridden directory. If clang_createIndex()does need the temporary directory path in the future (unlikely I think), then the discussed overload of the CXIndex constructor function can be added. Such an API would be consistent with an existing libclang function:

/**
 * Sets the invocation emission path option in a CXIndex.
 *
 * The invocation emission path specifies a path which will contain log
 * files for certain libclang invocations. A null value (default) implies that
 * libclang invocations are not logged..
 */
CINDEX_LINKAGE void
clang_CXIndex_setInvocationEmissionPathOption(CXIndex, const char *Path);

The FileSystemOptions struct is an implementation detail not exposed in libclang. So the temporary directory path can be stored outside of it for now, if it simplifies the implementation. If and when the temporary directory path becomes more widely used and has to be passed to more classes, it could be moved inside the FileSystemOptions struct without modifying the libclang API.

Another disadvantage of using CINDEX_VERSION_MINOR instead of the sizeof is that an older libclang version won't know about compatibility of newer versions with itself. But this reasoning brought me to a different thought: when a program is compiled against a libclang version X, it should not be expected to be runtime-compatible with a libclang version older than X. For example, KDevelop uses CINDEX_VERSION_MINOR to decide whether or not certain libclang API is available.

I suspect modifying the exposed struct's size will violate ODR in C++ programs that use libclang. Quote from the cppreference ODR page:

Note: in C, there is no program-wide ODR for types, and even extern declarations of the same variable in different translation units may have different types as long as they are compatible. In C++, the source-code tokens used in declarations of the same type must be the same as described above: if one .cpp file defines struct S { int x; }; and the other .cpp file defines struct S { int y; };, the behavior of the program that links them together is undefined. This is usually resolved with unnamed namespaces.

I think the answer to this StackOverflow question confirms my suspicion.

Oh wow, I know the person who asked that question 12 years ago, that's always a fun thing to run into randomly. :-D

Given that this idiom (using size of a structure to version it) is used in practice all over the place, I'm not really worried about violating the ODR. Because the library is using the size information provided to determine what's safe to access, the only actionable UB comes from setting the size field incorrectly because then the library may access out of bounds memory. But an optimizer has no real opportunity for optimization here because we're talking about a structure being passed across a DSO boundary (so not even the linker sees both the library and the application in this scenario, so LTO also doesn't come into play).

(If this idiom wasn't so entrenched in industry, we could use a tagged union instead, but conceptually that's an identical solution to this one because the Size field is acting as the tag to determine which union members are accessible. I don't think we need to go to that much trouble though.)

vedgy abandoned this revision.Feb 6 2023, 9:54 AM

D143418 implements my latest idea described in the previous comment. Let us continue the discussion there.