This is an archive of the discontinued LLVM Phabricator instance.

[libclang] Add API to override preamble storage path
ClosedPublic

Authored by vedgy on Feb 6 2023, 9:51 AM.

Details

Summary

TempPCHFile::create() calls llvm::sys::fs::createTemporaryFile() to
create a file named preamble-*.pch in a system temporary directory. This
commit allows overriding the directory where these often many and large
preamble-*.pch files are stored.

The referenced bug report requests the ability to override the temporary
directory path used by libclang. However, overriding the return value of
llvm::sys::path::system_temp_directory() was rejected during code review
as improper and because it would negatively affect multithreading
performance. Finding all places where libclang uses the temporary
directory is very difficult. Therefore this commit is limited to
override libclang's single known use of the temporary directory.

This commit allows to override the preamble storage path only during
CXIndex construction to avoid multithreading issues and ensure that all
preambles are stored in the same directory. For the same multithreading
and consistency reasons, this commit deprecates
clang_CXIndex_setGlobalOptions() and
clang_CXIndex_setInvocationEmissionPathOption() in favor of specifying
these options during CXIndex construction.

Adding a new CXIndex constructor function each time a new initialization
argument is needed leads to either a large number of function parameters
unneeded by most libclang users or to an exponential number of overloads
that support different usage requirements. Therefore this commit
introduces a new extensible struct CXIndexOptions and a general function
clang_createIndexWithOptions().

A libclang user passes a desired preamble storage path to
clang_createIndexWithOptions(), which stores it in
CIndexer::PreambleStoragePath. Whenever
clang_parseTranslationUnit_Impl() is called, it passes
CIndexer::PreambleStoragePath to ASTUnit::LoadFromCommandLine(), which
stores this argument in ASTUnit::PreambleStoragePath. Whenever
ASTUnit::getMainBufferWithPrecompiledPreamble() is called, it passes
ASTUnit::PreambleStoragePath to PrecompiledPreamble::Build().
PrecompiledPreamble::Build() forwards the corresponding StoragePath
argument to TempPCHFile::create(). If StoragePath is not empty,
TempPCHFile::create() stores the preamble-*.pch file in the directory at
the specified path rather than in the system temporary directory.

The analysis below proves that this passing around of the
PreambleStoragePath string is sufficient to guarantee that the libclang
user override is used in TempPCHFile::create(). The analysis ignores API
uses in test code.

TempPCHFile::create() is called only in PrecompiledPreamble::Build().
PrecompiledPreamble::Build() is called only in two places: one in
clangd, which is not used by libclang, and one in
ASTUnit::getMainBufferWithPrecompiledPreamble().
ASTUnit::getMainBufferWithPrecompiledPreamble() is called in 3 places:

  1. ASTUnit::LoadFromCompilerInvocation() [analyzed below].
  2. ASTUnit::Reparse(), which in turn is called only from

clang_reparseTranslationUnit_Impl(), which in turn is called only from
clang_reparseTranslationUnit(). clang_reparseTranslationUnit() is never
called in LLVM code, but is part of public libclang API. This function's
documentation requires its translation unit argument to have been built
with clang_createTranslationUnitFromSourceFile().
clang_createTranslationUnitFromSourceFile() delegates its work to
clang_parseTranslationUnit(), which delegates to
clang_parseTranslationUnit2(), which delegates to
clang_parseTranslationUnit2FullArgv(), which delegates to
clang_parseTranslationUnit_Impl(), which passes
CIndexer::PreambleStoragePath to the ASTUnit it creates.

  1. ASTUnit::CodeComplete() passes AllowRebuild = false to

ASTUnit::getMainBufferWithPrecompiledPreamble(), which makes it return
nullptr before calling PrecompiledPreamble::Build().

Both ASTUnit::LoadFromCompilerInvocation() overloads (one of which
delegates its work to another) call
ASTUnit::getMainBufferWithPrecompiledPreamble() only if their argument
PrecompilePreambleAfterNParses > 0. LoadFromCompilerInvocation() is
called in:

  1. ASTBuilderAction::runInvocation() keeps the default parameter value

of PrecompilePreambleAfterNParses = 0, meaning that the preamble file is
never created from here.

  1. ASTUnit::LoadFromCommandLine().

ASTUnit::LoadFromCommandLine() is called in two places:

  1. CrossTranslationUnitContext::ASTLoader::loadFromSource() keeps the

default parameter value of PrecompilePreambleAfterNParses = 0, meaning
that the preamble file is never created from here.

  1. clang_parseTranslationUnit_Impl(), which passes

CIndexer::PreambleStoragePath to the ASTUnit it creates.

Therefore, the overridden preamble storage path is always used in
TempPCHFile::create().

TempPCHFile::create() uses PreambleStoragePath in the same way as
LibclangInvocationReporter() uses InvocationEmissionPath. The existing
documentation for clang_CXIndex_setInvocationEmissionPathOption() does
not specify ownership, encoding, separator or relative vs absolute path
requirements. So the documentation for
CXIndexOptions::PreambleStoragePath doesn't either. The assumptions are:

  • no ownership transfer;
  • UTF-8 encoding;
  • native separators.

Both relative and absolute paths are supported.

The added API works as expected in KDevelop:
https://invent.kde.org/kdevelop/kdevelop/-/merge_requests/283

Fixes: https://github.com/llvm/llvm-project/issues/51847

Diff Detail

Event Timeline

vedgy created this revision.Feb 6 2023, 9:51 AM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 6 2023, 9:51 AM
vedgy requested review of this revision.Feb 6 2023, 9:51 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptFeb 6 2023, 9:51 AM
vedgy updated this revision to Diff 495372.Feb 6 2023, 9:36 PM
vedgy edited the summary of this revision. (Show Details)

Address an old inline review comment
https://reviews.llvm.org/D139774#inline-1360634 and add a comment.

vedgy added a comment.Feb 7 2023, 10:28 PM

I hoped to avoid applying an unrelated git clang-format's include reordering, but the CI fails because of that. Will apply it along with changes requested during code review.

I hoped to avoid applying an unrelated git clang-format's include reordering, but the CI fails because of that. Will apply it along with changes requested during code review.

FWIW, you can ignore any CI failure related to formatting. It's an (annoying) bug that it gets reported as a failure. Our coding style guide prefers matching the surrounding style when it conflicts with the usual coding style, so it's not uncommon to have these kind of fake failures.

From the previous review (https://reviews.llvm.org/D139774):

After thinking some more, I'd like to revisit this decision. You agree that the general solution setTempDirPath() is an ultimate goal. But the rejection of tampering with the return value of system_temp_directory() means that overriding the temporary path will likely never be perfectly reliable.

Just to make sure we're on the same page, the reason we don't want to change system_temp_directory() is because that API is expected to get you the system temp directory, so an override in that function would break expectations. Instead, we want callers of that API to decide whether they really want the system temp directory or whether they want some other directory that perhaps the user provides. That keeps a clean separation of concerns -- the system APIs get you the system information and the things that wish to put files elsewhere can still do so.

So how about a more sincere general solution: setPreferredTempDirPath()? The documentation would say that libclang tries its best to place temporary files into the specified directory, but some might end up in the system temporary directory instead.

I don't think those semantics are reasonable. "Tries its best" is fine for things that are heuristic based, like deciding when to do an analysis cut point, but is too non-deterministic for something like "where do files get placed". I think it's reasonable that we either:

  1. expose an option to say "I prefer preambles to be put in this directory" for libclang, or
  2. expose an option to say "I prefer using this directory over the temp directory for all files" for libclang,

I thought we agreed that #2 is too high of risk because it requires auditing libclang for all the places it decides to put something into the temp directory, and so we were going with #1 and doing it as part of the cindex constructor so that users cannot run into the surprise issues (multithreading, files changing location mid-run) with an independent setter.

Such an API would be consistent with an existing libclang function:

That is interesting to note, but I'm not certain that was a good decision as it still has the same concerns. In the original review for that (https://reviews.llvm.org/D40527), it was observed that libclang is not thread-safe, which is accurate (Clang itself is not particularly thread-safe but there have been occasional efforts to drive us in that direction or at least not drive us away from it), but it didn't discuss the fact that this design potentially leads to files in different locations mid-run. It also didn't get nearly as much review discussion as this API, so there's that. CC @arphaman and @jkorous in case they want to weigh in on the new API, since they've worked in this area.

vedgy added a subscriber: dblaikie.Feb 9 2023, 11:27 PM

So how about a more sincere general solution: setPreferredTempDirPath()? The documentation would say that libclang tries its best to place temporary files into the specified directory, but some might end up in the system temporary directory instead.

I don't think those semantics are reasonable. "Tries its best" is fine for things that are heuristic based, like deciding when to do an analysis cut point, but is too non-deterministic for something like "where do files get placed". I think it's reasonable that we either:

  1. expose an option to say "I prefer preambles to be put in this directory" for libclang, or
  2. expose an option to say "I prefer using this directory over the temp directory for all files" for libclang,

I thought we agreed that #2 is too high of risk because it requires auditing libclang for all the places it decides to put something into the temp directory, and so we were going with #1 and doing it as part of the cindex constructor so that users cannot run into the surprise issues (multithreading, files changing location mid-run) with an independent setter.

This revision implements #2, but warns about possible imperfections of the implementation. KDevelop (and other potential users of the added API that need to override the temp dir location for the same reason) is not interested in overriding each temp dir usage separately. For this only known use case an imperfectly implemented clang_CXIndex_setPreferredTempDirPath() is much more convenient than a perfectly correct clang_CXIndex_setPreambleStoragePath(). With the general API name, if another temp dir use is overridden in the future, neither libclang API nor the code that uses it would have to be adjusted.

From a design perspective, my preference is to not add new APIs at the file system support layer in LLVM to override the temporary directory but instead allow individual components to override the decision where to put files. Overriding a system directory at the LLVM support level feels unclean to me because 1) threading concerns (mostly related to lock performance), 2) consistency concerns (put files in temp directory, override temp directory, put additional files in the new directory), 3) principle of least surprise. To me, the LLVM layer is responsible for interfacing with the system and asking for the system temp directory should get you the same answer as querying for that from the OS directly.

I've mixed feelings about this, but yeah, I guess the issues with global state are pretty undesirable. I don't worry too much about (2) - doesn't feel too problematic to me (for any individual file, presumably the implementation recorded the name of the file if they were going to access it again - so still be using the old path if necessary).

But, yeah, lack of any "system" abstraction that could be passed into all the various LLVM/MC/AST contexts to swap out the implementation limits the options a bit to more ad-hoc/case-by-case solutions. I feel like that's not a great solution to KDevelop's general problem, though - they're dealing with one particularly big file, but it doesn't feel very principled to me to fix that one compared to all the other (albeit smaller) temp files & maybe at some point in the future some bigger temp files are created elsewhere...

Based on this comment and on current preamble storage path code, I thought files-changing-location-mid-run should not be a problem.

Does the multithreading issue mean that clang_parseTranslationUnit_Impl() can be called in a non-main thread and thus concurrently with clang_CXIndex_setPreferredTempDirPath()?

So how about a more sincere general solution: setPreferredTempDirPath()? The documentation would say that libclang tries its best to place temporary files into the specified directory, but some might end up in the system temporary directory instead.

I don't think those semantics are reasonable. "Tries its best" is fine for things that are heuristic based, like deciding when to do an analysis cut point, but is too non-deterministic for something like "where do files get placed". I think it's reasonable that we either:

  1. expose an option to say "I prefer preambles to be put in this directory" for libclang, or
  2. expose an option to say "I prefer using this directory over the temp directory for all files" for libclang,

I thought we agreed that #2 is too high of risk because it requires auditing libclang for all the places it decides to put something into the temp directory, and so we were going with #1 and doing it as part of the cindex constructor so that users cannot run into the surprise issues (multithreading, files changing location mid-run) with an independent setter.

This revision implements #2, but warns about possible imperfections of the implementation.

I don't think "tries hard" is a good enough guarantee for where files are placed. I'm not comfortable with the security posture of that approach as it could potentially leak sensitive information (try to override the temp directory to something that's chroot jailed, get sensitive files showing up in the system temp directory anyway).

From a design perspective, my preference is to not add new APIs at the file system support layer in LLVM to override the temporary directory but instead allow individual components to override the decision where to put files. Overriding a system directory at the LLVM support level feels unclean to me because 1) threading concerns (mostly related to lock performance), 2) consistency concerns (put files in temp directory, override temp directory, put additional files in the new directory), 3) principle of least surprise. To me, the LLVM layer is responsible for interfacing with the system and asking for the system temp directory should get you the same answer as querying for that from the OS directly.

I've mixed feelings about this, but yeah, I guess the issues with global state are pretty undesirable. I don't worry too much about (2) - doesn't feel too problematic to me (for any individual file, presumably the implementation recorded the name of the file if they were going to access it again - so still be using the old path if necessary).

But, yeah, lack of any "system" abstraction that could be passed into all the various LLVM/MC/AST contexts to swap out the implementation limits the options a bit to more ad-hoc/case-by-case solutions. I feel like that's not a great solution to KDevelop's general problem, though - they're dealing with one particularly big file, but it doesn't feel very principled to me to fix that one compared to all the other (albeit smaller) temp files & maybe at some point in the future some bigger temp files are created elsewhere...

Based on this comment and on current preamble storage path code, I thought files-changing-location-mid-run should not be a problem.

Does the multithreading issue mean that clang_parseTranslationUnit_Impl() can be called in a non-main thread and thus concurrently with clang_CXIndex_setPreferredTempDirPath()?

Yes. However, I don't think that situation is officially supported (it's more that we don't want to knowingly make it harder to support in the future).

I don't think "tries hard" is a good enough guarantee for where files are placed. I'm not comfortable with the security posture of that approach as it could potentially leak sensitive information (try to override the temp directory to something that's chroot jailed, get sensitive files showing up in the system temp directory anyway).

That's why the function's documentation explicitly doesn't guarantee anything. It should be safe to assume that security-sensitive users would at least read the documentation. If this function and potential future function like it are named specifically, a responsible security use case wouldn't be better off. The only safety advantage would be to those who don't bother reading the documentation. But why should we care much about such hypothetical careless security needs?

Does the multithreading issue mean that clang_parseTranslationUnit_Impl() can be called in a non-main thread and thus concurrently with clang_CXIndex_setPreferredTempDirPath()?

Yes. However, I don't think that situation is officially supported (it's more that we don't want to knowingly make it harder to support in the future).

All right. Let's do it via a new constructor then. Unfortunately, supporting different CXIndexOptions struct sizes/versions would complicate the libclang implementation and the libclang user code. But the alternative of adding a new constructor function each time can either grow to a large number of function parameters unneeded by most users or to an exponential number of overloads that support different usage requirements.

How about including existing options, which should be set in constructor, in the initial struct version and deprecating the corresponding setters?

typedef struct CXIndexOptions {
  uint32_t size; // sizeof(struct CIndexOptions), used for option versioning
  unsigned globalOptions;
  const char *preferredTempDirPath;
  const char *invocationEmissionPath;
} CXIndexOptions;

The naming of struct data members is inconsistent in libclang's Index.h. They start with a lower-case letter in about half of the structs. Which naming scheme should the members of the new struct use?

I don't think "tries hard" is a good enough guarantee for where files are placed. I'm not comfortable with the security posture of that approach as it could potentially leak sensitive information (try to override the temp directory to something that's chroot jailed, get sensitive files showing up in the system temp directory anyway).

That's why the function's documentation explicitly doesn't guarantee anything. It should be safe to assume that security-sensitive users would at least read the documentation. If this function and potential future function like it are named specifically, a responsible security use case wouldn't be better off. The only safety advantage would be to those who don't bother reading the documentation. But why should we care much about such hypothetical careless security needs?

Security isn't a one-shot thing, but something you want in-depth defense for. Telling users "well you were careless about security so this is your fault" is still user-hostile even if you can point to documentation they should have read. (A whole lot of people use APIs without reading the documentation, unfortunately.)

Does the multithreading issue mean that clang_parseTranslationUnit_Impl() can be called in a non-main thread and thus concurrently with clang_CXIndex_setPreferredTempDirPath()?

Yes. However, I don't think that situation is officially supported (it's more that we don't want to knowingly make it harder to support in the future).

All right. Let's do it via a new constructor then. Unfortunately, supporting different CXIndexOptions struct sizes/versions would complicate the libclang implementation and the libclang user code. But the alternative of adding a new constructor function each time can either grow to a large number of function parameters unneeded by most users or to an exponential number of overloads that support different usage requirements.

How about including existing options, which should be set in constructor, in the initial struct version and deprecating the corresponding setters?

I think that makes a lot of sense.

typedef struct CXIndexOptions {
  uint32_t size; // sizeof(struct CIndexOptions), used for option versioning
  unsigned globalOptions;
  const char *preferredTempDirPath;

In terms of documenting the structure, should we document this member as only impacting preamble files currently, should we rename this to be preamble-path specific, or should we try to use this preferred path in all the places we need the temp directory?

(I lean towards renaming to preamble-path specific -- then users don't have the problem of upgrading to a newer CIndex potentially changing the behavior of where non-preamble files are stored, and we don't have to do the work up-front to audit other temp file uses.)

const char *invocationEmissionPath;

} CXIndexOptions;

The naming of struct data members is inconsistent in libclang's Index.h. They start with a lower-case letter in about half of the structs. Which naming scheme should the members of the new struct use?

Wow, it's almost an even split between styles from what I'm seeing, that's lovely. Let's go with the LLVM coding style recommendation because there's not a clear "winner" used in the same file. (So start with an uppercase letter instead of lower case.)

vedgy added a comment.Feb 13 2023, 9:41 AM

How about including existing options, which should be set in constructor, in the initial struct version and deprecating the corresponding setters?

I think that makes a lot of sense.

How to deprecate the setters? Add DEPRECATED in the documentations as is already done in two places in Index.h?

`const char *preferredTempDirPath;`

In terms of documenting the structure, should we document this member as only impacting preamble files currently, should we rename this to be preamble-path specific, or should we try to use this preferred path in all the places we need the temp directory?

(I lean towards renaming to preamble-path specific -- then users don't have the problem of upgrading to a newer CIndex potentially changing the behavior of where non-preamble files are stored, and we don't have to do the work up-front to audit other temp file uses.)

Looks like an imperfect general implementation is unacceptable. So I'll rename this data member to PreambleStoragePath.

How about including existing options, which should be set in constructor, in the initial struct version and deprecating the corresponding setters?

I think that makes a lot of sense.

How to deprecate the setters? Add DEPRECATED in the documentations as is already done in two places in Index.h?

I think that is reasonable. (Maybe someday in the future we should use attributes to give users better diagnostics, but not as part of this patch.)

`const char *preferredTempDirPath;`

In terms of documenting the structure, should we document this member as only impacting preamble files currently, should we rename this to be preamble-path specific, or should we try to use this preferred path in all the places we need the temp directory?

(I lean towards renaming to preamble-path specific -- then users don't have the problem of upgrading to a newer CIndex potentially changing the behavior of where non-preamble files are stored, and we don't have to do the work up-front to audit other temp file uses.)

Looks like an imperfect general implementation is unacceptable. So I'll rename this data member to PreambleStoragePath.

I like that direction, thank you.

vedgy added a comment.EditedFeb 13 2023, 10:04 PM

uint32_t Size; // = sizeof(struct CIndexOptions), used for option versioning

  1. uint32_t was introduced in C99. Can/should it be used in Index.h? Only built-in [unsigned] (int|long) types are currently used in this file.
  1. Should int excludeDeclarationsFromPCH and int displayDiagnostics currently passed to clang_createIndex() also be included in the struct? Then only a single argument will be passed to clang_createIndexWithOptions(): CXIndexOptions.
  1. clang_createIndex() initializes global options based on environment variable values:
if (getenv("LIBCLANG_BGPRIO_INDEX"))
    CIdxr->setCXGlobalOptFlags(CIdxr->getCXGlobalOptFlags() |
                               CXGlobalOpt_ThreadBackgroundPriorityForIndexing);
  if (getenv("LIBCLANG_BGPRIO_EDIT"))
    CIdxr->setCXGlobalOptFlags(CIdxr->getCXGlobalOptFlags() |
                               CXGlobalOpt_ThreadBackgroundPriorityForEditing);

The recommended in documentation usage of clang_CXIndex_setGlobalOptions is:

* \code
* CXIndex idx = ...;
* clang_CXIndex_setGlobalOptions(idx,
*     clang_CXIndex_getGlobalOptions(idx) |
*     CXGlobalOpt_ThreadBackgroundPriorityForIndexing);
* \endcode

So making these options part of struct CXIndexOptions and deprecating clang_CXIndex_setGlobalOptions requires introducing another global function that would read the environment variables:

CINDEX_LINKAGE unsigned clang_getDefaultGlobalOptions();

Is this the right approach?

uint32_t Size; // = sizeof(struct CIndexOptions), used for option versioning

  1. uint32_t was introduced in C99. Can/should it be used in Index.h? Only built-in [unsigned] (int|long) types are currently used in this file.

That is a really good question. I couldn't spot anything existing in the header file that requires C99 or later (no // comments, no C99 types like _Bool or uint32_t, etc). I think the common pattern would be to use unsigned, which is also variably-sized (as are all the integer types, technically, thanks to CHAR_BIT), but we do have one existing use of size_t in the file, which is probably the best type to use there given that we expect people to assign the results of sizeof into it. WDYT about using size_t?

I don't have the historical context to know whether we expect this header to be C89 compatible, so it's not clear to me how disruptive it would be to use stdint.h. One alternative would be to use stdint.h if it's available (via __has_include) and otherwise fallback on a C89 custom definition of uint32_t, but that strikes me as being more likely to introduce problems on systems we don't test on or for compilers we don't test with.

  1. Should int excludeDeclarationsFromPCH and int displayDiagnostics currently passed to clang_createIndex() also be included in the struct? Then only a single argument will be passed to clang_createIndexWithOptions(): CXIndexOptions.

I think that makes sense to me. It does raise the question of whether we want to pack these boolean-like fields together, as in:

struct CXIndexOptions {
  size_t Size;

  int ExcludeDeclsFromPCH : 1;
  int DisplayDiagnostics : 1;
  int Reserved : 30;

  const char *PreambleStoragePath;
  ...
};

This makes it a little less likely to need to grow the structure when adding new options.

  1. clang_createIndex() initializes global options based on environment variable values:
if (getenv("LIBCLANG_BGPRIO_INDEX"))
    CIdxr->setCXGlobalOptFlags(CIdxr->getCXGlobalOptFlags() |
                               CXGlobalOpt_ThreadBackgroundPriorityForIndexing);
  if (getenv("LIBCLANG_BGPRIO_EDIT"))
    CIdxr->setCXGlobalOptFlags(CIdxr->getCXGlobalOptFlags() |
                               CXGlobalOpt_ThreadBackgroundPriorityForEditing);

The recommended in documentation usage of clang_CXIndex_setGlobalOptions is:

* \code
* CXIndex idx = ...;
* clang_CXIndex_setGlobalOptions(idx,
*     clang_CXIndex_getGlobalOptions(idx) |
*     CXGlobalOpt_ThreadBackgroundPriorityForIndexing);
* \endcode

So making these options part of struct CXIndexOptions and deprecating clang_CXIndex_setGlobalOptions requires introducing another global function that would read the environment variables:

CINDEX_LINKAGE unsigned clang_getDefaultGlobalOptions();

Is this the right approach?

Hmm, to make this patch easier, I think we might want to leave the environment variable behavior alone and not shift these into the options structure (yet?). Naively, I think it makes sense for these to eventually live in the options structure, but we could expose them in a few different ways (an option to prefer the env variable over a manual value as it is today or an option to prefer the manual value over the env variable for folks who want more hermetic behavior). WDYT? My opinion here isn't super strong, so if you have a strong desire to deprecate and add a replacement API, I think that's a defensible course to take.

vedgy added a comment.Feb 14 2023, 8:36 AM

uint32_t Size; // = sizeof(struct CIndexOptions), used for option versioning

  1. uint32_t was introduced in C99. Can/should it be used in Index.h? Only built-in [unsigned] (int|long) types are currently used in this file.

That is a really good question. I couldn't spot anything existing in the header file that requires C99 or later (no // comments, no C99 types like _Bool or uint32_t, etc). I think the common pattern would be to use unsigned, which is also variably-sized (as are all the integer types, technically, thanks to CHAR_BIT), but we do have one existing use of size_t in the file, which is probably the best type to use there given that we expect people to assign the results of sizeof into it. WDYT about using size_t?

I don't have the historical context to know whether we expect this header to be C89 compatible, so it's not clear to me how disruptive it would be to use stdint.h. One alternative would be to use stdint.h if it's available (via __has_include) and otherwise fallback on a C89 custom definition of uint32_t, but that strikes me as being more likely to introduce problems on systems we don't test on or for compilers we don't test with.

size_t indeed makes logical sense for this member as that's the return type of sizeof. size_t is two times larger than unsigned on x86_64, but I don't think the size of this struct has any chance of impacting performance. Though it wouldn't hurt to pack the size and the boolean options in a single-pointer-sized region on x86_64. After all, this struct's size will never reach UINT_MAX. I slightly prefer unsigned due to my efficiency inclinations :). What do you prefer? Is there any benefit in using a fixed-size integer type here?

  1. Should int excludeDeclarationsFromPCH and int displayDiagnostics currently passed to clang_createIndex() also be included in the struct? Then only a single argument will be passed to clang_createIndexWithOptions(): CXIndexOptions.

I think that makes sense to me. It does raise the question of whether we want to pack these boolean-like fields together, as in:

struct CXIndexOptions {
  size_t Size;

  int ExcludeDeclsFromPCH : 1;
  int DisplayDiagnostics : 1;
  int Reserved : 30;

  const char *PreambleStoragePath;
  ...
};

This makes it a little less likely to need to grow the structure when adding new options.

When we add new options, the struct's size must grow in order to distinguish different struct versions and prevent undefined behavior! If a member is added within the same LLVM release, it and other members added in that release can be reordered and packed to minimize the size. For example, I plan to add a bool StorePreamblesInMemory in LLVM 17. While adding it, I can reorder and pack anything I like as the whole struct is introduced in this version. But there is no need to reserve anything, I think. This is assuming we don't need to support compatibility at each intermediate commit of libclang.

  1. clang_createIndex() initializes global options based on environment variable values:
if (getenv("LIBCLANG_BGPRIO_INDEX"))
    CIdxr->setCXGlobalOptFlags(CIdxr->getCXGlobalOptFlags() |
                               CXGlobalOpt_ThreadBackgroundPriorityForIndexing);
  if (getenv("LIBCLANG_BGPRIO_EDIT"))
    CIdxr->setCXGlobalOptFlags(CIdxr->getCXGlobalOptFlags() |
                               CXGlobalOpt_ThreadBackgroundPriorityForEditing);

The recommended in documentation usage of clang_CXIndex_setGlobalOptions is:

* \code
* CXIndex idx = ...;
* clang_CXIndex_setGlobalOptions(idx,
*     clang_CXIndex_getGlobalOptions(idx) |
*     CXGlobalOpt_ThreadBackgroundPriorityForIndexing);
* \endcode

So making these options part of struct CXIndexOptions and deprecating clang_CXIndex_setGlobalOptions requires introducing another global function that would read the environment variables:

CINDEX_LINKAGE unsigned clang_getDefaultGlobalOptions();

Is this the right approach?

Hmm, to make this patch easier, I think we might want to leave the environment variable behavior alone and not shift these into the options structure (yet?). Naively, I think it makes sense for these to eventually live in the options structure, but we could expose them in a few different ways (an option to prefer the env variable over a manual value as it is today or an option to prefer the manual value over the env variable for folks who want more hermetic behavior). WDYT? My opinion here isn't super strong, so if you have a strong desire to deprecate and add a replacement API, I think that's a defensible course to take.

OK, let's skip the global options member for now. I'll add a \todo about this.

size_t indeed makes logical sense for this member as that's the return type of sizeof. size_t is two times larger than unsigned on x86_64, but I don't think the size of this struct has any chance of impacting performance.

Agreed.

Though it wouldn't hurt to pack the size and the boolean options in a single-pointer-sized region on x86_64. After all, this struct's size will never reach UINT_MAX. I slightly prefer unsigned due to my efficiency inclinations :). What do you prefer? Is there any benefit in using a fixed-size integer type here?

I've been trying to think of benefits for using a fixed-size integer type and the closest I can come is the consistency of the structure size across targets, but I don't think we need that consistency. I don't have a strong preference for unsigned vs size_t, so how about we go with your slight preference for unsigned unless someone finds a reason to use something else?

  1. Should int excludeDeclarationsFromPCH and int displayDiagnostics currently passed to clang_createIndex() also be included in the struct? Then only a single argument will be passed to clang_createIndexWithOptions(): CXIndexOptions.

I think that makes sense to me. It does raise the question of whether we want to pack these boolean-like fields together, as in:

struct CXIndexOptions {
  size_t Size;

  int ExcludeDeclsFromPCH : 1;
  int DisplayDiagnostics : 1;
  int Reserved : 30;

  const char *PreambleStoragePath;
  ...
};

This makes it a little less likely to need to grow the structure when adding new options.

When we add new options, the struct's size must grow in order to distinguish different struct versions and prevent undefined behavior!

Oh gosh you're absolutely right, that was a think-o on my part.

OK, let's skip the global options member for now. I'll add a \todo about this.

SGTM, thank you!

  1. clang_createIndex() initializes global options based on environment variable values:
if (getenv("LIBCLANG_BGPRIO_INDEX"))
    CIdxr->setCXGlobalOptFlags(CIdxr->getCXGlobalOptFlags() |
                               CXGlobalOpt_ThreadBackgroundPriorityForIndexing);
  if (getenv("LIBCLANG_BGPRIO_EDIT"))
    CIdxr->setCXGlobalOptFlags(CIdxr->getCXGlobalOptFlags() |
                               CXGlobalOpt_ThreadBackgroundPriorityForEditing);

The recommended in documentation usage of clang_CXIndex_setGlobalOptions is:

* \code
* CXIndex idx = ...;
* clang_CXIndex_setGlobalOptions(idx,
*     clang_CXIndex_getGlobalOptions(idx) |
*     CXGlobalOpt_ThreadBackgroundPriorityForIndexing);
* \endcode

So making these options part of struct CXIndexOptions and deprecating clang_CXIndex_setGlobalOptions requires introducing another global function that would read the environment variables:

CINDEX_LINKAGE unsigned clang_getDefaultGlobalOptions();

Is this the right approach?

Hmm, to make this patch easier, I think we might want to leave the environment variable behavior alone and not shift these into the options structure (yet?). Naively, I think it makes sense for these to eventually live in the options structure, but we could expose them in a few different ways (an option to prefer the env variable over a manual value as it is today or an option to prefer the manual value over the env variable for folks who want more hermetic behavior). WDYT? My opinion here isn't super strong, so if you have a strong desire to deprecate and add a replacement API, I think that's a defensible course to take.

On second thought, the proposed clang_getDefaultGlobalOptions() API already offers users a choice to either prefer or override each option value from the env. variables, just like the existing clang_CXIndex_[gs]etGlobalOptions API. The environment variables are binary options: an existence, not value, of an env. variable determines an initial option value. So I don't understand what are the two different ways to expose these options.

I've been trying to think of benefits for using a fixed-size integer type and the closest I can come is the consistency of the structure size across targets, but I don't think we need that consistency. I don't have a strong preference for unsigned vs size_t, so how about we go with your slight preference for unsigned unless someone finds a reason to use something else?

Sounds good. Here is my current WIP API version:

typedef struct CXIndexOptions {
  /**
   * The size of struct CIndexOptions used for option versioning.
   *
   * Always assign sizeof(CIndexOptions) to this member right after
   * creating a CXIndexOptions object.
   */
  unsigned Size;
  /**
   * \see clang_createIndex()
   */
  int ExcludeDeclarationsFromPCH : 1;
  int DisplayDiagnostics : 1;
  /**
   * The path to a directory, in which to store temporary PCH files. If null or
   * empty, the default system temporary directory is used. These PCH files are
   * deleted on clean exit but stay on disk if the program crashes or is killed.
   *
   * Libclang does not create the directory at the specified path in the file
   * system. Therefore it must exist, or storing PCH files will fail.
   */
  const char *PreambleStoragePath;
  /**
   * Specifies a path which will contain log files for certain libclang
   * invocations. A null value implies that libclang invocations are not logged.
   */
  const char *InvocationEmissionPath;
} CXIndexOptions;

/**
 * Provides a shared context for creating translation units.
 *
 * Call this function instead of clang_createIndex() if you need to configure
 * the additional options in CXIndexOptions.
 *
 * \returns The created index or null in case of error, such as an unsupported
 * value of options->Size.
 *
 * For example:
 * \code
 * CXIndex createIndex(const char *ApplicationTemporaryPath) {
 *   const int ExcludeDeclarationsFromPCH = 1;
 *   const int DisplayDiagnostics = 1;
 * #if CINDEX_VERSION_MINOR >= 64
 *   CIndexOptions Opts;
 *   Opts.Size = sizeof(CIndexOptions);
 *   Opts.ExcludeDeclarationsFromPCH = ExcludeDeclarationsFromPCH;
 *   Opts.DisplayDiagnostics = DisplayDiagnostics;
 *   Opts.PreambleStoragePath = ApplicationTemporaryPath;
 *   Opts.InvocationEmissionPath = NULL;
 *   CIndex Idx = clang_createIndexWithOptions(&Opts);
 *   if (Idx)
 *     return Idx;
 *   fprintf(stderr, "clang_createIndexWithOptions() failed. "
 *                   "CINDEX_VERSION_MINOR = %d, sizeof(CIndexOptions) = %u",
 *           CINDEX_VERSION_MINOR, Opts.Size);
 * #else
 *   (void)ApplicationTemporaryPath;
 * #endif
 *
 *   return clang_createIndex(ExcludeDeclarationsFromPCH, DisplayDiagnostics);
 * }
 * \endcode
 *
 * \sa clang_createIndex()
 */
CINDEX_LINKAGE CXIndex
clang_createIndexWithOptions(const CXIndexOptions *options);
  1. clang_createIndex() initializes global options based on environment variable values:
if (getenv("LIBCLANG_BGPRIO_INDEX"))
    CIdxr->setCXGlobalOptFlags(CIdxr->getCXGlobalOptFlags() |
                               CXGlobalOpt_ThreadBackgroundPriorityForIndexing);
  if (getenv("LIBCLANG_BGPRIO_EDIT"))
    CIdxr->setCXGlobalOptFlags(CIdxr->getCXGlobalOptFlags() |
                               CXGlobalOpt_ThreadBackgroundPriorityForEditing);

The recommended in documentation usage of clang_CXIndex_setGlobalOptions is:

* \code
* CXIndex idx = ...;
* clang_CXIndex_setGlobalOptions(idx,
*     clang_CXIndex_getGlobalOptions(idx) |
*     CXGlobalOpt_ThreadBackgroundPriorityForIndexing);
* \endcode

So making these options part of struct CXIndexOptions and deprecating clang_CXIndex_setGlobalOptions requires introducing another global function that would read the environment variables:

CINDEX_LINKAGE unsigned clang_getDefaultGlobalOptions();

Is this the right approach?

Hmm, to make this patch easier, I think we might want to leave the environment variable behavior alone and not shift these into the options structure (yet?). Naively, I think it makes sense for these to eventually live in the options structure, but we could expose them in a few different ways (an option to prefer the env variable over a manual value as it is today or an option to prefer the manual value over the env variable for folks who want more hermetic behavior). WDYT? My opinion here isn't super strong, so if you have a strong desire to deprecate and add a replacement API, I think that's a defensible course to take.

On second thought, the proposed clang_getDefaultGlobalOptions() API already offers users a choice to either prefer or override each option value from the env. variables, just like the existing clang_CXIndex_[gs]etGlobalOptions API. The environment variables are binary options: an existence, not value, of an env. variable determines an initial option value. So I don't understand what are the two different ways to expose these options.

I was thinking of the env variable as determining the initial option value, so the two options I saw were "the the env variable provides the final resulting value used by the program" and "the env variable provides the default value used by the program". But you're right, the current behavior is that the presence of the env variable determines the behavior, not the value of the env variable.

The question really boils down to: if the user passes in an option which says "don't use thread background priority" to the call to create index, AND there is an environment variable saying "use thread background priority", who should win? And does the answer differ if the option says "use thread background priority" and the environment variable does not exist?

Sounds good. Here is my current WIP API version:

This looks sensible to me.

vedgy added a comment.EditedFeb 23 2023, 10:14 AM

On second thought, the proposed clang_getDefaultGlobalOptions() API already offers users a choice to either prefer or override each option value from the env. variables, just like the existing clang_CXIndex_[gs]etGlobalOptions API. The environment variables are binary options: an existence, not value, of an env. variable determines an initial option value. So I don't understand what are the two different ways to expose these options.

I was thinking of the env variable as determining the initial option value, so the two options I saw were "the the env variable provides the final resulting value used by the program" and "the env variable provides the default value used by the program". But you're right, the current behavior is that the presence of the env variable determines the behavior, not the value of the env variable.

The question really boils down to: if the user passes in an option which says "don't use thread background priority" to the call to create index, AND there is an environment variable saying "use thread background priority", who should win? And does the answer differ if the option says "use thread background priority" and the environment variable does not exist?

The purpose of adding the GlobalOptions member to CXIndexOptions and deprecating clang_CXIndex_setGlobalOptions is to improve libclang's thread safety. The proposed approach keeps the meaning of the environment variables and enables straightforward porting of existing uses. A good reason to add the GlobalOptions member now is to reduce the number of supported struct CXIndexOptions versions.

If someone decides to prioritize API users vs environment variables differently, then new 3-state (unset, disabled, enabled) environment variables could be introduced and (possibly) the existing ones deprecated. Such a change may require deprecating the proposed clang_getDefaultGlobalOptions() API addition and introducing a more informative replacement.

The question is whether such environment variable behavior changes are likely to be proposed any time soon and how important is this possibility compared to the potential thread safety improvement. I suppose if no one has proposed such environment variable changes yet, then the current behavior works well enough for all libclang users. Just searched for LIBCLANG_BGPRIO_INDEX, LIBCLANG_BGPRIO_EDIT, CXGlobalOpt_ThreadBackgroundPriorityForIndexing, CXGlobalOpt_ThreadBackgroundPriorityForEditing, clang_CXIndex_setGlobalOptions and clang_CXIndex_getGlobalOptions on the LLVM Project's issue tracker. Found no results for any of these strings.

On second thought, the proposed clang_getDefaultGlobalOptions() API already offers users a choice to either prefer or override each option value from the env. variables, just like the existing clang_CXIndex_[gs]etGlobalOptions API. The environment variables are binary options: an existence, not value, of an env. variable determines an initial option value. So I don't understand what are the two different ways to expose these options.

I was thinking of the env variable as determining the initial option value, so the two options I saw were "the the env variable provides the final resulting value used by the program" and "the env variable provides the default value used by the program". But you're right, the current behavior is that the presence of the env variable determines the behavior, not the value of the env variable.

The question really boils down to: if the user passes in an option which says "don't use thread background priority" to the call to create index, AND there is an environment variable saying "use thread background priority", who should win? And does the answer differ if the option says "use thread background priority" and the environment variable does not exist?

The purpose of adding the GlobalOptions member to CXIndexOptions and deprecating clang_CXIndex_setGlobalOptions is to improve libclang's thread safety. The proposed approach keeps the meaning of the environment variables and enables straightforward porting of existing uses. A good reason to add the GlobalOptions member now is to reduce the number of supported struct CXIndexOptions versions.

If someone decides to prioritize API users vs environment variables differently, then new 3-state (unset, disabled, enabled) environment variables could be introduced and (possibly) the existing ones deprecated. Such a change may require deprecating the proposed clang_getDefaultGlobalOptions() API addition and introducing a more informative replacement.

The question is whether such environment variable behavior changes are likely to be proposed any time soon and how important is this possibility compared to the potential thread safety improvement. I suppose if no one has proposed such environment variable changes yet, then the current behavior works well enough for all libclang users. Just searched for LIBCLANG_BGPRIO_INDEX, LIBCLANG_BGPRIO_EDIT, CXGlobalOpt_ThreadBackgroundPriorityForIndexing, CXGlobalOpt_ThreadBackgroundPriorityForEditing, clang_CXIndex_setGlobalOptions and clang_CXIndex_getGlobalOptions on the LLVM Project's issue tracker. Found no results for any of these strings.

I'm not seeing a whole lot of usage here either: https://sourcegraph.com/search?q=context:global+lang:c+-file:.*libclang.*+-file:Index.h+LIBCLANG_BGPRIO_INDEX%7CLIBCLANG_BGPRIO_EDIT%7CCXGlobalOpt_ThreadBackgroundPriorityForIndexing%7CCXGlobalOpt_ThreadBackgroundPriorityForEditing%7Cclang_CXIndex_setGlobalOptions%7Cclang_CXIndex_getGlobalOptions&patternType=regexp&sm=1&groupBy=repo

So my intuition is that the current behavior works well enough and I doubt anyone's going to propose changes to it in the future.

vedgy added a comment.Feb 24 2023, 8:33 AM

So my intuition is that the current behavior works well enough and I doubt anyone's going to propose changes to it in the future.

So adding the GlobalOptions member to CXIndexOptions, adding the clang_getDefaultGlobalOptions() function and deprecating clang_CXIndex_setGlobalOptions in this revision is fine, right?

So my intuition is that the current behavior works well enough and I doubt anyone's going to propose changes to it in the future.

So adding the GlobalOptions member to CXIndexOptions, adding the clang_getDefaultGlobalOptions() function and deprecating clang_CXIndex_setGlobalOptions in this revision is fine, right?

I think so, yes.

vedgy updated this revision to Diff 500552.Feb 26 2023, 5:24 AM
vedgy retitled this revision from [libclang] Add API to set preferred temp dir path to [libclang] Add API to override preamble storage path.
vedgy edited the summary of this revision. (Show Details)

Reimplement the API as discussed in review comments

vedgy added inline comments.Feb 26 2023, 5:36 AM
clang/tools/c-index-test/c-index-test.c
79 ↗(On Diff #500552)

When a libclang user needs to override a single option in CXIndexOptions, [s]he has to set every member of the struct explicitly. When new options are added, each libclang user needs to update the code that sets the options under CINDEX_VERSION_MINOR #ifs. Accidentally omitting even one member assignment risks undefined or wrong behavior. How about adding an inline helper function CXIndexOptions clang_getDefaultIndexOptions(), which assigns default values to all struct members? Libclang users can then call this function to obtain the default configuration, then tweak only the members they want to override.

If this suggestion is to be implemented, how to deal with the differences of the inline keyword in C and C++?

2079 ↗(On Diff #500552)

Not sure -1 is the right value to return here. This return value becomes the exit code of the c-index-test executable.

vedgy added inline comments.Feb 26 2023, 7:00 AM
clang/include/clang-c/Index.h
319

The type is size_t instead of the agreed upon unsigned, because the addition of unsigned GlobalOptions below means that unsigned Size no longer reduces the overall struct's size.

aaron.ballman added inline comments.Feb 27 2023, 11:55 AM
clang/include/clang-c/Index.h
319

SGTM, thanks for the explanation

353

Don't want this to be a K&R C interface in pre-C2x modes.

clang/tools/c-index-test/c-index-test.c
79 ↗(On Diff #500552)

By default, 0 should give you the most reasonable default behavior for most of the existing options (and new options should follow the same pattern). Ideally, users should be able to do:

CXIndexOptions Opts;
memset(&Opts, 0, sizeof(Opts));
Opts.Size = sizeof(Opts);
Opts.Whatever = 12;
CXIndex Idx = clang_createIndexWithOptions(&Opts);

Global options defaulting to 0 is fine (uses regular thread priorities), we don't think want to default to excluding declarations from PCH, and we want to use the default preamble and invocation emission paths (if any). The only option that nonzero as a default *might* make sense for is displaying diagnostics, but even that seems reasonable to expect the developer to manually enable.

So I don't know that we need a function to get us default indexing options as 0 should be a reasonable default for all of the options. As we add new options, we need to be careful to add them in backwards compatible ways where 0 means "do the most likely thing".

WDYT?

2079 ↗(On Diff #500552)

I think -1 is fine -- the important thing is a nonzero return code so it's logged as an error rather than a valid result.

clang/tools/libclang/CIndex.cpp
3689
3699–3700

I think we want this to be > and not !=, maybe.

If the sizes are equal, we're on the happy path.

If the options from the caller are smaller than the options we know about, that should be okay because we won't attempt read the options not provided and instead rely on the default behavior being reasonable.

If the options from the caller are larger than the options we know about, we have to assume the user set an option we can't handle, and thus failing the request is reasonable.

So the way I'm envisioning us reading the options is:

if (options->Size >= offsetof(CXIndexOptions, FieldWeCareAbout))
  do_something(options->FieldWeCareAbout);
vedgy added inline comments.Feb 27 2023, 10:09 PM
clang/tools/c-index-test/c-index-test.c
79 ↗(On Diff #500552)

The disadvantages of committing to defaulting to 0:

  1. The usage you propose is still more verbose and error-prone than
CXIndexOptions Opts = clang_getDefaultIndexOptions();
Opts.Whatever = 12;
CXIndex Idx = clang_createIndexWithOptions(&Opts);
  1. The memset would look very unclean in modern C++ code.
  2. The 0 commitment may force unnatural naming of a future option to invert its meaning.

The advantages:

  1. No need to implement the inline function now.
  2. Faster, but I am sure this performance difference doesn't matter. Even a non-inline function call itself (even if it did nothing) to clang_createIndexWithOptions() should take longer than assigning a few values to built-in-typed members.

Another advantage of not having to remember to update the inline function's implementation when new options are added is counterbalanced by the need to be careful to add new options in backwards compatible way where 0 is the default.

Any other advantages of the 0 that I miss? Maybe there are some advantages for C users, but I suspect most libclang users are C++.

2079 ↗(On Diff #500552)

I have noticed that these functions sometimes return -1 and sometimes 1 on different errors. I thought that perhaps there is some difference in these two return values, but I couldn't figure out what it might be. Since telling the difference is not straightforward, you are probably right that there is no meaningful difference.

clang/tools/libclang/CIndex.cpp
3699–3700

You are probably right looking forward.

I have opted for != because this is the initial struct version. So right now a smaller struct size means something very unexpected or a failure to assign correct size on the user side. The advantage of != is then more reliable catching of bugs in freshly written user code that uses the struct and an added flexibility of being able to reduce the struct's size in the next revision.

aaron.ballman added inline comments.Feb 28 2023, 5:32 AM
clang/tools/c-index-test/c-index-test.c
79 ↗(On Diff #500552)

The usage you propose is still more verbose and error-prone than

CXIndexOptions Opts = clang_getDefaultIndexOptions();
Opts.Whatever = 12;
CXIndex Idx = clang_createIndexWithOptions(&Opts);

I see it as being roughly the same amount of verbosity and proneness to error, but maybe that's just me.

The memset would look very unclean in modern C++ code.

I don't see this as a problem. 1) std::memset gets plenty of usage in modern C++ still, 2) you can also initialize with = { sizeof(CXIndexOptions) }; and rely on the zero init for subsequent members after the first (personally, I think that's too clever, but reasonable people may disagree), 3) this is a C API, folks using C++ may wish to wrap it in a better interface for C++ anyway.

The 0 commitment may force unnatural naming of a future option to invert its meaning.

I do worry about this one, especially given there's no way to predict what future options we'll need. But it also forces us to think about what the correct default behavior should be, whereas if we push it off to a helper function, we make it harder for everyone who didn't know to use that helper function for whatever reason.

Any other advantages of the 0 that I miss? Maybe there are some advantages for C users, but I suspect most libclang users are C++.

Familiarity for those who are used to dealing with existing C APIs that behave this way (like Win32 APIs), but that can probably be argued either way. My guess is that all libclang users are having to work through the C interface, and some decently large amount of them do so from languages other than C. But whether that means most are C++ users or not is questionable -- I know I've seen uses from Python and Java as well as C++.

I'm not strongly opposed to having an API to get the default values, but I don't see a lot of value in it either. It's something we could always add later if we find there's a need for it in practice.

vedgy added inline comments.Feb 28 2023, 9:09 AM
clang/tools/c-index-test/c-index-test.c
79 ↗(On Diff #500552)

I see it as being roughly the same amount of verbosity and proneness to error, but maybe that's just me.

Calling the default-value function is one requirement. Assigning sizeof and calling memset - 2. Plus the user has to pass the correct arguments to memset.

  1. std::memset gets plenty of usage in modern C++ still

Yes, there is usage, but modern C++ guidelines recommend replacing such usages with standard algorithms whenever possible. memset-ting an object generally risks undefined behavior.

whereas if we push it off to a helper function, we make it harder for everyone who didn't know to use that helper function for whatever reason.

If a user doesn't read the documentation, [s]he would likely not guess that memset needs to be called either. So I don't see a disadvantage of the helper function here.

It's something we could always add later if we find there's a need for it in practice.

We could add it later. But if we advise using memset in the first version, introducing a struct member with non-zero default value would risk breaking backward compatibility for users that don't track libclang API changes closely. It would also force all users that do notice the change to update their code with #if CINDEX_VERSION_MINOR >= XY to use either memset when compiling against libclang versions that don't yet have the helper function, or the helper function.

Introducing the helper function from the beginning affords more flexibility and space for future libclang API changes with less risk of breaking backward compatibility. If someone decides to ignore the helper function and memset the struct instead, breaking such usage should not be an obstacle/concern to future libclang changes.

aaron.ballman added inline comments.Feb 28 2023, 10:37 AM
clang/tools/c-index-test/c-index-test.c
79 ↗(On Diff #500552)

I see it as being roughly the same amount of verbosity and proneness to error, but maybe that's just me.

Calling the default-value function is one requirement. Assigning sizeof and calling memset - 2. Plus the user has to pass the correct arguments to memset.

Both of which can be done in a single one-liner if brevity is that important for your coding style.

Yes, there is usage, but modern C++ guidelines recommend replacing such usages with standard algorithms whenever possible. memset-ting an object generally risks undefined behavior.

I don't see how that can be much of a real risk in a C interface.

If a user doesn't read the documentation, [s]he would likely not guess that memset needs to be called either. So I don't see a disadvantage of the helper function here.

Agreed.

We could add it later. But if we advise using memset in the first version, introducing a struct member with non-zero default value would risk breaking backward compatibility ...

My recommendation is that we don't add the API now and we leave a comment in the options struct reminding folks about the importance of new options being expressed such that 0 gives the correct default behavior. As you say, it is a risk that we break the model in the future, but it seems like a somewhat low risk.

vedgy updated this revision to Diff 501559.Mar 1 2023, 10:00 AM

Address review comments

vedgy marked 2 inline comments as done.Mar 1 2023, 10:09 AM
vedgy added inline comments.
clang/include/clang-c/Index.h
329

When I almost finished the requested changes, I remembered that the return value of clang_getDefaultGlobalOptions() depends on environment variables, and thus 0 is not necessarily the default. Adjusted the changes and updated this revision.

Does the extra requirement to non-zero initialize this second member sway your opinion on the usefulness of the helper function inline CXIndexOptions clang_getDefaultIndexOptions()? Note that there may be same (environment) or other important reasons why future new options couldn't be zeroes by default.

clang/tools/c-index-test/c-index-test.c
79 ↗(On Diff #500552)

Added and modified documentation, adapted usage in tests.

clang/tools/libclang/CIndex.cpp
3699–3700

Added your recommended way to add new options in comments.

aaron.ballman added inline comments.Mar 1 2023, 12:02 PM
clang/include/clang-c/Index.h
329

Thinking out loud a bit... (potentially bad idea incoming)

What if we dropped clang_getDefaultGlobalOptions() and instead made a change to CXGlobalOptFlags:

typedef enum {
  /**
   * Used to indicate that the default CXIndex options are used. By default, no
   * global options will be used. However, environment variables may change which
   * global options are in effect at runtime.
   */
  CXGlobalOpt_Default = 0x0,

  /**
   * Used to indicate that threads that libclang creates for indexing
   * purposes should use background priority.
   *
   * Affects #clang_indexSourceFile, #clang_indexTranslationUnit,
   * #clang_parseTranslationUnit, #clang_saveTranslationUnit.
   */
  CXGlobalOpt_ThreadBackgroundPriorityForIndexing = 0x1,

  /**
   * Used to indicate that threads that libclang creates for editing
   * purposes should use background priority.
   *
   * Affects #clang_reparseTranslationUnit, #clang_codeCompleteAt,
   * #clang_annotateTokens
   */
  CXGlobalOpt_ThreadBackgroundPriorityForEditing = 0x2,

  /**
   * Used to indicate that all threads that libclang creates should use
   * background priority.
   */
  CXGlobalOpt_ThreadBackgroundPriorityForAll =
      CXGlobalOpt_ThreadBackgroundPriorityForIndexing |
      CXGlobalOpt_ThreadBackgroundPriorityForEditing,

  /**
   * Used to indicate that no global options should be used, even
   * in the presence of environment variables.
   */
  CXGlobalOpt_None = 0xFFFFFFFF
} CXGlobalOptFlags;

so that when the user passes 0 they get the previous behavior.

clang_CXIndex_setGlobalOptions() would remain deprecated. clang_CXIndex_getGlobalOptions() would be interesting though -- would it return CXGlobalOpt_None or CXGlobalOpt_Default in the event the index was created without any global options? Hmmm.

Err, actually, I suppose this won't work too well because then code silently changes behavior if it does clang_CXIndex_setGlobalOptions(CXGlobalOpt_None); because that would change from "do what the environment says" to "ignore the environment". But I have to wonder whether anyone actually *does* that or not... my intuition is that folks would not call clang_CXIndex_setGlobalOptions() at all unless they were setting an option to a non-none value. We could rename CXGlobalOpt_None to CXGlobalOpt_Nothing (or something along those lines) to force a compilation error, but that's a bit of a nuclear option for what's supposed to be a stable API.

So I'm on the fence, I guess. I'd still prefer for zero to give sensible defaults and I don't think there's enough use of the global options + environment variables to matter. But I also don't like silently breaking code, so my idea above may be a nonstarter.

I suppose another possible idea is: deprecate the notion of global options enum and setter/getter entirely, add two new fields to CXIndexOptions:

typedef enum {
  CXChoice_Default = 0,
  CXChoice_Enabled = 1,
  CXChoice_Disabled = 2
} CXChoice;

...
unsigned ThreadPriorityBackgroundForIndexing;
unsigned ThreadPriorityBackgroundForEditing;
...

so that 0 gives the correct default behavior based on environment variable. There would be no global setter or getter for this information (and we'd eventually remove clang_CXIndex_[gs]etGlobalOptions()).

vedgy added inline comments.Mar 2 2023, 12:42 AM
clang/include/clang-c/Index.h
329

I suppose this won't work too well because then code silently changes behavior if it does clang_CXIndex_setGlobalOptions(CXGlobalOpt_None); because that would change from "do what the environment says" to "ignore the environment".

No, the current consequence of such a call already is to ignore the environment. What would change is the consequence of calling clang_CXIndex_setGlobalOptions(0); - from "ignore the environment" to "do what the environment says".

But I have to wonder whether anyone actually *does* that or not... my intuition is that folks would not call clang_CXIndex_setGlobalOptions() at all unless they were setting an option to a non-none value.

I agree. Possible unlikely reasons to call clang_CXIndex_setGlobalOptions(0) are:

  1. in case the environment variables are set for some other program;
  2. in case setting the environment variables had been useful in the past but not in the latest and greatest version of an IDE;
  3. if background priority is never useful for an IDE.

I suppose another possible idea is: deprecate the notion of global options enum and setter/getter entirely, add two new fields to CXIndexOptions

This appears to be a great idea to me. The notion of CXGlobalOptFlags somewhat conflicts with the new CXIndexOptions struct, in which two other boolean options are represented by bit-fields.

I think we can forbid from the start calling clang_CXIndex_[gs]etGlobalOptions() if the index is created via the new function clang_createIndexWithOptions.

If 3-state environment variables (unspecified/on/off) are introduced in the future, CXChoice could be extended with CXChoice_FromEnvironmentOrEnabled = 3 to indicate that if the environment variable is present, its value should be respected, otherwise the thread priority should be enabled.

CXChoice cannot possibly have many valid values. So how about:

unsigned char ThreadPriorityBackgroundForIndexing;
unsigned char ThreadPriorityBackgroundForEditing;

Then size_t Size could become unsigned Size and all non-pointer options would fit into 8 bytes on x86_64.

Did you reorder the words in the variable names intentionally? CXGlobalOpt_ThreadBackgroundPriorityForIndexing => ThreadPriorityBackgroundForIndexing

vedgy marked an inline comment as not done.Mar 2 2023, 3:37 AM
vedgy added inline comments.
clang/include/clang-c/Index.h
329

Perhaps clang_CXIndex_getGlobalOptions() should not be deprecated. It would allow the application to learn which priorities are actually used by libclang.

vedgy marked an inline comment as not done.Mar 2 2023, 3:39 AM
aaron.ballman added inline comments.Mar 2 2023, 11:49 AM
clang/include/clang-c/Index.h
329

I think we can forbid from the start calling clang_CXIndex_[gs]etGlobalOptions() if the index is created via the new function clang_createIndexWithOptions.

I think that would be great, at least for the setter. As you mention, we might not want to deprecate the getter (so long as it reports accurate information to the caller) because that could be useful still. However, I was also thinking that same logic could apply (at least in theory) to other options in the structure and so we might want to deprecate it with a replacement that gets all of the options from the index at once. However, that felt like scope creep for this already long-running patch.

CXChoice cannot possibly have many valid values. So how about:

unsigned char ThreadPriorityBackgroundForIndexing;
unsigned char ThreadPriorityBackgroundForEditing;

Then size_t Size could become unsigned Size and all non-pointer options would fit into 8 bytes on x86_64.

Hmmm that could work, but how about:

/* Stores a value of type CXChoice. */
int ThreadPriorityBackgroundForIndexing : 2;
/* Stores a value of type CXChoice. */
int ThreadPriorityBackgroundForEditing : 2;

so that they pack together with the existing bit-fields?

Did you reorder the words in the variable names intentionally? CXGlobalOpt_ThreadBackgroundPriorityForIndexing => ThreadPriorityBackgroundForIndexing

Nope, sorry for the confusion, that was a think-o on my part.

dblaikie removed a subscriber: dblaikie.Mar 2 2023, 2:21 PM
vedgy added inline comments.Mar 2 2023, 11:49 PM
clang/include/clang-c/Index.h
329

I think that would be great, at least for the setter. As you mention, we might not want to deprecate the getter (so long as it reports accurate information to the caller) because that could be useful still. However, I was also thinking that same logic could apply (at least in theory) to other options in the structure and so we might want to deprecate it with a replacement that gets all of the options from the index at once. However, that felt like scope creep for this already long-running patch.

For the time being, only the two global options depend on the environment. Though one could get the system temporary directory path if PreambleStoragePath is not overridden. Anyway, there is no known use case for getting such information, so no need to implement this right now, especially since clang_CXIndex_getGlobalOptions() already exists and would work correctly without modifications.

CXChoice cannot possibly have many valid values. So how about:

unsigned char ThreadPriorityBackgroundForIndexing;
unsigned char ThreadPriorityBackgroundForEditing;

Then size_t Size could become unsigned Size and all non-pointer options would fit into 8 bytes on x86_64.

Hmmm that could work, but how about:

/* Stores a value of type CXChoice. */
int ThreadPriorityBackgroundForIndexing : 2;
/* Stores a value of type CXChoice. */
int ThreadPriorityBackgroundForEditing : 2;

so that they pack together with the existing bit-fields?

The tighter packing wouldn't reduce the size of the struct further in LLVM 17, because most likely only one more boolean (StorePreamblesInMemory) will be added to the struct in this release. In future releases the struct size would have to increase for versioning purposes anyway. But the 2-bitness would reduce the flexibility of CXChoice. I have already invented a 4th possible enumerator. Someone could come up with a fifth in the future. So I'd prefer unsigned char or at least : 4 (4 bits) for these non-boolean options.

aaron.ballman added inline comments.Mar 3 2023, 8:08 AM
clang/include/clang-c/Index.h
329

The tighter packing wouldn't reduce the size of the struct further in LLVM 17, because most likely only one more boolean (StorePreamblesInMemory) will be added to the struct in this release.

Okay, fair point, let's stick with unsigned char then (tbh, I don't know that I'd be upset if it was int either -- this structure is passed by pointer anyway).

vedgy updated this revision to Diff 502376.Mar 4 2023, 9:37 AM

Replace clang_getDefaultGlobalOptions() with CXChoice as discussed.
A few unrelated small improvements.

The changes look correct to me, but precommit CI isn't able to apply the patch cleanly to test. Can you rebase the patch so it applies so we can smoke test it before acceptance?

vedgy updated this revision to Diff 502632.Mar 6 2023, 7:01 AM
vedgy edited the summary of this revision. (Show Details)

Clean rebase on main branch where the base revision D143415 has already landed.

aaron.ballman accepted this revision.Mar 6 2023, 11:49 AM

Thank you, this LGTM! I have to head out shortly, so I'll land this on your behalf tomorrow when I have the time to babysit the postcommit build farm. However, if you'd like to request commit access for yourself, I think that'd be reasonable as well: https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access Let me know which route you'd prefer going.

This revision is now accepted and ready to land.Mar 6 2023, 11:49 AM
vedgy added a comment.Mar 7 2023, 1:54 AM

Thank you, this LGTM! I have to head out shortly, so I'll land this on your behalf tomorrow when I have the time to babysit the postcommit build farm. However, if you'd like to request commit access for yourself, I think that'd be reasonable as well: https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access Let me know which route you'd prefer going.

https://llvm.org/docs/Phabricator.html#committing-a-change says:

Using the Arcanist tool can simplify the process of committing reviewed code as it will retrieve reviewers, the Differential Revision, etc from the review and place it in the commit message. You may also commit an accepted change directly using git push, per the section in the getting started guide.

But how to use the Arcanist tool to push reviewed changes is not elaborated. As far as I can tell from the Arcanist documentation, if I have the commit in my local main branch which is currently checked out, I simply need to run arc land without arguments. Hopefully the Git pre-push hook I have set up will run in this case.

I have no idea how to babysit the postcommit build farm, and so wouldn't commit when you don't have time anyway. I plan to make only one more change to libclang in the foreseeable future, so not sure learning to handle postcommit issues is justified. I'll leave this to your discretion as I have no idea how difficult and time-consuming this work is, compared to learning how to do it.

This revision was automatically updated to reflect the committed changes.

Thank you, this LGTM! I have to head out shortly, so I'll land this on your behalf tomorrow when I have the time to babysit the postcommit build farm. However, if you'd like to request commit access for yourself, I think that'd be reasonable as well: https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access Let me know which route you'd prefer going.

https://llvm.org/docs/Phabricator.html#committing-a-change says:

Using the Arcanist tool can simplify the process of committing reviewed code as it will retrieve reviewers, the Differential Revision, etc from the review and place it in the commit message. You may also commit an accepted change directly using git push, per the section in the getting started guide.

But how to use the Arcanist tool to push reviewed changes is not elaborated. As far as I can tell from the Arcanist documentation, if I have the commit in my local main branch which is currently checked out, I simply need to run arc land without arguments. Hopefully the Git pre-push hook I have set up will run in this case.

I have no idea how to babysit the postcommit build farm, and so wouldn't commit when you don't have time anyway. I plan to make only one more change to libclang in the foreseeable future, so not sure learning to handle postcommit issues is justified. I'll leave this to your discretion as I have no idea how difficult and time-consuming this work is, compared to learning how to do it.

Thanks for the explanation, I'm happy to land on your behalf (esp if you don't plan to make many more changes in the future). I landed the changes in cc929590ad305f0d068709c7c7999f5fc6118dc9. I'm not of much help with arcanist as I've never actually used it, but I think you're correct about only needing to do arc land. In terms of babysitting the build farm, there's not too much to it -- most bots are configured to send an email to the folks on the blamelist when an issue is found. It's mostly a matter of being around to check your email, but you can manually watch from https://lab.llvm.org/buildbot/#/builders if you enjoy that sort of thing.

vedgy added a comment.Mar 7 2023, 6:30 AM

A clang-ppc64le-rhel build failed:

54.897 [28/169/148] Building CXX object tools/clang/unittests/libclang/CMakeFiles/libclangTests.dir/LibclangTest.cpp.o
FAILED: tools/clang/unittests/libclang/CMakeFiles/libclangTests.dir/LibclangTest.cpp.o 
/home/buildbots/clang.15.0.4/bin/clang++ --gcc-toolchain=/usr -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_GNU_SOURCE -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/build/tools/clang/unittests/libclang -I/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/clang/unittests/libclang -I/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/clang/include -I/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/build/tools/clang/include -I/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/build/include -I/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/llvm/include -I/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/third-party/unittest/googletest/include -I/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/third-party/unittest/googlemock/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -fno-common -Woverloaded-virtual -Wno-nested-anon-types -O3 -DNDEBUG  -Wno-variadic-macros -Wno-gnu-zero-variadic-macro-arguments -fno-exceptions -funwind-tables -fno-rtti -UNDEBUG -Wno-suggest-override -std=c++17 -MD -MT tools/clang/unittests/libclang/CMakeFiles/libclangTests.dir/LibclangTest.cpp.o -MF tools/clang/unittests/libclang/CMakeFiles/libclangTests.dir/LibclangTest.cpp.o.d -o tools/clang/unittests/libclang/CMakeFiles/libclangTests.dir/LibclangTest.cpp.o -c /home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/clang/unittests/libclang/LibclangTest.cpp
/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/clang/unittests/libclang/LibclangTest.cpp:488:50: error: missing field 'ThreadBackgroundPriorityForIndexing' initializer [-Werror,-Wmissing-field-initializers]
    CXIndexOptions Opts = {sizeof(CXIndexOptions)};
                                                 ^
1 error generated.

Same on ppc64le-lld-multistage-test.

Do you know why partial struct initialization is unsupported on this platform?
A simple workaround is to use memset in this single place. I am preparing a patch.

A clang-ppc64le-rhel build failed:

54.897 [28/169/148] Building CXX object tools/clang/unittests/libclang/CMakeFiles/libclangTests.dir/LibclangTest.cpp.o
FAILED: tools/clang/unittests/libclang/CMakeFiles/libclangTests.dir/LibclangTest.cpp.o 
/home/buildbots/clang.15.0.4/bin/clang++ --gcc-toolchain=/usr -DGTEST_HAS_RTTI=0 -D_DEBUG -D_GLIBCXX_ASSERTIONS -D_GNU_SOURCE -D_LIBCPP_ENABLE_ASSERTIONS -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -I/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/build/tools/clang/unittests/libclang -I/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/clang/unittests/libclang -I/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/clang/include -I/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/build/tools/clang/include -I/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/build/include -I/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/llvm/include -I/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/third-party/unittest/googletest/include -I/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/third-party/unittest/googlemock/include -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -Wctad-maybe-unsupported -fdiagnostics-color -ffunction-sections -fdata-sections -fno-common -Woverloaded-virtual -Wno-nested-anon-types -O3 -DNDEBUG  -Wno-variadic-macros -Wno-gnu-zero-variadic-macro-arguments -fno-exceptions -funwind-tables -fno-rtti -UNDEBUG -Wno-suggest-override -std=c++17 -MD -MT tools/clang/unittests/libclang/CMakeFiles/libclangTests.dir/LibclangTest.cpp.o -MF tools/clang/unittests/libclang/CMakeFiles/libclangTests.dir/LibclangTest.cpp.o.d -o tools/clang/unittests/libclang/CMakeFiles/libclangTests.dir/LibclangTest.cpp.o -c /home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/clang/unittests/libclang/LibclangTest.cpp
/home/buildbots/docker-RHEL84-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/clang/unittests/libclang/LibclangTest.cpp:488:50: error: missing field 'ThreadBackgroundPriorityForIndexing' initializer [-Werror,-Wmissing-field-initializers]
    CXIndexOptions Opts = {sizeof(CXIndexOptions)};
                                                 ^
1 error generated.

Same on ppc64le-lld-multistage-test.

Do you know why partial struct initialization is unsupported on this platform?
A simple workaround is to use memset in this single place. I am preparing a patch.

It's a two-stage build. The first stage passed, but the second stage is built with -Werror, which is why it fails. I addressed the issue in df8f8f76207df40dca11c9c0c2328d6b3dfba9ca, so no need for you to worry about making a patch. I went with {} to perform the zero initialization instead of memset, but either should work fine.

vedgy added a comment.Mar 7 2023, 6:50 AM

Thank you for the quick build fix. KDevelop's CMakeLists.txt disables this warning by adding the -Wno-missing-field-initializers flag. That's why I haven't noticed the warning while building KDevelop. Either I missed the warning while building LLVM or it is also disabled in a default GNU/Linux x86_64 build.

Thank you for the quick build fix.

You're welcome!

KDevelop's CMakeLists.txt disables this warning by adding the -Wno-missing-field-initializers flag. That's why I haven't noticed the warning while building KDevelop. Either I missed the warning while building LLVM or it is also disabled in a default GNU/Linux x86_64 build.

https://github.com/llvm/llvm-project/blob/main/llvm/cmake/modules/HandleLLVMOptions.cmake#L730

vedgy added a comment.Mar 7 2023, 8:03 AM

I am implementing the StorePreamblesInMemory bool option discussed earlier. It would be nice to be able to modify it at any time, because it can be an option in an IDE's UI and requiring to restart an IDE for the option change to take effect is undesirable. In order to make the setter thread-safe, the option can be stored as std::atomic<bool> StorePreamblesInMemory in class CIndexer and stored/loaded with memory_order_relaxed. Setting this option would apply only to subsequent clang_parseTranslationUnit_Impl() calls. The option can also be added to CXIndexOptions in order to allow setting its initial value reliably (not worrying whether it could be used before the setter gets called after index construction).

Is adding both the setter and the CXIndexOptions member OK or would you prefer only one of these two?

I am implementing the StorePreamblesInMemory bool option discussed earlier. It would be nice to be able to modify it at any time, because it can be an option in an IDE's UI and requiring to restart an IDE for the option change to take effect is undesirable. In order to make the setter thread-safe, the option can be stored as std::atomic<bool> StorePreamblesInMemory in class CIndexer and stored/loaded with memory_order_relaxed. Setting this option would apply only to subsequent clang_parseTranslationUnit_Impl() calls.

Hmmm, don't relaxed loads and stores still have the potential to be racey? I thought you needed a release on the store and an acquire on the load (or sequential consistency), but this is definitely not my area of expertise.

The option can also be added to CXIndexOptions in order to allow setting its initial value reliably (not worrying whether it could be used before the setter gets called after index construction).

Is adding both the setter and the CXIndexOptions member OK or would you prefer only one of these two?

It's a bit odd to me that we're deprecating the global option setters to turn around and add a new global option setter. The old interfaces are not thread safe and the new one is thread safe, so I get why the desire exists and is reasonable, but (personally) I'd like to get rid of the option state setters entirely someday. What I was envisioning was that the index creation would get global options and if we wanted per-TU options, we'd add an interface akin to clang_parseTranslationUnit() which takes another options struct for per-TU options. (Perhaps the default values for those options come from the global options -- e.g., DisplayDiagnostics is set at the global level but could be overridden for a single TU). Do you have thoughts about that approach?

vedgy added a comment.Mar 7 2023, 11:04 AM

Hmmm, don't relaxed loads and stores still have the potential to be racey? I thought you needed a release on the store and an acquire on the load (or sequential consistency), but this is definitely not my area of expertise.

The acquire/release semantics is needed if the atomic variable guards access to another non-atomic variable or the atomic pointer guards access to its non-atomic pointed-to value. The relaxed order means no guarantees about this variable's interaction with other atomic or non-atomic variables. But even the relaxed order prevents data races on the variable itself, which is sufficient here.

The option can also be added to CXIndexOptions in order to allow setting its initial value reliably (not worrying whether it could be used before the setter gets called after index construction).

Is adding both the setter and the CXIndexOptions member OK or would you prefer only one of these two?

It's a bit odd to me that we're deprecating the global option setters to turn around and add a new global option setter. The old interfaces are not thread safe and the new one is thread safe, so I get why the desire exists and is reasonable, but (personally) I'd like to get rid of the option state setters entirely someday.

I also thought about the deprecated old interfaces today. clang_CXIndex_setGlobalOptions() could be undeprecated by similarly making CIndexer::Options atomic. Safely setting std::string members would require a mutex.

What I was envisioning was that the index creation would get global options and if we wanted per-TU options, we'd add an interface akin to clang_parseTranslationUnit() which takes another options struct for per-TU options. (Perhaps the default values for those options come from the global options -- e.g., DisplayDiagnostics is set at the global level but could be overridden for a single TU). Do you have thoughts about that approach?

This would allow to specify whether to store each individual preamble in memory, which is more specific than necessary for my use case. This would shift the burden of passing around the StorePreamblesInMemory value to libclang users. Certainly more difficult to implement both in libclang and in KDevelop.

Advantages of getting rid of the option state setters entirely and passing options to per-TU APIs:

  1. More flexibility (per-TU option specification).
  2. Possibly more efficient (no need for atomics and mutexes).
  3. Clearer to API users which TUs the modified options apply to and which TUs (created earlier) keep the original options.
  4. Less risk of inconsistencies and bugs in libclang. Such as somehow passing a new option value to an old TU with unpredictable consequences.

Do the listed advantages explain your preference?

I am not sure what I would prefer from a hypothetical standpoint of a libclang maintainer. And you definitely have more experience in this area. So I won't argue for the index option state setters.

I'll probably implement the uncontroversial CXIndexOptions member first. And then, if I think implementing it is worth the effort, continue discussing clang_parseTranslationUnitWithOptions().

vedgy added inline comments.Mar 9 2023, 2:06 AM
clang/include/clang-c/Index.h
349

Assigning true to int : 1 bit-fields in C++ code produces a GCC warning:

warning: overflow in conversion from ‘int’ to ‘signed char:1’ changes value from ‘1’ to ‘-1’ [-Woverflow]

Following a suggestion in a comment to https://github.com/llvm/llvm-project/issues/53253, I replaced this int with unsigned and the warning disappeared. Same for int DisplayDiagnostics : 1. Should this type change be included in the upcoming StorePreamblesInMemory revision?

vedgy marked an inline comment as not done.Mar 9 2023, 5:54 AM
vedgy added inline comments.
clang/include/clang-c/Index.h
349

Or should this change be done in a separate revision, on which the StorePreamblesInMemory would be based?

I also implemented two other changes to the struct CXIndexOptions (mostly documentation/comments). Should these all be in separate revisions or combined into one?

Hmmm, don't relaxed loads and stores still have the potential to be racey? I thought you needed a release on the store and an acquire on the load (or sequential consistency), but this is definitely not my area of expertise.

The acquire/release semantics is needed if the atomic variable guards access to another non-atomic variable or the atomic pointer guards access to its non-atomic pointed-to value. The relaxed order means no guarantees about this variable's interaction with other atomic or non-atomic variables. But even the relaxed order prevents data races on the variable itself, which is sufficient here.

Ah, thank you for the explanation!

The option can also be added to CXIndexOptions in order to allow setting its initial value reliably (not worrying whether it could be used before the setter gets called after index construction).

Is adding both the setter and the CXIndexOptions member OK or would you prefer only one of these two?

It's a bit odd to me that we're deprecating the global option setters to turn around and add a new global option setter. The old interfaces are not thread safe and the new one is thread safe, so I get why the desire exists and is reasonable, but (personally) I'd like to get rid of the option state setters entirely someday.

I also thought about the deprecated old interfaces today. clang_CXIndex_setGlobalOptions() could be undeprecated by similarly making CIndexer::Options atomic. Safely setting std::string members would require a mutex.

Yeah, we could make it thread safe, but I still don't think setters are a good design approach. To me, these global options should be ones that are set when creating the index and be immutable from there on. (This mirrors how language options work in Clang itself -- we have a ton of language options, but they get set up by the compiler frontend and are (generally) immutable from there on. When we need to allow for different options, the interface accepts a LangOptions object, as in: https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Analysis/CFG.h#L1081)

What I was envisioning was that the index creation would get global options and if we wanted per-TU options, we'd add an interface akin to clang_parseTranslationUnit() which takes another options struct for per-TU options. (Perhaps the default values for those options come from the global options -- e.g., DisplayDiagnostics is set at the global level but could be overridden for a single TU). Do you have thoughts about that approach?

This would allow to specify whether to store each individual preamble in memory, which is more specific than necessary for my use case. This would shift the burden of passing around the StorePreamblesInMemory value to libclang users. Certainly more difficult to implement both in libclang and in KDevelop.

Advantages of getting rid of the option state setters entirely and passing options to per-TU APIs:

  1. More flexibility (per-TU option specification).
  2. Possibly more efficient (no need for atomics and mutexes).
  3. Clearer to API users which TUs the modified options apply to and which TUs (created earlier) keep the original options.
  4. Less risk of inconsistencies and bugs in libclang. Such as somehow passing a new option value to an old TU with unpredictable consequences.

Do the listed advantages explain your preference?

Yes, thank you! #3/#4 are really the biggest advantages, to me. By passing in the options explicitly to the TU instead of having setters, we reduce the complexity of the interface because it no longer becomes possible for an option to change mid-parse. This in turn makes testing far easier because we don't have to come up with test coverage for "what if the option was X and it got switched to Y before calling this function" kind of situations.

I am not sure what I would prefer from a hypothetical standpoint of a libclang maintainer. And you definitely have more experience in this area. So I won't argue for the index option state setters.

I'll probably implement the uncontroversial CXIndexOptions member first. And then, if I think implementing it is worth the effort, continue discussing clang_parseTranslationUnitWithOptions().

I think that's a good approach. You are a consumer of libclang, so having your perspective about what makes the interface easier for you to use is also really helpful in figuring out a good design. Thank you for being willing to share your experiences with using libclang on KDevelop!

clang/include/clang-c/Index.h
349

Assigning true to int : 1 bit-fields in C++ code produces a GCC warning:

warning: overflow in conversion from ‘int’ to ‘signed char:1’ changes value from ‘1’ to ‘-1’ [-Woverflow]

Ugh, I forgot that the C standard allows that. (C2x 6.7.2.1p12: "A bit-field member is interpreted as having a signed or unsigned integer type consisting of the specified number of bits" -- GCC decided to turn our int into signed char which is nice for packing data together, but not as nice when it comes to boolean-like bit-fields.)

Should this type change be included in the upcoming StorePreamblesInMemory revision?

It'd probably be the cleanest to fix that separately. Given that it's NFC and you don't have commit privileges, I can make the change on your behalf and land it today if that's what you'd like.

vedgy marked an inline comment as not done.Mar 9 2023, 6:03 AM
vedgy added inline comments.
clang/include/clang-c/Index.h
349

Yes, I agree that such changes should be in separate commits. But I don't want to burden you with committing them all separately. So if 4 is too much, I can request the commit access for myself. If this burden is not too heavy, I am fine with you making the change on my behalf.

aaron.ballman added inline comments.Mar 9 2023, 6:30 AM
clang/include/clang-c/Index.h
349

No worries, this was a trivial one -- I landed it in dbde7cc17c3a5b6a35e5ec598ba7eaba6f75d90b, so you should be able to fetch and rebase on top of that.

vedgy added inline comments.Mar 10 2023, 4:11 AM
clang/include/clang-c/Index.h
349

I also implemented two other changes to the struct CXIndexOptions (mostly documentation/comments).

Here they are: D145775, D145783.

aaron.ballman added inline comments.Mar 10 2023, 6:29 AM
clang/include/clang-c/Index.h
349

Thank you, I've landed both of them.

Just created the follow-up StorePreamblesInMemory revision: D145974.