This is an archive of the discontinued LLVM Phabricator instance.

[clang][lex] Keep references to `DirectoryLookup` objects up-to-date
Needs ReviewPublic

Authored by jansvoboda11 on Jan 6 2022, 8:16 AM.

Details

Summary

The elements of SearchPath::SearchDirs are being referenced to by their indices. This proved to be error-prone: HeaderSearch::SearchDirToHSEntry was accidentally not being updated in HeaderSearch::AddSearchPath(). This patch fixes that by referencing SearchPath::SearchDirs elements by their address instead, which is stable thanks to the bump-ptr-allocation strategy.

Depends on D117312 & D117566.

Diff Detail

Event Timeline

jansvoboda11 created this revision.Jan 6 2022, 8:16 AM
jansvoboda11 requested review of this revision.Jan 6 2022, 8:16 AM
Herald added a project: Restricted Project. · View Herald TranscriptJan 6 2022, 8:16 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript

Adjusting the indices seem pretty fragile to me. Any reason why you wanted to stick with indices as keys instead of switching to something else like I suggested here?

Adjusting the indices seem pretty fragile to me. Any reason why you wanted to stick with indices as keys instead of switching to something else like I suggested here?

I've implemented your suggestion locally like so:

llvm::SpecificBumpPtrAllocator<DirectoryLookup> SearchDirsAlloc;
std::vector<DirectoryLookup *> SearchDirs;
llvm::DenseMap<const DirectoryLookup *, unsigned> SearchDirToHSEntry;
...

In the end, I decided to stick with just updating the indices to avoid creating unnecessary indirection (and replacing . with -> everywhere).

I don't have any preference though, so if you feel strongly about the fragility, I can use addresses to refer to DirectoryLookup objects instead of indices.

I would prefer to use std::shared_ptr<DirectoryLookup> instead of unsigned. IIUC we already basically have a level of indirection because if we want to get the DirectoryLookup, we need to find it by its index in SearchDirs. And updating the indices just seems like a hack to me that can be solved more cleanly.

Wouldn’t we also need to update the indices in SearchDirsUsage and ModuleToSearchDirIdx? Or replace those by std::shared_ptr<DirectoryLookup> as well?

Yeah, in some cases, we'd be just replacing the index-based indirection with address-based indirection (e.g. when working with SearchDirToHSEntry).

However, most existing HeaderSearch algorithms already work with the index and use that to get the DirectoryLookup from SearchDirs. We'd be adding another level of indirection for them: first they'd need to get DirectoryLookup * from SearchDirs with the index and then dereference the pointer. Maybe that's not very important from performance perspective (especially if we bump-ptr-allocate them), but something to be aware of IMO.

The SearchDirsUsage member is a bit-vector that's being inserted into the same way SearchDirs is, so no need to fiddle with indices there. As for ModuleToSearchDirIdx: yes, we'd need to update indices the same way we do here.

Use pointers instead of indices to identify DirectoryLookup objects.

jansvoboda11 retitled this revision from [clang][lex] Keep search directory indices up-to-date to [clang][lex] Keep references to `DirectoryLookup` objects up-to-date.Jan 7 2022, 7:15 AM
jansvoboda11 edited the summary of this revision. (Show Details)

I like this a lot better. Some comments inline.

clang/include/clang/Lex/HeaderSearch.h
179–184

I haven’t tried it but is there a particular reason why this can’t be a const DirectoryLookup *?

clang/lib/Lex/HeaderSearch.cpp
323

Nitpick: SearchDirs[Idx] can be simplified to SearchDir->isFramework(). Similarly below.

clang/unittests/Lex/HeaderSearchTest.cpp
308

Wouldn’t it be cleaner to just check that UsedSearchDirs only contains a single element and that it’s name is /M2? In that case we could also remove getSearchDirUsage.

jansvoboda11 added inline comments.Jan 10 2022, 8:39 AM
clang/include/clang/Lex/HeaderSearch.h
179–184

While iterating over SearchDirs, the elements can be passed to HeaderSearch::loadSubdirectoryModuleMaps that mutates them.

clang/lib/Lex/HeaderSearch.cpp
323

SearchDir will be removed in the following patch: D113676.

clang/unittests/Lex/HeaderSearchTest.cpp
308

Maybe I'm misunderstanding you, but I don't think so. We'd still need accessor for HeaderSearch::UsedSearchDirs and we don't have the expected DirectoryLookup * lying around, making matching more cumbersome:

const llvm::DenseSet<DirectoryLookup *> &UsedSearchDirs =
    Search.getUsedSearchDirs();
EXPECT_EQ(UsedSearchDirs.size(), 2);
EXPECT_EQ(1, llvm::count_if(UsedSearchDirs, [](const auto *SearchDir) {
            return SearchDir->getName() == "/M1";
          }));
EXPECT_EQ(1, llvm::count_if(UsedSearchDirs, [](const auto *SearchDir) {
            return SearchDir->getName() == "/M2";
          }));

or

llvm::DenseSet<std::string> UsedSearchDirsStr;
for (const auto *SearchDir : Search.getUsedSearchDirs())
  UsedSearchDirsStr.insert(SearchDir->getName());
EXPECT_EQ(UsedSearchDirsStr, (llvm::DenseSet<std::string>{"/M1", "/M2"}));

I think having bit-vector, whose indices correspond to the directory names ("/M{i}"), and using operator== for matching is simpler.

Let me know if you had something else in mind.

ahoppen added inline comments.Jan 10 2022, 10:26 AM
clang/include/clang/Lex/HeaderSearch.h
179–184

OK. Makes sense. Thanks.

clang/lib/Lex/HeaderSearch.cpp
323

Ah, OK. In that case it makes sense to keep using SearchDirs[Idx].

clang/unittests/Lex/HeaderSearchTest.cpp
308

I just don’t like the bit-vectors and basically thought of the second option you were suggesting, but maybe that’s really just personal taste. If you’d like to keep the bit-vector, could you change the comment of getSearchDirUsage to something like

/// Return a vector of length \c SearchDirs.size() that indicates for each search directory whether it was used.

Update documentation, inline variables in test

jansvoboda11 marked 4 inline comments as done.Jan 11 2022, 5:26 AM

Thanks for the feedback!

clang/unittests/Lex/HeaderSearchTest.cpp
308

I've updated documentation for HeaderSearch::getSearchDirUsage.

ahoppen accepted this revision.Jan 11 2022, 5:40 AM
This revision is now accepted and ready to land.Jan 11 2022, 5:40 AM
This revision was landed with ongoing or failed builds.Jan 11 2022, 6:24 AM
This revision was automatically updated to reflect the committed changes.
jansvoboda11 marked an inline comment as done.
thakis added a subscriber: thakis.Jan 11 2022, 6:49 AM
  1. which is stable thanks to the bump-ptr-allocation strategy. I don't understand this. In each slab, that's true, but why is it true between objects allocated in different slabs?
  2. This increases numbers of TUs compiled for LexTests by over 10%. Is there no way around that Frontend dep?
  1. which is stable thanks to the bump-ptr-allocation strategy. I don't understand this. In each slab, that's true, but why is it true between objects allocated in different slabs?

That's referring to the fact that once we allocate new DirectoryLookup with SpecificBumpPtrAllocator, address of that object won't change (unlike with std::vector). This means that we can take the address and use it without worrying about invalidation.

  1. This increases numbers of TUs compiled for LexTests by over 10%. Is there no way around that Frontend dep?

I'll look into moving clang::ApplyHeaderSearchOptions from Frontend into Lex.

That's referring to the fact that once we allocate new DirectoryLookup with SpecificBumpPtrAllocator, address of that object won't change (unlike with std::vector). This means that we can take the address and use it without worrying about invalidation.

Ah, "stable" made me think of iteration order, but that's not what you meant. Makes sense, thanks!

carlosgalvezp added a subscriber: carlosgalvezp.EditedJan 13 2022, 4:55 AM

Hi @jansvoboda11 !

I believe this commit is the root cause of this issue: https://github.com/llvm/llvm-project/issues/53161

Believe it or not, I was (un)lucky enough to run into it. I have bisected the repo and my tests indicate that this commit is the offender. Do you have any obvious idea what could be the problem?

I'm suspecting this is the offender, since it's related to #include_next which is where I get the error:
https://github.com/llvm/llvm-project/commit/8503c688d555014b88849e933bf096035a351586#diff-fb96fa41fef79b66ab53ac1d845ab467defc41c2639e9b0cd60b471b04f898cdR981
To summarize, clang cannot find /usr/include only if I include exactly 511 directories via -I. If I include 510 or 512, then it works fine. I will test if the same happens with smaller powers of two. I have a test script that I can share if you want to test it yourself.

EDIT: the same happen when including exactly 255 directories. It does not happen for 127 and below. It does happen for 1023 and above.

Hi @jansvoboda11 !

I believe this commit is the root cause of this issue: https://github.com/llvm/llvm-project/issues/53161

Believe it or not, I was (un)lucky enough to run into it. I have bisected the repo and my tests indicate that this commit is the offender. Do you have any obvious idea what could be the problem?

I'm suspecting this is the offender, since it's related to #include_next which is where I get the error:
https://github.com/llvm/llvm-project/commit/8503c688d555014b88849e933bf096035a351586#diff-fb96fa41fef79b66ab53ac1d845ab467defc41c2639e9b0cd60b471b04f898cdR981
To summarize, clang cannot find /usr/include only if I include exactly 511 directories via -I. If I include 510 or 512, then it works fine. I will test if the same happens with smaller powers of two. I have a test script that I can share if you want to test it yourself.

EDIT: the same happen when including exactly 255 directories. It does not happen for 127 and below. It does happen for 1023 and above.

Hi, thanks for reporting this. The reproducer script would be helpful. Looking at the code, I don't see anything obviously wrong.

Hi @jansvoboda11 !

I believe this commit is the root cause of this issue: https://github.com/llvm/llvm-project/issues/53161

Believe it or not, I was (un)lucky enough to run into it. I have bisected the repo and my tests indicate that this commit is the offender. Do you have any obvious idea what could be the problem?

I'm suspecting this is the offender, since it's related to #include_next which is where I get the error:
https://github.com/llvm/llvm-project/commit/8503c688d555014b88849e933bf096035a351586#diff-fb96fa41fef79b66ab53ac1d845ab467defc41c2639e9b0cd60b471b04f898cdR981
To summarize, clang cannot find /usr/include only if I include exactly 511 directories via -I. If I include 510 or 512, then it works fine. I will test if the same happens with smaller powers of two. I have a test script that I can share if you want to test it yourself.

EDIT: the same happen when including exactly 255 directories. It does not happen for 127 and below. It does happen for 1023 and above.

Hi, thanks for reporting this. The reproducer script would be helpful. Looking at the code, I don't see anything obviously wrong.

Sure, I attach it do this message. It assumes you run from a place from which you can run ./build/bin/clang.

Let me know if I can provide anything else to help! It might not be this commit in particular, perhaps the problem is in some allocator class merged earlier.

Thank you!

I'm pretty sure it's this commit that causes the failure.

I now see that HeaderSearch::LookupFile callers perform pointer arithmetics on the out-parameter const DirectoryLookup *&CurDir. They rely on the fact that all DirectoryLookup instances are in contiguous memory. That was true when they were stored in std::vector<DirectoryLookup>, but since they are allocated in llvm::SpecificBumpPtrAllocator<DirectoryLookup> after this patch, that assumption is no longer true.

I have reverted the patch for now and will look into fixing this.

Great catch and thanks for the revert!

jansvoboda11 reopened this revision.Jan 14 2022, 1:13 AM

I'll need to figure out how to efficiently replace the pointer arithmetic performed by Preprocessor::LookupFile, or go back to my first solution: keeping std::vector<DirectoryLookup> and updating indices in AddSearchPath...

This revision is now accepted and ready to land.Jan 14 2022, 1:13 AM
jansvoboda11 planned changes to this revision.Jan 14 2022, 1:13 AM

Replace external uses of const DirectoryLookup * by an optional iterator. That's how they get treated by callers.

This revision is now accepted and ready to land.Jan 14 2022, 6:58 AM
jansvoboda11 requested review of this revision.Jan 14 2022, 6:58 AM
jansvoboda11 added inline comments.Jan 14 2022, 7:04 AM
clang/include/clang/Lex/HeaderSearch.h
405

This is the interesting change in the latest revision. Callers of this function performed pointer arithmetics on const DirectoryLookup *, essentially treating it as an iterator. Since DirectoryLookup objects are no longer stored in contiguous memory, that's unsafe. I'm using an iterator to allow doing that safely and using llvm::Optional to provide the nullability clients take advantage of.

I don't love the long name that causes a bunch of formatting changes. Maybe it's worth extracting it out of HeaderSearch and giving it shorter name. I also don't love that the API is not the same as for const DirectoryLookup *, leading to some "unnecessary" changes (e.g. CurDir = nullptr -> CurDir.reset()). I think it might be worth creating custom iterator type that would have the original API, but would behave correctly with the new memory setup.

Rebase on top of D117566.

jansvoboda11 edited the summary of this revision. (Show Details)Jan 18 2022, 8:03 AM
jansvoboda11 added inline comments.
clang/include/clang/Lex/HeaderSearch.h
405

Resolved in the latest revision by introducing D117566.