Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
since we modify the python script, could you please update its artifacts (StdSymbolMap.inc, CSymbolMap) as well?
Sorry for not being cleared. My suggestion was to generate the maps on the old data (2018-10-28), the reason is that we can see the diffs caused by the change of the python script, it would make the review easier. We plan to update the symbol mappings to latest version, I think we should do it in a followup patch.
clang/tools/include-mapping/gen_std.py | ||
---|---|---|
15–16 | nit: std::swap doesn't have variants, it is provided by multiple headers. |
Oh I misunderstood your comment regarding which maps need to be regenerated as part of this patch. Regenerated the 2018 ones now. As expected, there are only added symbols now.
clang/tools/include-mapping/cppreference_parser.py | ||
---|---|---|
174 | this is actually checking for something else (sorry for the confusing naming). the variant here refers to library name mentioned in parentheses (this is same problem as std::move) on the std symbol index page https://en.cppreference.com/w/cpp/symbol_index (e.g. remove<>() (algorithm)). by getting rid of this we're introducing a regression, as previously std::remove wouldn't be recognized by the library, but now it'll be recognized and we'll keep suggesting <cstdio> for it. so we should actually keep this around. |
Can you add a unittest in the StandardLibraryTest.cpp?
clang/include/clang/Tooling/Inclusions/StdSymbolMap.inc | ||
---|---|---|
100 ↗ | (On Diff #491260) | Conceptually, this (and other atomic_* symbols) doesn't feel correct:
They are variant symbols (ideally, they should be treat as the std::move()). The issue here is that unlike std::move which has two individual entries in the index page, we only have one entry for std::atomic (extending the cppreference_parser.py script to perfectly distinguish these two cases in the page seems non-trivial). Some options:
@kadircet, what do you think? |
369 ↗ | (On Diff #491260) | this one as well, both headers provide different overloads. |
1053 ↗ | (On Diff #491258) | I think std::remove should not be in the scope of the patch, it has two variants:
|
clang/lib/Tooling/Inclusions/Stdlib/StandardLibrary.cpp | ||
20–22 | int => unsigned. Any reason to change it to a map? My understanding is that SymbolNames is a mapping from the SymbolID => {Scope, Name}, it should not be affected if we add multiple-header symbols. | |
57 | int => unsigned. | |
58 | Given the fact that multiple-header symbols are grouped in the .inc file, we could simplify the code of checking a new symbol by looking at the last available SymIndex: if (NextAvailSymIndex > 0 && SymbolNames[NextAvailSymIndex-1].first == NS && SymbolNames[NextAvailSymIndex-1].second == Name) { // this is not a new symbol. } | |
69 | nit: inline the HeaderID. | |
103 | The ID field is guaranteed to be valid, so no need to do the sanity check. |
Re-introduce the special std::remove handling for now.
clang/tools/include-mapping/cppreference_parser.py | ||
---|---|---|
174 | Ok, I can keep this out of this patch, but we'll have to remove this logic evetually when we deal with overloads. I have a slight suspicion that this code might be buggy, because it suggests that one _of_ the variants should be accepted. What is does in reality, though, is it keeps algorithm in the list of headers suitable for std::remove alongside cstdio, and then in the last step std::remove is ignored by the generator because of being defined in two headers. With this patch, the result will be both {cstdio, algorithm}. Is this (more) satisfactory for now compared to skipping algorithm due to being an overload? |
clang/lib/Tooling/Inclusions/Stdlib/StandardLibrary.cpp | ||
---|---|---|
20–22 | Right. I am not sure what I was thinking. Possibly I was thinking that symbol IDs won't be continuous anymore after this change, but they actually are. Thanks! | |
58 | I know this is easier in terms of code, but this heavily relies on the order in the generated files. I would prefer to keep the library and the generator as decoupled as possible, even if it means slightly more complex code here. Overall, it seems more future-proof in case of unrelated generator changes (bugs?) that might change the order. | |
103 | This is all gone now, since the code is back to the array version of SymbolNames. |
clang/include/clang/Tooling/Inclusions/StdSymbolMap.inc | ||
---|---|---|
100 ↗ | (On Diff #491260) | right, i believe this patch is definitely adding keeping multiple headers for a symbol around, but mixing it with keeping headers for variants (e.g. overloads provided by different headers, or specializations as mentioned here). we definitely need some more changes in parser to make sure we're only preserving alternatives for the same symbol and not for any other variants (overloads, specializations). IIRC there are currently 2 places these variants could come into play:
in the scope of this patch, we should keep ignoring both. |
clang/tools/include-mapping/cppreference_parser.py | ||
167 | nit: formatting | |
174 |
Surely, I wasn't saying this should stay here forever, i am just saying that what's done in the scope of this patch doesn't really address the issues "worked around" by this piece.
right, it's because we have logic to prefer "non-variant" versions of symbols when available (i.e. in the absence of this logic, we'd prefer std::remove from cstdio). this logic enables us to preserve certain variants (in addition to non-variants). that way we treat std::remove as ambigious rather than always resolving to <cstdio>, hence it's marked as "missing", similar to std::move.
in the end this should probably look like {algorithm, cstdio}, but as mentioned elsewhere, this is not the same problem as "same symbol being provided by multiple header" but rather "different symbols having same name and different headers". so treatment of it shouldn't change by this patch. |
clang/include/clang/Tooling/Inclusions/StdSymbolMap.inc | ||
---|---|---|
100 ↗ | (On Diff #491260) | I suggest to special-case the overloads for now, just not to solve all the problems in the same patch. |
369 ↗ | (On Diff #491260) | I suggest to special-case the overloads for now, just not to solve all the problems in the same patch. |
clang/tools/include-mapping/cppreference_parser.py | ||
174 | On second thought, I think we'd rather special-case the overloads for now (until we get to them). See the newest patch version. |
clang/include/clang/Tooling/Inclusions/StdSymbolMap.inc | ||
---|---|---|
1053 ↗ | (On Diff #491258) | Yes, agreed. Let's take care of the overloads going forward. |
clang/tools/include-mapping/cppreference_parser.py | ||
174 |
Where is this logic? AFAICS the generator in the current state doesn't generate anything for std::remove. |
clang/include/clang/Tooling/Inclusions/StdSymbolMap.inc | ||
---|---|---|
100 ↗ | (On Diff #491260) | The first group are already ignored. We need to take a lot at how to ignore the second one. |
clang/include/clang/Tooling/Inclusions/StdSymbolMap.inc | ||
---|---|---|
100 ↗ | (On Diff #491260) | Ideally, we should tweak the _ParseSymbolPage to handle this case, but I don't see a easy way to do it (the atomic case is quite tricky where the symbol name is identical, only the template argument is different, and the simple text-match heuristic in _ParseSymbolPage fails in this case). so specialcasing these (there are not too many of them) looks fine to me. |
106 ↗ | (On Diff #491407) | Looks like the regex filters too many symbols -- e.g. atomic_fetch_add is not a variant symbol, we should keep it instead. The same to other following symbols as well. |
clang/lib/Tooling/Inclusions/Stdlib/StandardLibrary.cpp | ||
58 |
Yeah, this is true. I actually think we should make it as an invariant (multiple-header symbols are grouped) of the generated .inc files. This invariant is important and useful, it is much easier for human to read and justify. We can probably guarantee it in the generator side. | |
clang/tools/include-mapping/gen_std.py | ||
101 | I'd use a list of exact symbol names (the number of them should not be huge), to avoid filtering out symbols by accident (see my other comment). |
clang/include/clang/Tooling/Inclusions/StdSymbolMap.inc | ||
---|---|---|
100 ↗ | (On Diff #491260) | I don't follow why we can't perform this detection directly, haven't been taking a look at the details but the page looks like: Defined in header <atomic> template< class T > struct atomic; template< class U > struct atomic<U*>; Defined in header <memory> template< class U > struct atomic<std::shared_ptr<U>>; template< class U > struct atomic<std::weak_ptr<U>>; Defined in header <stdatomic.h> #define _Atomic(T) /* see below */ So _ParseSymbolPage, first sees <atomic> then a bunch of decls, then <memory> then a bunch more decls and so on, and in the end it returns all the headers it has found. Why can't we change the logic here to return nothing when there are different decls && headers? i.e. return empty if we see a new header after seeing some declarations. |
clang/tools/include-mapping/cppreference_parser.py | ||
---|---|---|
174 |
It's the logic that you're deleting: if variant and variant not in variants_for_symbol: continue we ignore any symbols that has a variant we shouldn't accept and always prefer the non-variant versions. to be more concrete when parsing c++ symbol index we'll see two alternatives for std::remove: remove() remove<>() (algorithm) first one is a non-variant, hence it's accepted by default. second one is algorithm variant, and is accepted per this deleted logic. in the end we used to didn't generate anything because we now have multiple headers providing std::remove. i think it creates more confusion to special case only std::remove down below, and not other symbols whose non-variant versions we accept, e.g. sin also has non-variant versions, but we don't "special case" it to keep it out of the map. i'd suggest treating std::remove similar to other symbols with an accepted variant, rather than creating a divergence. so my suggestion is changing the logic above to only accept variants mentioned in the map when there's a mapping for the symbol (and reject the non-variant version). net effect will be that we'll now include the mapping from std::remove to <algorithm>, which is currently inline with our behaviour against other symbols with different variants. we try to pick the most common variant, and that's almost always the "non-variant" version, apart from std::remove. that way we should get rid of the special casing below completely. does that make sense? | |
clang/tools/include-mapping/gen_std.py | ||
101 | different headers providing std::swap don't provide different variants of it, these are just ADL extension points, similar to std::begin (which also shouldn't be missing from the mapping). for remove, i've provided more context in the discussion above, about keeping a different variant of std::remove around. |
Fix "atomic" and "div" overloads programmatically. Add a FIXME for future proper handling of overloads.
clang/include/clang/Tooling/Inclusions/StdSymbolMap.inc | ||
---|---|---|
106 ↗ | (On Diff #491407) | They're all back now. |
100 ↗ | (On Diff #491260) | Ok, I've fixed (or hacked :) this programmatically. |
369 ↗ | (On Diff #491260) | Should be fixed now, together with "atomic.*" cases. |
Remove special casing from gen_std.py. Re-introduce special handling for std::remove.
Thank you all for comments! The patch should be ready for the next review now.
clang/include/clang/Tooling/Inclusions/StdSymbolMap.inc | ||
---|---|---|
1053 ↗ | (On Diff #491258) | The latest version keeps the std::remove overload from <algorithm>. |
clang/lib/Tooling/Inclusions/Stdlib/StandardLibrary.cpp | ||
58 | Ok, sure. Sorted the symbols in the generator now and also applied the snippet from above. I'm not 100% sure the code became much simpler but this version seems fine too :) | |
clang/tools/include-mapping/cppreference_parser.py | ||
174 | Ok, I've tweaked the logic to keep std::remove from <algorithm> now. | |
clang/tools/include-mapping/gen_std.py | ||
101 | Thank you. The latest version should take care both of the overloads that have accidentally made it to this patch and of the std::remove case. |
I haven't read though all the change of _ParseSymbolPage, left some comments.
I think we need to update&add proper tests in include-mapping/test.py as we has changed the parser.
clang/include/clang/Tooling/Inclusions/StdSymbolMap.inc | ||
---|---|---|
1162 ↗ | (On Diff #492413) | Is this intended? it seems to me the cppparser doesn't handle this case correctly, in the swap symbol page, we have two headers in a single t-dsc-header tr, the parser should consider both (like the above size_t). Defined in header <algorithm> Defined in header <utility> |
clang/lib/Tooling/Inclusions/Stdlib/StandardLibrary.cpp | ||
58 | We can avoid a local variable by auto Add = [.. SymIndex(-1)] (....) { if (SymIndex >=0 && SymbolNames[SymIndex].first == NS && ...) { } else { // the first symbol, or new symbol. ++SymIndex; } SymbolNames[SymIndex] = {NS, Name}; .... } | |
clang/tools/include-mapping/cppreference_parser.py | ||
39 | If the header_to_accept is set, I think we can just return {header_to_accept} directly. | |
104 | IIUC, this is the major and intended change of the function, this is not a FIXME. | |
113 | The comment doesn't match the current behavior. IIUC, iof the symbol was never named, we only consider the first header now. | |
115 | why sort the headers? |
clang/include/clang/Tooling/Inclusions/StdSymbolMap.inc | ||
---|---|---|
205 ↗ | (On Diff #492413) | i think we should have other providers here, https://en.cppreference.com/w/cpp/iterator/begin. do we know why they're dropped? |
clang/tools/include-mapping/cppreference_parser.py | ||
133–134 | variant is not necessarily the header, eg: acos() acos<>() (std::complex) (since C++11) acos<>() (std::valarray) in this case variants are std::complex and std::valarray. hence we're not trying to "infer" the header we want to preserve but rather decide on which symbol page we want to parse. we should still accept "all the headers" mentioned in that variant symbol page. |
Thanks for the comments! AFAICS I've addressed all of them.
Re tests, thanks for the reminder @hokein! I've fixed them now, everything is green.
clang/include/clang/Tooling/Inclusions/StdSymbolMap.inc | ||
---|---|---|
205 ↗ | (On Diff #492413) | Yes, this list is the regeneration of the 2018 html book. It's explicitly _not_ generated from the 2022 book in order to have a clear diff related to the topic of this patch. |
1162 ↗ | (On Diff #492413) | You're right. There's another random class name (t-dcl-rev-aux) that needs to be skipped. I've changed the condition in line 81 of cppreference_parser.py (the loop that skips unneeded rows between the header block and the symbol block) to negation in order to make it more robust and avoid listing all these unneeded classes unnecessarily. |
clang/lib/Tooling/Inclusions/Stdlib/StandardLibrary.cpp | ||
58 | Ok, sure. | |
clang/tools/include-mapping/cppreference_parser.py | ||
39 | We'd need to add <> then. It needs a couple of extra lines here or there anyways, I don't think it makes a difference. | |
104 | I guess it depends on how you view it: I meant it as FIXME in the context of overloads, i.e., we might want to remove this and add infrastructure to differentiate between overloads explicitly. | |
113 | I've removed all the hacking with all_headers now, so the comment is valid again. | |
115 | This was a hack to make the generator return atomic rather than memory for std::atomic.* symbols. But I've noticed now that this adds some trouble to the C symbol map. I think now that cutting all_headers to the first element is wrong. Note that the problem is not std::atomic itself (for which Kadir extensively described the solution in some of the other comments), but rather atomic_bool, atomic_char etc. mentioned further in the page. The headers for them come from all_headers rather than symbol_headers, and there seems to be no way to limit them to <atomic> without other collateral effects. | |
133–134 | Ah ok, I was mostly looking at std::remove (algorithm). Ok, I will use the term "variant" as before. Thanks. |
clang/tools/include-mapping/cppreference_parser.py | ||
---|---|---|
115 |
What's exactly breaking once you do that? (I guess by first element you still mean "all the headers we've seen until the first declaration in the page")
I guess this is happening because we're trying to find headers that are for the declaration block containing the interesting symbol name. Any idea if that logic is winning us anything? e.g. if we dropped the if symbol_name in found_symbols: what does change? | |
133–134 | sorry i wasn't just trying to nitpick on the name of the variable/parameters. i was trying to say _ReadSymbolPage should not be a function of variant_to_accept. later on when we're parsing a symbol page, the name of the variant shouldn't effect the headers we're picking in any way (especially in the way you're relying right now that says variant name and the header name will be the same, as i pointed out this might as well be std::complex). Is this trying to work around a different problem? (especially around std::atomic_.* I assume, i'll follow up on that thread) | |
160 | there's no variant of "std::atomic.*" called "atomic", in https://en.cppreference.com/w/cpp/symbol_index. is it something in 2018 version of the html book? otherwise there's only the unnamed variant and std::shared_ptr variant, and we should be preferring unnamed variant. |
Thanks for the comments!
clang/tools/include-mapping/cppreference_parser.py | ||
---|---|---|
115 |
No, I mean the first element of the all_headers collection. all_headers is a collection that collects _all_ the headers in the page and, in case a symbol is not defined in a table with an individual header block, all_headers in the page are returned as an approximation.
It is essentially arbitrary to limit all_headers to the first element (at the moment it would be the first element alphabetically). E.g., for the INT8_C and other constants here, the result would be inttypes.h and stdint.h would be cut, so this result doesn't make sense.
Then all the headers in the table will be attributed to all the symbols in the table. Consider this page: std::imaxdiv would be reported as defined both in cstdlib and cinttypes, which is wrong.
Not really. Headers for atomic_bool, atomic_char etc. come from all_headers rather than symbol_headers. The reason for that is they are defined in a table which doesn't have its own header block, so we are trying to approximate the headers by naming all the headers in the file. | |
133–134 | This is obsolete. As discussed offline, I've removed all the special-casing logic. All the variants are skipped now without exceptions. | |
160 | This is obsolete. As discussed offline, I've removed all the special-casing logic. All the variants are skipped now without exceptions. |
As discussed offline, we decided to stop spending effort on improving the cppreference_parser.py, instead, we allow human edits in the generated StdSymbolMap.inc and CSymbolMap.inc, it gives us more flexibility: sort the headers for multi-header symbol in a humn-desired order rather than suboptimal alphabet order; symbols (e.g. std::experimental::filesystem) that are not available in cppreference website; symbols (e.g. PRIX16) from a complicated cppreference HTML page which is hard to parse in cppreference_parser.py etc.
We have two major things in this patch:
- extend the Stdlib API to support multiple-header symbols
- emits multiple-header symbols in the *.inc files
For 1), this change is good. We need to make sure the change not break other users of the .inc files. clangd is the main user, the behavior will be changed in https://github.com/llvm/llvm-project/blob/main/clang-tools-extra/clangd/index/CanonicalIncludes.cpp#L736 as now we have duplicated qualified names (the ideal solution is to ignore multiple-header symbols, I think it is fine now if we just add a single std::size_t symbol, see my comment below)
For 2), I have some concerns, it looks like some multiple-headers symbols that are handled incorrectly by the cppreference_parser.py (see my other comments). I think the best way for us is to list them manually (no action required, I will make a followup)
I'd suggest to narrow the scope of the patch:
- keep 1) and add a unittest in llvm-project/clang/unittests/Tooling/StandardLibraryTest.cpp;
- drop all generated multi-header symbols in *.inc files, instead, we manually add a single one std::size_t symbol to the StdSymbolMap.inc for the StandardLibrary unittest. This means we need to revert the current change of gen_std.py;
clang/include/clang/Tooling/Inclusions/CSymbolMap.inc | ||
---|---|---|
80 ↗ | (On Diff #493565) | These symbols seem to only have a single header, I think it is an improvement effect of the new change of parsing logic in cppreference_parser.py. |
96 ↗ | (On Diff #493565) | This type of symbol (INTX_C) seems not correct, they are only from <stdint.h> -- (this is a limitation of our current cppreference parser (it can not handle the complex page: https://en.cppreference.com/w/c/types/integer) |
172 ↗ | (On Diff #493565) | I think this kind of symbol (PRIdN, PRIdLEASTN, PRIdFASTN, PRIdMAX, PRIdPTR etc) is only from <inttypes.h> (another limitation of the cppreference parser). |
clang/lib/Tooling/Inclusions/Stdlib/StandardLibrary.cpp | ||
34 | It is sad that we're allocating more memory for these two arrays now (as we have multiple SYMBOLs for the same qualified name). No action needed, can you add a comment saying that we're allocate more space here? | |
clang/tools/include-mapping/test.py | ||
56 ↗ | (On Diff #493565) | The test needs to update as well, as you remove the change of _ParseSymbolPage. |
Thanks, it looks mostly good. This needs some work on rebase (as we have landed a few patches that change StandardLibrary.cpp file). I will rebase it and land it for you.
- Rebase to main, and some small tweaks
- unittest will be added when landing the https://reviews.llvm.org/D143280
int => unsigned.
Any reason to change it to a map? My understanding is that SymbolNames is a mapping from the SymbolID => {Scope, Name}, it should not be affected if we add multiple-header symbols.