This is an archive of the discontinued LLVM Phabricator instance.

[clang-doc] Move file layout to the generators.
ClosedPublic

Authored by brettw on Nov 15 2022, 3:59 PM.

Details

Summary

Previously file naming and directory layout was handled on a per Info object basis by ClangDocMain and the generators blindly wrote to the files given. This means all generators must use the same file layout and caused problems where multiple objects mapped to the same file. The object collision problem happens most easily with template specializations because the template parameters are not part of the "name".

This patch moves the responsibility for output file organization to the generators. Currently HTML and MD use the same structure as before. But they now collect all objects that map to a given file and combine them, avoiding the corruption problems.

Converts the YAML generator to naming files based on USR in one directory. This is easier for downstream tools to manage and avoids the naming problems with template specializations. Since this change requires backward-incompatible output changes to referenced files anyway (since each one is now an array), this is a good time to introduce this change.

Diff Detail

Event Timeline

brettw created this revision.Nov 15 2022, 3:59 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 15 2022, 3:59 PM
brettw requested review of this revision.Nov 15 2022, 3:59 PM
brettw edited the summary of this revision. (Show Details)Nov 15 2022, 3:59 PM

If you're curious, you can see the simplification the YAML output format change makes in the consuming code:
https://fuchsia-review.git.corp.google.com/c/fuchsia/+/760570/2/tools/cppdocgen/clangdoc/clangdoc.go
Since it allows us to remove some special-casing code for loading things in the global namespace and for anonymous records.

brettw updated this revision to Diff 475869.Nov 16 2022, 10:33 AM

Updated the lit tests.

Can you document the new directory structure with a small example? It would also be good to document in the code that some of these changes were driven by file corruption, and if possible point to a bug on github.

clang-tools-extra/clang-doc/HTMLGenerator.cpp
860

Nit: necessary

890–892

So does this patch make HTML unusable? Currently, if there is a conflict, the file gets replaced, right?

I read this as we're going from a situation where we lost data, but still had valid HTML pages to one where we have complete, but incorrect HTML, is that correct?

Also, why does that happen under template specialization? USRs are SHA1 hashes of the symbol. I would expect that hashing the symbol would be unambiguous and unique, given C++'s ODR.

That last point made me wonder if an alternate way to handle this without using the hash would be to use the mangled names?

893

Nit: separately

clang-tools-extra/clang-doc/MDGenerator.cpp
382

nit: necessary

422–435

Why did this move? Nothing has changed in its implementation, right?

clang-tools-extra/test/clang-doc/single-file-public.cpp
6–9

Its OK to use ls and grep in the runline, but you cannot use the backtick syntax. It will cause problems on other platforms and I can't find any other usage of that in our codebase. TBH I'm a bit surprised this works, since lit has some interesting behavior around RUN.

I think you can just use cat < ls %t/docs | grep -v index.yaml instead. I haven't tried it, but that is certainly a more typical method of doing this in a lit test. Using find may also work, but I only found one use in llvm/test/ExecutionEngine/MCJIT/load-object-a.ll.

brettw updated this revision to Diff 476183.Nov 17 2022, 10:49 AM
brettw marked 3 inline comments as done.
brettw added inline comments.
clang-tools-extra/clang-doc/HTMLGenerator.cpp
890–892

I think this makes HTML strictly better (though not ideal).

Previously if there was a collision, they got written over the top of each other without truncation (not sure why). So if the longer page was written second, you got lucky and you get a complete valid page and just lost the shorter struct's definition. If you got unlucky the shorter page was written second and you get a corrupted file with the shorter content, followed by the end of the longer content peeking out from the end.

With this change, you get both content concatenated without corruption, you just have duplicate doctype and title tags in between them. Browsers should generally handle this so the page should be usable.

Fixing this requires a refactor of the HTML generator such that the concept of writing a struct is separate from writing a page. Even if I was going to work on the HTML generator, I would probably want to do this in a separate patch anyway.

The markdown generator does the same, but markdown doesn't have all of the headers stuff that shouldn't be in the middle of the file, you just get a new # ... for the second section and it should be reasonable.

The HTML and MD generators name files according to the Name of the structure. USR is not involved. So any time to things in the same namespace have the same name, you'll get a collision. This happens most commonly with template specialization because the names are identical (the name doesn't account for the template parameters).

All of this complexity is why I changed the YAML output to use USR which avoids the problem. I didn't want to make this change for the HTML and MD generators because it will make ugly file names that the user will see, and you probably want the template specializations to be in the same file anyway.

clang-tools-extra/test/clang-doc/single-file-public.cpp
6–9

Your example doesn't work (it's trying to load a file named "ls") but I found another solution using xargs (which is hopefully supported on all of the platforms).

paulkirth added inline comments.Nov 17 2022, 1:05 PM
clang-tools-extra/clang-doc/HTMLGenerator.cpp
890–892

I think this makes HTML strictly better (though not ideal).

Previously if there was a collision, they got written over the top of each other without truncation (not sure why). So if the longer page was written second, you got lucky and you get a complete valid page and just lost the shorter struct's definition. If you got unlucky the shorter page was written second and you get a corrupted file with the shorter content, followed by the end of the longer content peeking out from the end.

With this change, you get both content concatenated without corruption, you just have duplicate doctype and title tags in between them. Browsers should generally handle this so the page should be usable.

Thanks for the explanation. My concern was that this change made HTML unusable, if it is still valid (if not ideal) then I don't think there is an issue.

Fixing this requires a refactor of the HTML generator such that the concept of writing a struct is separate from writing a page. Even if I was going to work on the HTML generator, I would probably want to do this in a separate patch anyway.

No arguments re splitting up work. Can you file an issue to track this on github and reference it in the TODO? That should help make sure this gets addressed.

The markdown generator does the same, but markdown doesn't have all of the headers stuff that shouldn't be in the middle of the file, you just get a new # ... for the second section and it should be reasonable.

The HTML and MD generators name files according to the Name of the structure. USR is not involved. So any time to things in the same namespace have the same name, you'll get a collision. This happens most commonly with template specialization because the names are identical (the name doesn't account for the template parameters).

I think there should be a way to obtain the mangled name from whatever you're looking at. Those should be available, even from the AST, so if you've found a template specialization, it should be possible to get the mangled name (even for partial specializations). This shouldn't necessarily block you, but it seems like that would solve the issue nicely for all three generators.

clang-tools-extra/test/clang-doc/single-file-public.cpp
6–9

I don't think we require xargs, so I'm not sure you can rely on that. We do require find, so that may be a better choice, and you can even have it invoke cat directly.

See: https://llvm.org/docs/GettingStarted.html#software

brettw updated this revision to Diff 476612.Nov 18 2022, 3:10 PM
brettw marked an inline comment as done.
brettw added inline comments.
clang-tools-extra/clang-doc/HTMLGenerator.cpp
890–892

TODO added.

I don't think the mangled names helps both because I think you want the file names to be readable (the mangled names are barely better than the USR) and because I think you also want the specializations to all end up in the same file for end-user documentation. I discussed this in the bug.

clang-tools-extra/test/clang-doc/single-file-public.cpp
6–9

I found a find invocation that worked. I couldn't get it to recognize the "repeats 40 times" regex syntax so I copy-pasted [0-9A-Z] 40 times like in the body of the test.

paulkirth accepted this revision.Nov 18 2022, 4:32 PM

This is mostly LGTM, modulo the small change to the test that I've suggested. Assuming the presubmit tests still pass with that small edit, you're good on my end.

clang-tools-extra/clang-doc/HTMLGenerator.cpp
890–892

TODO added.

I don't think the mangled names helps both because I think you want the file names to be readable (the mangled names are barely better than the USR) and because I think you also want the specializations to all end up in the same file for end-user documentation. I discussed this in the bug.

Name mangling less than ideal, but it's an established part of C++. I don't know how important it is that file names be completely readable. Lots of web URLs are not, and have little impact on users. Mangled names are also commonly used in doxygen's HTML generation, so there is a precedent.

Also, I'm not sure I follow your argument that file names should be readable, but you're willing to use a SHA1 as a file name? Mangled names may not be completely intuitive, but you can quickly figure out how to read them, where as you cannot do so with a SHA.

My questions about this pertain to a more unified solution that could work across generators. Do you think that if we just substituted the mangled names, we would no longer have the file corruption? From your description of the underlying problem, I would assume "yes", but I'd like your opinion. We can ignore the question about whether they should be grouped under a single file for now, as I view that as a different, but related issue. I haven't' had time to read the issue on github yet, but will soon.

clang-tools-extra/test/clang-doc/single-file-public.cpp
9
clang-tools-extra/test/clang-doc/single-file.cpp
14

I don't expect to fix these, since they're not part of your change, but this is a more concise way to do the same check for a USR using the FileCheck regular expressions. I doubt you need it, but just in case.

This revision is now accepted and ready to land.Nov 18 2022, 4:32 PM
brettw updated this revision to Diff 476955.Nov 21 2022, 11:16 AM
brettw marked 2 inline comments as done.
brettw added inline comments.
clang-tools-extra/clang-doc/HTMLGenerator.cpp
890–892

I would assume mangled names would resolve the template override issue in my particular use-case but I don't think this is a good way to go:

  • I assume you could apply the name mangling algorithm to the record name, but mangling type names is weird to me because this doesn't correspond to anything the compiler generates since type names aren't linked, only their contribution to function and static variable names.
  • I'm not using clang-doc this way, but the design seems to be with the libtooling stuff that you would apply it to your repository. By default, this will cross executable and shared library boundaries which can cause name collisions even in the absence of ODR violations in the linked binaries.
  • Although you're not supposed to write ODR violations, these do exist in real code and I think clang-doc should not generate corrupt output if it can be easily avoided.
clang-tools-extra/test/clang-doc/single-file.cpp
14

I fixed it anyway, I actually found this confusing because I thought there must be some reason why it was expressed this way.

This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Nov 22 2022, 11:09 AM

This breaks tests on mac:
http://45.33.8.238/macm1/49211/step_8.txt

(and possibly win; that's still cycling.)

Please take a look and revert for now if it takes a while to fix.

(Also, 👋 Mr W!)

@thakis thanks for the report. I've reverted for now. @brettw can you take a look? I'm a bit surprised that passed on the Linux and windows bot, but failed on Mac.

@thakis thanks for the report. I've reverted for now. @brettw can you take a look? I'm a bit surprised that passed on the Linux and windows bot, but failed on Mac.

Probably a BSD find vs GNU find issue.

paulkirth reopened this revision.Nov 22 2022, 6:49 PM
paulkirth added inline comments.
clang-tools-extra/test/clang-doc/single-file-public.cpp
9

since this failed, maybe:

// RUN: find %t/docs -regex ".*/[0-9A-F]*.yaml" -exec cat {} \; | FileCheck %s

That works locally for me on both mac and linux. you lose the nice check for 40 chars, but at least it functions. if its very important, you can always have a single run line that does ls and grep.

FYI: the BSD find accepts the failing regex w/o -regextype sed. It's unfortunate that GNU find doesn't have the same default regex, and doesn't accept that version w/o the -regextype being set.

This revision is now accepted and ready to land.Nov 22 2022, 6:49 PM
brettw updated this revision to Diff 478683.Nov 29 2022, 12:37 PM
This revision was automatically updated to reflect the committed changes.