This is an archive of the discontinued LLVM Phabricator instance.

For --relativenames, handle dwarf absolute include directories similarly to compilation directories.
ClosedPublic

Authored by saugustine on May 18 2020, 3:56 PM.

Diff Detail

Event Timeline

saugustine created this revision.May 18 2020, 3:56 PM

Ping on this?

Ping on this?

Hi @saugustine,

Unless something is urgent, the usual practice is to ping after a week rather than 24 hours, as many people have big piles on their review plate, so can't always get to it the next day.


The code change looks reasonable, but I might be missing something - how does the test change actually test the new behaviour? I also feel like it would make more sense to test the getFileNameByIndex directly using a gtest unit test, since other code beyond the symbolizer might need to use the RelativeFilePath stuff in the future.

aprantl added inline comments.
llvm/lib/DebugInfo/DWARF/DWARFDebugLine.cpp
1246

directory

1249

Is !IncludeDir.empty() implied by isPathAbsoluteOnWindowsOrPosix(IncludeDir)?

1250

It's not obvious to me how returning FileName is the right thing here as I'm unfamiliar with this code. Can you explain the mechanism a little?

saugustine marked an inline comment as done.

Address comments

saugustine marked 2 inline comments as done.May 21 2020, 9:52 AM
saugustine added inline comments.
llvm/lib/DebugInfo/DWARF/DWARFDebugLine.cpp
1250

Dwarf saves file names in three pieces:

Compilation directory (This is stored in the TU, and just implied as entry zero in the include directory list of the line table. Dwarf 5 makes it explicit in the directory list). It is an absolute path
Include Directory. This can be an absolute path and is referenced at the filename level with an index into the table.
File. In spite of the name, this could be a full path, a partial path, or just the filename.

The normal case to find the absolute path to a file is:

compilation directory + include directory + filename

But a producer could have stored the full path in include directory + filename, or just plain filename. Different producers can do different things, so whatever happens here will be a bit of a heuristic.

The immediate case is that if a file has an include directory of zero under dwarf 5, that is the same as the compilation directory, and it should not be added relative names. This allows for a consumer to set a single base directory for finding the files, even if the root has changed.

But more generally, if the producer chose to use an absolute path as the include directory, that probably means that the include directory was specified as an absolute path on the command line (at least not relative to the compilation directory). If, under relative names we return the filename as an absolute path here, the consumer can't specify a different root for this set of files without quite a bit more guesswork.

Hope this helps.

Yeah, I'm kind of inclined to agree with @aprantl here - when the client asks for relative names I could see how that'd be reasonable to say "don't add the compilation dir" but I'm not sure what it means for other paths/doesn't necessarily seem reasonable for it to mean "give me back the unqualified name" even if it's relative to some other/unknown directory.

How would the client then know where to find this name relative to?

The immediate case is that if a file has an include directory of zero under dwarf 5, that is the same as the compilation directory, and it should not be added relative names. This allows for a consumer to set a single base directory for finding the files, even if the root has changed.

I /think/ I'd be open to special casing directory zero for this

But more generally, if the producer chose to use an absolute path as the include directory, that probably means that the include directory was specified as an absolute path on the command line (at least not relative to the compilation directory). If, under relative names we return the filename as an absolute path here, the consumer can't specify a different root for this set of files without quite a bit more guesswork.

It seems like a client that ended up with this response would have to resolve all relative names relative to both the new compilation dir/root, and any include paths (which it can't determine from the DWARF at all) - that doesn't seem to me like a great state of affairs? But legit question what asking for relative names means. I think "relative to the compilation dir, but otherwise absolute" makes some sense - you have system directories at system-known locations, but your source tree might be moving around on different distributed build machines, etc. Everything relative to a variety of different directories seems difficult to work with in practice.

Update for upstream comments.

Redo the test to better reflect the functionality.

Why is it necessary to check in the binary? Could we just assemble the input with llvm-mc on the fly?

Yeah, I'm kind of inclined to agree with @aprantl here - when the client asks for relative names I could see how that'd be reasonable to say "don't add the compilation dir" but I'm not sure what it means for other paths/doesn't necessarily seem reasonable for it to mean "give me back the unqualified name" even if it's relative to some other/unknown directory.

How would the client then know where to find this name relative to?

The same way something finds the name relative in a compilation relative to -I/foo/bar. But I'm not terribly attached to this, so fine.

The immediate case is that if a file has an include directory of zero under dwarf 5, that is the same as the compilation directory, and it should not be added relative names. This allows for a consumer to set a single base directory for finding the files, even if the root has changed.

I /think/ I'd be open to special casing directory zero for this

This works fine for the deeper use case I have in mind, but unfortunately, for the test case under dwarf4, llvm uses a a full path for the include directory, and a plain filename for the file, so although it does actually behave the way you are specifying above, it doesn't actually test the relativenames part:

/usr/local/google/home/saugustine/llvm-new/llvm-project/llvm/test/tools/llvm-symbolizer

But more generally, if the producer chose to use an absolute path as the include directory, that probably means that the include directory was specified as an absolute path on the command line (at least not relative to the compilation directory). If, under relative names we return the filename as an absolute path here, the consumer can't specify a different root for this set of files without quite a bit more guesswork.

It seems like a client that ended up with this response would have to resolve all relative names relative to both the new compilation dir/root, and any include paths (which it can't determine from the DWARF at all) - that doesn't seem to me like a great state of affairs? But legit question what asking for relative names means. I think "relative to the compilation dir, but otherwise absolute" makes some sense - you have system directories at system-known locations, but your source tree might be moving around on different distributed build machines, etc. Everything relative to a variety of different directories seems difficult to work with in practice.

Why is it necessary to check in the binary? Could we just assemble the input with llvm-mc on the fly?

Because when you assemble with llvm-mc in the usual test suite, it sets the compilation directory, include directory, and input file like so:

comp_dir: location where llvm-mc is run.
include_dir: absolute path to relativenames.s in the source tree
filename: "relativenames.s"

So, after the comments above that we *don't* want the remove the include dir, even if it is an absolute path, the difference between the output with relative names and without is nothing: We still get the absolute path.

Why is it necessary to check in the binary? Could we just assemble the input with llvm-mc on the fly?

Because when you assemble with llvm-mc in the usual test suite, it sets the compilation directory, include directory, and input file like so:

comp_dir: location where llvm-mc is run.
include_dir: absolute path to relativenames.s in the source tree
filename: "relativenames.s"

So, after the comments above that we *don't* want the remove the include dir, even if it is an absolute path, the difference between the output with relative names and without is nothing: We still get the absolute path.

You don't actually have to use the .loc/.file/etc. directives to create the line table. You can go lower level and spell out the debug_lines content directly (.byte et al.). That way, you should be able to create any line table you want. There are a bunch of existing tests for llvm-dwarfdump line table functionality that do this so you can get some inspiration from those.

Why is it necessary to check in the binary? Could we just assemble the input with llvm-mc on the fly?

Because when you assemble with llvm-mc in the usual test suite, it sets the compilation directory, include directory, and input file like so:

comp_dir: location where llvm-mc is run.
include_dir: absolute path to relativenames.s in the source tree
filename: "relativenames.s"

So, after the comments above that we *don't* want the remove the include dir, even if it is an absolute path, the difference between the output with relative names and without is nothing: We still get the absolute path.

You don't actually have to use the .loc/.file/etc. directives to create the line table. You can go lower level and spell out the debug_lines content directly (.byte et al.). That way, you should be able to create any line table you want. There are a bunch of existing tests for llvm-dwarfdump line table functionality that do this so you can get some inspiration from those.

I think if we're worried about the behavior changing when creating and want to make sure we can still handle it then we should check in binary input (and a note to say what it came from), if not then we can assemble it up using .loc just fine? Pavel: What kinds of tests were the dwarfdump .byte ones coming from?

Why is it necessary to check in the binary? Could we just assemble the input with llvm-mc on the fly?

Because when you assemble with llvm-mc in the usual test suite, it sets the compilation directory, include directory, and input file like so:

comp_dir: location where llvm-mc is run.
include_dir: absolute path to relativenames.s in the source tree
filename: "relativenames.s"

So, after the comments above that we *don't* want the remove the include dir, even if it is an absolute path, the difference between the output with relative names and without is nothing: We still get the absolute path.

You don't actually have to use the .loc/.file/etc. directives to create the line table. You can go lower level and spell out the debug_lines content directly (.byte et al.). That way, you should be able to create any line table you want. There are a bunch of existing tests for llvm-dwarfdump line table functionality that do this so you can get some inspiration from those.

I think if we're worried about the behavior changing when creating and want to make sure we can still handle it then we should check in binary input (and a note to say what it came from), if not then we can assemble it up using .loc just fine? Pavel: What kinds of tests were the dwarfdump .byte ones coming from?

I've written many of the llvm-dwarfdump .byte ones myself (see several test cases in tools/llvm-dwarfdump/X86), mostly to test odd code paths, like broken inputs, or specific behaviours where we can't rely on the input always being built the same way via .loc etc directives. It allows us fine-grained control over what ends up in the output of the debug line table, and is in some ways is easier to see what is going on because the specific debug line header entries are spelt out. I do lean heavily on comments though to help with the readability. The advantage is that it can easily be read and tweaked to match any new behaviour we want to test, unlike a canned binary. It's also friendlier to git in that way, and can be kept inline with the test file itself.

In my most recent work though, I've tended to lean on gtest unit tests, for which I wrote/expanded a generator that allows us to build up any input we want more or less, using a default prologue which can also be overridden if people want to test specific aspects of it. I think this is actually more preferable for testing the library code, since library code can be shared between multiple executables, as long as there are some higher-level lit tests.

Why is it necessary to check in the binary? Could we just assemble the input with llvm-mc on the fly?

Because when you assemble with llvm-mc in the usual test suite, it sets the compilation directory, include directory, and input file like so:

comp_dir: location where llvm-mc is run.
include_dir: absolute path to relativenames.s in the source tree
filename: "relativenames.s"

So, after the comments above that we *don't* want the remove the include dir, even if it is an absolute path, the difference between the output with relative names and without is nothing: We still get the absolute path.

You don't actually have to use the .loc/.file/etc. directives to create the line table. You can go lower level and spell out the debug_lines content directly (.byte et al.). That way, you should be able to create any line table you want. There are a bunch of existing tests for llvm-dwarfdump line table functionality that do this so you can get some inspiration from those.

I think if we're worried about the behavior changing when creating and want to make sure we can still handle it then we should check in binary input (and a note to say what it came from), if not then we can assemble it up using .loc just fine? Pavel: What kinds of tests were the dwarfdump .byte ones coming from?

Well, let me start of with saying that I haven't been paying close attention to this patch (I think I have a rough idea of what it does, but I haven't looked more closely) -- my comment was going off of the "this can't be produced with llvm-mc" claim. And I was writing it in a hurry since it was getting late.

Anyway, I believe there is general consensus that tests for debug info parsing/dumping code should be written close to the form in which the parser will read them. This is to ensure we have known static data to test against, and avoid having matching bugs in the producer and consumer "cancel each other out" (produce incorrect/nonconforming output, which can still be read my a matching version of the parsing tool). I also believe there is (maybe slighly less general) consensus that we should avoid checking in binary data. The main reason I don't like those is that they are not reviewable, and often not reduced.

For most debug info sections that makes the choice quite clear -- use llvm-mc with the appropriate .byte directives (most of the time obtained by tweaking/reducing the output "clang -S"). I think that strikes a good balance between "ensuring a static output" and "reviewability/modifiability/etc.", and I believe most of the existing tests follow that. In this sense the debug_line table is very unique, because there are multiple ways that one can generate line tables with llvm-mc, and the way that is "normally" done is still very far from the actual on-disk representation -- the .loc/.line directives leave a fair amount of freedom in how they end up being implemented.

So, my point was that if we cannot get llvm-mc to reliably produce the required line tables with .line directives (that's what I thought it uses -- it looks like it goes even higher-level than that with llvm-mc -g), then a level which would offer a similar kind of compromise/experience to other debug info sections would be spelling out the line table with .bytes.

Now, to answer your question, the main .debug_line .byte tests that I've worked with are those that test error handling on invalid inputs (e.g. tools/llvm-dwarfdump/X86/Inputs/debug_line_malformed.s). That's a pretty clear use case for this, as tables generated with .line should always be valid. However, they don't seem to be the only ones: e.g. tools/llvm-dwarfdump/X86/debug-line-dw-lne-end-sequence.s, and tools/llvm-dwarfdump/X86/debug-line-dw-lns-copy.s generate valid line tables but they test a very specific line table feature so they use .byte directives to ensure that it is generated. One test I wrote needed more than 256 file name entries. That could have been produced in a number of ways but a combination of .byte directives with assembler metaprogramming meant that the result was very short but still explicit.

That said, when I look at this test more closely, it's not clear to me whether that is really needed -- the steps in the comments (mkdir, cd) could be executed as a part of the test too, except that it would cd to %T/something instead of /tmp. Then the test could use the presence of something in the output path to check whether it is printing an absolute path or not. However I don't think that going the .byte route would be bad either. There's also a unit-test framework (see unittests/DebugInfo/DWARF/DWARFDebugLineTest.cpp) that could also be used to generate interesting line tables, but I am personally not very fond of it. However, using a binary file does not seem like a good choice to me, because of its opaqueness.

Address upstream concerns about a checked-in binary for the test case.

I have added the necessary debug info to the .s file--it is a fairly minimal set. I have also removed the checked in binary.

In my opinion, the test is far less clear than it was with the checked in binary--80% of the lines are now irrelevant boilerplate, and the issue this is trying to test is if llvm-symbolizer accurately reports certain details about the compilation's directory structure. But at this point I just want to move on.

Part of that problem is that llvm-symbolizer isn't smart enough to find the line table when .debug_info is stripped or minimal, so fixing that would be an improvement, but not in the cards for me today.

aprantl accepted this revision.May 29 2020, 2:45 PM

I have added the necessary debug info to the .s file--it is a fairly minimal set. I have also removed the checked in binary.

In my opinion, the test is far less clear than it was with the checked in binary--80% of the lines are now irrelevant boilerplate, and the issue this is trying to test is if llvm-symbolizer accurately reports certain details about the compilation's directory structure. But at this point I just want to move on.

That's okay. Feel free to add the original source code / instructions for how to regenerate the .s file in a comment!

This revision is now accepted and ready to land.May 29 2020, 2:45 PM
jhenderson accepted this revision.Jun 1 2020, 1:06 AM

LGTM, with nits.

llvm/lib/DebugInfo/DWARF/DWARFDebugLine.cpp
1233–1236

Nit: so for don't -> so don't

llvm/test/tools/llvm-symbolizer/relativenames.s
5

Nit: missing trailing full stop.

This revision was automatically updated to reflect the committed changes.