This is an archive of the discontinued LLVM Phabricator instance.

[DWARF] Make llvm-dwp handle DWARF v5 string offsets tables and indexed strings.
AbandonedPublic

Authored by wolfgangp on Feb 5 2018, 4:37 PM.

Details

Reviewers
dblaikie
Summary

llvm-dwp needs to remap string offsets from input .dwo or .dwp files and therefore needs to be made aware of DWARF v5 string offsets tables now that llvm generates them. Unlike the pre-v5 GNU-style string offsets tables, DWARF v5 string offsets tables have a header describing length, format and version of the table. This patch preserves the header and performs some simple validation when it rewrites the string offsets table contributions.
It also makes llvm-dwp handle v5 compile unit headers.

One thing to note is that the current implementation recreates the string offsets table in one single loop over the entire string offsets section. Since we want to support a mixture of v5 and pre-v5 CUs, we can't parse the section without knowing where the pre-v5 units' contributions are. This is a similar problem as in llvm-dwarfdump, so we go unit by unit in re-writing the table.

Diff Detail

Event Timeline

wolfgangp created this revision.Feb 5 2018, 4:37 PM
dblaikie added inline comments.Feb 7 2018, 6:43 PM
test/tools/llvm-dwp/X86/invalid_string_form.test
3 ↗(On Diff #132902)

Should the error message specify the form it was encoded with? (& maybe mention that it's unsupported in dwo/dwp files, to highlight that for the reader so they're less confused/not lead into thinking this form is just unsupported by LLVM tools more generally)

test/tools/llvm-dwp/X86/string_offsets.test
11 ↗(On Diff #132902)

Yeah, I forget whether I already mentioned this in other tests.

The reason is that writing an ELF file requires seeking through the file (or doing a lot of precomputation about the size of things) to write the section table offset in the ELF header (basically: write the ELF header but leave a gap, write all the sections (thus discovering their sizes and offsets), then write the section tables, etc - then seek back to the start of the file to fill in the offset to the section tables). I'm probably not using the right terminology.

Other tools, like LLVM itself - write the file to a temp location, then copy from that I think.

Be nice to fix this one way or another, but I'm not sure of the best approach. Maybe just detecting if the output file is seekable, and if it isn't, using a temporary file. That way we don't incur the temporary file overhead when the output location /is/ seekable.

13–15 ↗(On Diff #132902)

Could you enumerate what those codepaths are? & it might be better to test them separately if there are interesting things at each DWP step?

I guess what I'm trying to say/understand is why a test would involve two invocations of the dwp tool, rather than one.

You can specify more than 2 inputs to DWP together (either multiple DWOs, DWPs, or any combination) so maybe you only need one invocation, with 3 DWO inputs? Not sure.

tools/llvm-dwp/llvm-dwp.cpp
728 ↗(On Diff #132902)

I'm not quite following this change (from 'continue' to 'else') could you help me understand it?

wolfgangp updated this revision to Diff 133656.Feb 9 2018, 11:01 AM
wolfgangp marked 2 inline comments as done.

Addressed review comments, added a test for detecting DW_FORM_strp which is not supported in dwp files.

test/tools/llvm-dwp/X86/invalid_string_form.test
3 ↗(On Diff #132902)

I guess it would make sense to single out DW_FORM_strp, since it's supported in a dwo file, but not in a dwp file. Other than that, any other string form wouldn't be supported anywhere. I made that change, let me know if that's better.

test/tools/llvm-dwp/X86/string_offsets.test
13–15 ↗(On Diff #132902)

There are 2 relevant code paths in write(). One is processing a dwo input file (which doesn't have an index), the other a dwp file (which does). Both file types can be combined in any order and any number, but it's really important to test the dwp input for the following reason: Since we have to go unit by unit when rewriting the string offsets contributions, we need to collect all the individual contributions to .debug_str_offsets.dwo first. This happens during the initial scanning of the CU index table and so occurs in the order in which the CUs appear in the index table. This may not be the same order as their contributions are laid out in .debug_str_offsets.dwo, so before we write the contributions back to the output dwp file, we need to sort them according to their offsets in .debug_str_offsets.dwo. This happens in writeStringOffsetsDWP(). It's important to preserve the order of string offsets table contributions in the output because the TU index table refers to the exact same contributions.

So I was thinking that a good way to test this would be to first verify the dwo codepath, and then the dwp codepath, assuming that the dwp that was generated in the first step is correct.

I added some tests to verify the first step. If you're uncomfortable with basing the second test on the results of the first, I'll figure something else out.

tools/llvm-dwp/llvm-dwp.cpp
728 ↗(On Diff #132902)

No reason. I think I had some code at the end in a previous attempt and didn't revert back to the continue - approach. I removed the change.

Curious combination of tests - some assembly, some IR. I think I made the existing test cases all checked in binaries.

Perhaps we could come up with a uniform approach/strategy here? (& yeah, I'd probably separate the "testing DWP input" from "testing DWP output" rather than weaving one into the other)

Curious combination of tests - some assembly, some IR. I think I made the existing test cases all checked in binaries.

Perhaps we could come up with a uniform approach/strategy here? (& yeah, I'd probably separate the "testing DWP input" from "testing DWP output" rather than weaving one into the other)

I admit I have a preference for assembly files over checked-in binaries. With checked-in binaries you'd probably have to check in the assembly source as well for reference, and then you have to keep them consistent.

Curious combination of tests - some assembly, some IR. I think I made the existing test cases all checked in binaries.

Perhaps we could come up with a uniform approach/strategy here? (& yeah, I'd probably separate the "testing DWP input" from "testing DWP output" rather than weaving one into the other)

I admit I have a preference for assembly files over checked-in binaries. With checked-in binaries you'd probably have to check in the assembly source as well for reference, and then you have to keep them consistent.

Same here, which is why I keep doing it.
Hard to code-review or edit a binary file. I don't think it costs that much to run llvm-mc on an assembler source, and the benefit in ease of understanding the test is huge.

wolfgangp updated this revision to Diff 134350.Feb 14 2018, 6:15 PM

Changed the test (string_offsets.test) to perform the 2 tests independently (instead of the second test depending on the result of the first). Added a hand-constructed dwp file to serve as input to test case 2. It has a mix of v5 and v4 units, type units to make sure the reference strings correctly and has the string offsets table contributions in a different order than the order in which the CUs appear in the index table.

I'm still producing the dwo files from IR, I could turn those into assembly files as well, if that's preferable.

I'm still producing the dwo files from IR, I could turn those into assembly files as well, if that's preferable.

I think it'd probably be better - could you explain what the motivation is for them being IR tests? (what's the different needs there compared to the assembly test cases?)

test/tools/llvm-dwp/Inputs/string_offsets/mixed_dwp.s
1–4 ↗(On Diff #134350)

Worth having repro steps here? (original source, commands used to produce it, including describing the manual steps used to reorder sections)

Should this comment explain why the string offset contributions are reordered? Might help - though probably want to explain it in the test file too.

Also perhaps it'd be easier if the string_offsets.test file were split (there aren't that many "BOTH" lines - might be easier to read as two separate tests rather than trying to have pieces in common?) & then this mixed_dwp.s could be its own test with the RUN/CHECK lines written in the file directly, rather than split over two files?

I'm still producing the dwo files from IR, I could turn those into assembly files as well, if that's preferable.

I think it'd probably be better - could you explain what the motivation is for them being IR tests? (what's the different needs there compared to the assembly test cases?)

Can't really say there was any particular motivation other than the fact that producing these files with v5 string offsets tables via llc/objcopy is supported now and so the test case provides a small measure of integration testing. But no matter, I'll hand-code them in assembly, then.

Removed the test for dwo-only input since it's redundant. The 'mixed' test of dwo + dwp file covers all cases.

Made the mixed test an assembly file, which does assemble another unit (c.s) in order to perform the test. The auxiliary units (in IR) a.ll b.ll and c.ll have been removed. c.ll has been replaced by the aforementioned c.s.

Added some more commentary to explain the rationale for the difference between the order of string table contributions and corresponding CUs in the index table.

wolfgangp added inline comments.Feb 21 2018, 11:06 AM
test/tools/llvm-dwp/Inputs/string_offsets/mixed_dwp.s
1–4 ↗(On Diff #134350)

The test is now an assembly file, though it still needs to assemble a third unit, which lives in the Input directory. There isn't too much repro for the hand-code dwp file, other than the units themselves. The index sections are hand-constructed.

(this patch doesn't have any changes to llvm-dwp - are those missing? Did they already get reviewed/submitted? something else?)

string_offsets_mixed.s
40–46

Oh... I hadn't realized/understood/thought about this.

That's kind of awkward - mixing blobs with headers and blobs without headers in the same section (str_offsets) & then having to use the CU/TUs to disambiguate/dictate how to parse those chunks.

Can we avoid that? Could we just say v4 and v5 are incompatible? Have a flag or something that checks.

Then we could always walk the str_offsets alone, either expecting header'd sections (which I hope are self descriptive - once you know you're reading a v5 str_offsets, you don't need to consult the CU for anything to do that?) or non-header'd sections (where you just treat every word as a string offset without consideration for how those are divided into contributions)

(Ah, I see you mentioned/highlighted that in the patch description too)

(this patch doesn't have any changes to llvm-dwp - are those missing? Did they already get reviewed/submitted? something else?)

There were no changes to llvm-dwp.c from the previous review, I'm not clear why the diff is not displayed by phabricator. It certainly wasn't submitted yet. I'm able to see it by clicking on "Show older changes".

wolfgangp added inline comments.Feb 22 2018, 5:02 PM
string_offsets_mixed.s
40–46

That's kind of awkward - mixing blobs with headers and blobs without headers in the same section (str_offsets) & then having to use the CU/TUs to disambiguate/dictate how to parse those chunks.

Can we avoid that? Could we just say v4 and v5 are incompatible? Have a flag or something that checks.

That would have been the simpler solution, but llvm-dwarfdump is already handling the same thing, so I thought it would be inconsistent if llvm-dwp didn't handle it too. We also would have to reject any dwp input files with mixed units that were created by non-llvm tools.

It would certainly be easy to reject any mixing of v5 and v4 units.
Pity, though, since it's already working...

dblaikie added inline comments.Feb 26 2018, 4:47 PM
string_offsets_mixed.s
40–46

I'd be happy to hear some other people's opinions on this (Adrian & Paul?).

It seems pretty unfortunate to have this split between versions making a bunch of complexity to support like this.

probinson added inline comments.Feb 26 2018, 5:51 PM
string_offsets_mixed.s
40–46

The sticking point for me is whether there are objects in the wild that use GNU-style .debug_str_offsets, and might get linked with proper v5 .debug_str_offsets. The linker will combine them without a second thought, and it's unreasonable to ask a linker to verify.
I guess... given that you need to enable this stuff explicitly for gcc with DWARF v4, and the point is to reduce relocations which is an irrelevant consideration for fission, it's maybe not so likely that we have objects like this in the wild. That makes it reasonable to put our collective foot down and say llvm-dwp won't mix-n-match.

I do think it's worthwhile for llvm-dwarfdump to handle the mixed case, partly because it's already done and partly because it's a different kind of tool (diagnostic/analysis rather than production).

Ok, souunds like the best thing to do would be to abandon this approach and start over in favor of a v5 only approach (in addition to the existing one of course).

wolfgangp abandoned this revision.Feb 28 2018, 10:24 AM