This is an archive of the discontinued LLVM Phabricator instance.

[llvm-objdump] Add --process-context to adjust VMAs
Accepted · Public

Authored by mysterymath on Apr 11 2023, 2:05 PM.

Details

Summary

This flag loads a JSON process context, e.g. emitted by llvm-symbolizer,
which records the runtime addresses that each virtual address in a
module corresponds to. This allows adjusting VMAs for display
automatically.

Event Timeline

mysterymath created this revision. Apr 11 2023, 2:05 PM
Herald added a project: Restricted Project.
mysterymath requested review of this revision. Apr 11 2023, 2:05 PM

Remove spurious #include; git clang-format.

jhenderson added inline comments. Apr 12 2023, 1:25 AM
llvm/docs/CommandGuide/llvm-objdump.rst
185

This reference to the llvm-symbolizer option should be a link to the relevant documentation.

llvm/test/tools/llvm-objdump/X86/markup-context.test
10 ↗(On Diff #512617)

In this and other dumps below, since it's really only the VMA that you care about, you should omit the other columns (replace with wildcards where necessary), to make it clear what you're actually trying to test.

48 ↗(On Diff #512617)

This will fail on various OSes. You should use the %errc... substitutions to get the platform-specific message. You'll need to pass them in as FileCheck variables.

59 ↗(On Diff #512617)

I think it would be useful to have comments in the YAML to explain why each section/symbol etc is interesting to the test.

llvm/tools/llvm-objdump/llvm-objdump.cpp
3054

You should probably have a test case that shows that only the last --markup-context is used.

3058

There are three checks here, but only two test cases that I can map them to. Is one of them missing a test case?

mysterymath marked 6 inline comments as done.

Address comments.

llvm/test/tools/llvm-objdump/X86/markup-context.test
48 ↗(On Diff #512617)

Ah thanks, TIL!

jhenderson accepted this revision. Apr 13 2023, 12:05 AM

LGTM.

llvm/test/tools/llvm-objdump/X86/markup-context.test
63 ↗(On Diff #512992)

Nit: most new tests in llvm-objdump use ## for comments. Comments should also end with a "."

This revision is now accepted and ready to land. Apr 13 2023, 12:05 AM

Comment changes.

MaskRay accepted this revision. Apr 13 2023, 1:27 PM
mysterymath marked an inline comment as done. Apr 13 2023, 1:41 PM
mysterymath retitled this revision from [llvm-objdump] Add --markup-context to adjust VMAs. to [llvm-objdump] Add --process-context to adjust VMAs. Apr 24 2023, 4:21 PM
mysterymath edited the summary of this revision. (Show Details)

I hadn't really expected to have an option like this. I'd always presumed one would just use separate scripting to process the JSON into --build-id and --adjust-vma switches. But this can handle nonuniform adjustments for a module that doesn't have a constant runtime - link-time load bias across all its segments (which is possible in various general cases, though not usually used in common ELF layouts).
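To make the nonuniform case concrete, here is a minimal sketch of translating a link-time address to a runtime address when each segment has its own load bias. The `Mapping` type, the `to_runtime` helper, and the sample addresses are all hypothetical and are not taken from the patch or from the JSON schema.

```python
from typing import List, NamedTuple, Optional


class Mapping(NamedTuple):
    """One loaded region of a module (hypothetical shape, for illustration)."""
    static_start: int   # link-time address of the region
    size: int           # size of the region in bytes
    runtime_start: int  # address at which the region was actually loaded


def to_runtime(addr: int, mappings: List[Mapping]) -> Optional[int]:
    """Translate a link-time address using the region that contains it."""
    for m in mappings:
        if m.static_start <= addr < m.static_start + m.size:
            return addr - m.static_start + m.runtime_start
    return None  # The address is not covered by any region.


# Two regions with different biases: a single --adjust-vma value
# could not describe both of them.
mappings = [
    Mapping(0x0000, 0x1000, 0x7F0000000000),  # bias 0x7F0000000000
    Mapping(0x2000, 0x1000, 0x7F0000005000),  # bias 0x7F0000003000
]
```

With a uniform bias every region would share one offset and the lookup would collapse to a single addition, which is the case --adjust-vma already covers.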

AFAICT, this basically uses --process-context to replace --adjust-vma and nothing else. That seems a bit odd to me, since the JSON format contains a) multiple modules and b) build IDs for each module. But here you are just looking at whatever one file the user pointed objdump at via file name or -build-id switch, with no cross-checking that the module looked up from the JSON mmap data has either the same build ID as the chosen file or a set of memory regions congruent with the phdrs (if ELF, or equivalent elsewhere) in the file. It's fine for objdump not to be doing all this stuff, but I think I'd want something doing that and so just having objdump parse the same JSON I'd want to preprocess in other ways to save me passing the --adjust-vma I could compute while doing that preprocessing seems like it might not be enough of a feature to bother supporting.

I can imagine a couple of modes for a more thoroughgoing feature that seems like it could more nearly eliminate the need to do separate JSON-based scripting to drive using this in practice.

One approach is to have JSON provide simply an address adjustment reference in addition to separate file selection, as you have here, but with some robustness cross-checking. That is, if the input file specified (and already opened by now) has a build ID, then only use address adjustments that apply to a module specified to have that build ID. If the input file has no build ID, then a good fallback cross-check that the static address has a valid runtime correspondence is to normalize the mmap region list for each module (or just each module with no build ID) in the input and compare that against the normalized (i.e. address-sorted and page-rounded) list of segments in the file's headers (PT_LOAD phdrs for ELF). Perhaps have a mode (or default) to error/warn if any static address being presented doesn't map to any module.
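The fallback cross-check could look roughly like the sketch below. It assumes both sides are available as `(start, size)` pairs in the same module-relative address space, and it assumes a 4 KiB page size. Merging overlapping or touching ranges after rounding is an extra choice made here so that segments sharing a page compare equal on both sides; it is not something the comment above specifies. The names `normalize` and `congruent` are hypothetical.

```python
PAGE = 0x1000  # Assumed page size; real code would take this from the target.


def normalize(regions, page=PAGE):
    """Page-round each (start, size) pair outward, sort, and merge overlaps."""
    rounded = sorted(
        (start & ~(page - 1), (start + size + page - 1) & ~(page - 1))
        for start, size in regions
    )
    merged = []
    for lo, hi in rounded:
        if merged and lo <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def congruent(context_regions, load_segments, page=PAGE):
    """True if the context's regions cover the same pages as the file's segments."""
    return normalize(context_regions, page) == normalize(load_segments, page)
```

A module from the context would then be accepted for a file without a build ID only if `congruent` holds for its region list and the file's loadable segments.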

A fancier approach is to make any specified input files (i.e. file names or -build-id switches) identify a subset of the JSON-based list of modules. That is, the JSON-based list of modules acts like a list of -build-id switches to go and find a bunch of modules. The command-line list acts as a filter to identify a subset of those to actually find. There could be a new --all-modules switch or just no file/build-id arguments could mean all when there is context input (instead of a command-line error as it always is now). Then you act like objdump always does with multiple file arguments: dump the requested parts from one file, then the next, with "file blah in format soforth:" lines before each one (perhaps something novel in lieu of file name when it came from -build-id or implicit debuginfod fetching of all in the module list, like the build ID or the resolved debuginfod URL instead of or in addition to the possibly empty and/or synthetic file name from the module element name field).

An especially script-friendly variant of the fancy approach could accept an enriched version of the JSON schema. This would be easy for a script to inject into the context-capture JSON without ingesting it in any full way. Each module object in the JSON lists can have additional keys giving an output file name and perhaps a set of objdump switches to apply for that particular module's output (which can then combine with and/or supersede command-line switches selecting output details). This would do nearly all the work of the original scripting use case I had in mind, which would just use the symbolizer to parse markup and then lightly filter its JSON to inject {"output":"dir/{moduleid}.{name}.{buildid}.lst", "args":["-drl", "--demangle"]} and then feed that to objdump. Making objdump that fancy makes it tempting to offer some canned version without JSON massaging for objdump -dl --output=%{id}.%{name}.%{buildid}.lst or some such ad hoc syntax like various things use for generating dump file names with interpolated user strings, since that would replace the whole of my intended script. But making it a purely scriptable piece requiring some jq pipelines or the like remains extremely easy to cobble together from the smaller general pieces and not bother to fret about the precise UI details of any all-in-one switch features.
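As a rough illustration of the "light filter" step, the sketch below adds the two keys to every module object without interpreting anything else in the document. The top-level `"modules"` key and the per-module fields in the sample are assumptions made for the example; the real context JSON may be laid out differently. The template string is passed through verbatim, on the assumption that the consumer would do the interpolation.

```python
import copy
import json


def inject_output(context, output_template, args):
    """Return a copy of the context with "output" and "args" added per module."""
    enriched = copy.deepcopy(context)  # Leave the caller's data untouched.
    for module in enriched.get("modules", []):  # "modules" is an assumed key.
        module["output"] = output_template
        module["args"] = list(args)
    return enriched


# Hypothetical context with a single module.
context = {
    "modules": [
        {"id": 0, "name": "libc.so.6",
         "build_id": "e144007f35d794adf218479af5ddcb2a11a2c583"},
    ]
}

enriched = inject_output(
    context, "dir/{moduleid}.{name}.{buildid}.lst", ["-drl", "--demangle"]
)
print(json.dumps(enriched, indent=2))
```

The same transformation is a one-line jq filter in practice; the Python form is only meant to show how little of the schema the script needs to understand.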

In a different all-in-one vein as I raised for contemplation on an earlier change, we again might consider these features in the more general light of different kinds of ProcessContext input via direct switches using common library code rather than only via the JSON bottleneck. (Though maybe instead we do want a JSON bottleneck, I don't know.) For objdump, making use of that kind of requires either the pure subset-specified-by-other-means or some kind of all-in-one output selector if you don't just want everything to go sequentially to stdout. But for parity with the symbolizer, it's kind of compelling to get even better analogs to:

$ eu-unstrip -n -p $$
0x55a546ea9000+0x134000 - - - /usr/bin/bash (deleted)
0x7fbe42000000+0x2e9000 - - - /usr/lib/locale/locale-archive (deleted)
0x7fbe42347000+0x1d4000 e144007f35d794adf218479af5ddcb2a11a2c583@0x7fbe42347380 /usr/lib/x86_64-linux-gnu/libc.so.6 /usr/lib/debug/.build-id/e1/44007f35d794adf218479af5ddcb2a11a2c583.debug /usr/lib/x86_64-linux-gnu/libc.so.6
0x7fbe42528000+0x32000 - - - /usr/lib/x86_64-linux-gnu/libtinfo.so.6.3 (deleted)
0x7fbe42563000+0x7000 697a51fa9b8ee632114d4e54f6f8ba7304f19700@0x7fbe42563248 /usr/lib/x86_64-linux-gnu/libnss_cache.so.2.0 - /usr/lib/x86_64-linux-gnu/libnss_cache.so.2.0
0x7fbe4256a000+0x7000 - /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache - /usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache
0x7fbe42573000+0x34000 74d101cb610a46ca719adfc8365bd257139ae610@0x7fbe42573248 /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 /usr/lib/debug/.build-id/74/d101cb610a46ca719adfc8365bd257139ae610.debug /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
0x7fffbb59b000+0x2000 - - - [vdso: 78194]
$ eu-addr2line -p $$ 0x7fbe42579000
./elf/./elf/dl-load.c:622:6

(I happen to have debuginfo for libc installed. Some of the tools can also read the vDSO out of memory to reconstruct it or acquire its build ID and if you have the right kernel debuginfo package you can get source line information from your vDSO code addresses too. Linux doesn't quite let this tool read process memory for hairy reasons, so it only knows the build IDs for the modules it could open by gleaned file name. Otherwise it would be able to e.g. lookup debuginfod for the bash binary even though it's been deleted locally by a package upgrade since that shell was launched, and the same for the vDSO.) There isn't another objdump tool that integrates all that stuff the way a few elfutils tools and things like systemtap do (but eu-objdump doesn't), but it doesn't seem like an unworthy goal nonetheless. :-) eu-unstrip is IIRC the only tool that deals with multiple module files as files (as opposed to e.g. eu-addr2line, which just uses them all as sources of symbols/debuginfo without caring how many different modules there are), and it uses the approach of command-line args that are a match list of module names (or resolved file names, with another switch) and omitting those args meaning all modules (where other options to say where to place all the output are required, but just use a fixed name scheme in a specified output directory rather than some fancy user-specified interpolation thing).