This is an archive of the discontinued LLVM Phabricator instance.

[LLDB] Add ObjectFileWasm plugin for WebAssembly debugging
ClosedPublic

Authored by paolosev on Dec 16 2019, 3:16 PM.

Details

Summary

This is the first in a series of patches to enable LLDB debugging of WebAssembly targets.

Current versions of Clang emit (partial) DWARF debug information in WebAssembly modules and we can leverage this debug information to give LLDB the ability to do source-level debugging of Wasm code that runs in a WebAssembly engine.

A way to do this could be to use the remote debugging functionalities provided by LLDB via the GDB-remote protocol. Remote debugging can indeed be useful not only to connect a debugger to a process running on a remote machine, but also to connect the debugger to a managed VM or script engine that runs locally, provided that the engine implements a GDB-remote stub that offers the ability to access the engine runtime internal state.

To make this work, the GDB-remote protocol would need to be extended with a few Wasm-specific custom query commands, used to access aspects of the Wasm engine state (like the Wasm memory, Wasm local and global variables, and so on).
Furthermore, the DWARF format would need to be enriched with a few Wasm-specific extensions, here detailed: https://yurydelendik.github.io/webassembly-dwarf.

This CL introduce classes ObjectFileWasm, a file plugin to represent a Wasm module loaded in a debuggee process. It knows how to parse Wasm modules and store the Code section and the DWARF-specific sections.

Diff Detail

Event Timeline

paolosev created this revision.Dec 16 2019, 3:16 PM
paolosev edited the summary of this revision. (Show Details)Dec 16 2019, 3:19 PM

Sounds exciting. My comments are all about formatting and coding style, if someone has something technical to say, too that would be appreciated.

lldb/source/Plugins/DynamicLoader/WASM-DYLD/DynamicLoaderWasmDYLD.cpp
1
  1. This hangs over the line
  2. a -*- C++ -*- is only necessary for .h files where C vs. C++ is ambiguous
77

This should be a doxygen comment.

132

Can you convert this to early exits? The deep nesting makes the control flow difficult to read.

lldb/source/Plugins/ObjectFile/WASM/ObjectFileWasm.cpp
1

ditto

32

Is a StringRef comparison easier to read here?

72

Do you want any form error logging/handling here?

76

Please use full sentences in comments with a trailing ..

99

early-exitify?

150

Again early exits would make this easier to read.

394

doxygen comment.

lldb/source/Plugins/ObjectFile/WASM/ObjectFileWasm.h
46

FYI you can group doxygen comments like this:

/// PluginInterface protocol
/// \{
...
/// \}
102

llvm::Optional<uint8_t> GetVaruint7()

104

same here

119

doxygen comments on all (most) non-override function would be useful.

I don't know the lldb codebase, but from a webassembly perspective this looks promising.

I suppose we are long way from having a webassebly VM that exports that correct wire protocol to actually test this?

lldb/include/lldb/Utility/ArchSpec.h
191

Add another newline below to the follow the existing grouping pattern?

lldb/source/API/SystemInitializerFull.cpp
73

Can you name this directory "wasm" rather than "WASM" since its not an acronym.

179

Why is the namespace needed here for wasm but not the other three above.. seems inconsistent.

lldb/source/Plugins/DynamicLoader/WASM-DYLD/DynamicLoaderWasmDYLD.h
1

Again, avoid WASM in the directory name. I suppose "Wasm" would also be acceptable, but I'm trying to push for "wasm" or "WebAssembly".

lldb/source/Utility/ArchSpec.cpp
107

Is this just clang format being greedy?

lldb/unittests/ObjectFile/WASM/CMakeLists.txt
11

trailing newline

lldb/unittests/ObjectFile/WASM/TestObjectFileWasm.cpp
2

somethings up with the ling wrapping here.

llvm/include/llvm/BinaryFormat/ELF.h
314

This seems like an odd place to add this, given that are not using or relying on ELF anywhere. Does this make sense?

I think this is really nice. I have some minor remarks here and there but otherwise this LGTM.

lldb/source/Plugins/DynamicLoader/WASM-DYLD/DynamicLoaderWasmDYLD.cpp
88

what about making this if and the one blow a if (ModuleSP module_sp = ...) { ...; return module_sp; }. Then you don't need to do the double-parentheses trick and the end of this function can just be return ModuleSP(); so it is obvious that the end of this function is the error code path.

171

no brackets

lldb/source/Plugins/ObjectFile/WASM/ObjectFileWasm.cpp
32

(IMHO it is)

65

no brackets

79

No brackets

88

no brackets

110

I don't think we do these static comments usually in LLVM?

128

Single line if -> no brackets needed.

172

Should also be switched to ConstString if you make the member of the section info a ConstString.

191

This member is only created here and only used below from what I can see? Also you never compare it against any other strings so it should be a std::string.

292

range-based loop

299

Your sect_info.name is already a std::string so comparing here against a ConstString is just a slower and less readable.

447

early-exit

449

Bonus: You can write this code by directly using llvm::raw_ostream by just calling s->AsRawOstream() to get the equivalent raw_ostream. I migrate all code to LLVM's stream classes so not having more code using lldb::Stream would be nice (but not required to get this patch in).

(same for the Stream code below)

lldb/source/Plugins/ObjectFile/WASM/ObjectFileWasm.h
116

This should be a ConstString. OR you keep this as a std::string and then you move all other ConstString variables that compare against your section names to be just plain std::string, llvm::StringRef and llvm::StringSwitch. But mixing ConstString and normal strings brings us the disadvantages of both worlds (hard to read and slow to compare) without any benefits.

labath added a subscriber: labath.Dec 17 2019, 1:42 AM

I think this is pretty good for a first pass, but I would like to see this split up into (at least) three patches (one for each plugin). That way, we can properly focus on each plugin. For instance, I'm pretty sure that the object file and symbol vendor changes are fully testable. The dynamic loader stuff may or may not be, but I don't want to discuss that yet to avoid too many parallel threads going on.

lldb/source/API/SystemInitializerFull.cpp
179

Some of the older code puts "plugins" into the default namespace, but lately we've started to put new plugins into their own namespaces. However, most of the old plugins have not been migrated yet.

lldb/source/Plugins/ObjectFile/WASM/ObjectFileWasm.cpp
299

ELF and PECOFF code has been already converted to use StringSwitch for this stuff. I'd do the same here.

llvm/include/llvm/BinaryFormat/ELF.h
314

Indeed, that looks very unexpected, and begs an explanation.

paolosev marked 43 inline comments as done.Dec 17 2019, 10:07 PM

Thanks for all the comments! I am updating the code following your suggestions.
Next step will be to split this into three distinct patches, as suggested by Pavel.

lldb/source/Plugins/DynamicLoader/WASM-DYLD/DynamicLoaderWasmDYLD.cpp
1

Fixed, but I see the -*- C++ -*- in all files.

lldb/source/Plugins/ObjectFile/WASM/ObjectFileWasm.cpp
1

But all the C++ files seem to have "-*- C++ -*-===//" in the first line.

lldb/source/Utility/ArchSpec.cpp
107

Yes, it was edited by clang format.

paolosev added inline comments.Dec 17 2019, 10:07 PM
llvm/include/llvm/BinaryFormat/ELF.h
314

Oops... my mistake. This is a relic from an old version where I expected Wasm stripped debug symbol might also come in an ELF file. Removed.

paolosev updated this revision to Diff 234463.Dec 17 2019, 10:08 PM
paolosev marked 2 inline comments as done.
paolosev removed a project: Restricted Project.

Addressed first review comments.

teemperor added inline comments.Dec 18 2019, 12:08 AM
lldb/source/Plugins/DynamicLoader/WASM-DYLD/DynamicLoaderWasmDYLD.cpp
1

The line just got cargo-culted into many *.cpp files so all occurrences in source files are unintentional.

lldb/source/Utility/ArchSpec.cpp
107

Can you revert that change? clang-format shouldn't touch these unrelated files if you use git-clang-format or something similar.

paolosev updated this revision to Diff 234471.Dec 18 2019, 12:40 AM
paolosev retitled this revision from [LLDB] Add initial support for WebAssembly debugging to [LLDB] Add ObjectFileWasm plugin for WebAssembly debugging.
paolosev edited the summary of this revision. (Show Details)
paolosev marked 2 inline comments as done.Dec 18 2019, 12:48 AM
paolosev updated this revision to Diff 234474.Dec 18 2019, 12:51 AM
paolosev set the repository for this revision to rG LLVM Github Monorepo.
aprantl added inline comments.Dec 19 2019, 11:48 AM
lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
45 ↗(On Diff #234474)

The LLVM coding style requests that doxygen comments should be on the declaration in the header file and not in the implementation.

375 ↗(On Diff #234474)

again.. in the header, or inside the function

452 ↗(On Diff #234474)

ditto

lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.h
18 ↗(On Diff #234474)

This line is redundant and can be removed.

25 ↗(On Diff #234474)

this comment is inconsistent with the others

115 ↗(On Diff #234474)

Please make sure to use full sentences that end in a . in all comments.

paolosev updated this revision to Diff 234920.Dec 20 2019, 10:05 AM
paolosev marked 7 inline comments as done.
paolosev added inline comments.
lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
45 ↗(On Diff #234474)

This function is not declared in the header. Probably it doesn't need a doxygen comment.

aprantl added inline comments.Dec 20 2019, 1:02 PM
lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
45 ↗(On Diff #234474)

If it isn't declared in a header, then it should be either static or in an anonymous namespace.
Using a doxygen comment is still preferred, since many IDEs will use that to display online help.

labath added inline comments.Dec 20 2019, 1:24 PM
lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
40 ↗(On Diff #234920)

This looks like it will cause problems on big endian hosts..

252 ↗(On Diff #234920)

I take it that wasm files don't have anything like a build id, uuid or something similar?

334–351 ↗(On Diff #234920)

It would be nice to merge these two section-creating blocks..

lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.h
116 ↗(On Diff #234920)

We don't use this typedef style (except possibly in some old code which we shouldn't emulate).

lldb/unittests/ObjectFile/wasm/TestObjectFileWasm.cpp
1 ↗(On Diff #234920)

Overall, these tests would be better off as "lit" tests. Something along the lines of:

yaml2obj %s >%t
lldb-test object-file %t | FileCheck %t

You can look at existing tests in test/Shell/ObjectFile for inspiration. Is there anything you test here that "lldb-test object-file" does not print out?

A bunch more comments from me. :)

A higher level question I have is whether there's something suitable within llvm for parsing wasm object files that could be reused. I know this can be tricky with files loaded from memory etc., so it's fine if it isn't possible to do that, but I am wondering if you have considered this option...

lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
35–38 ↗(On Diff #234920)

if (llvm::identify_magic(toStringRef(data_sp->GetData())) != llvm::file_magic::wasm_object) maybe ?

40 ↗(On Diff #234920)

One way to get around that would be to use something like llvm::support::read32le.

46–63 ↗(On Diff #234920)

Maybe merge these and make the maximum width a function argument?

192 ↗(On Diff #234920)

I'd just use StringRef here -- there's no advantage in ConstStringifying this...

215 ↗(On Diff #234920)

This seems odd -- I don't think any of our object file plugins work this way. It's normally the symbol vendor who fiddles with the symbol file spec.

This is kind of similar to the gnu_debuglink section, and the way that works in elf is that the object file exposes this via a separate method, which the symbol vendor can then query and do the appropriate thing.

Maybe you could just drop this part and we can get back to it with the symbol vendor patch?

217–225 ↗(On Diff #234920)

I wouldn't be afraid of presenting external_debug_info as an actual section, if that's how it's treated byt the object file format. And it looks like that could simplify this code a bit...

342–343 ↗(On Diff #234920)

Are the debug info sections actually loaded into memory for wasm? Should these be zero (that's what they are for elf)?

382 ↗(On Diff #234920)

This is strange.. I wouldn't expect that the section decoding logic should depend on the actual address that the object is loaded in memory. Can you explain the reasoning here?

395 ↗(On Diff #234920)

Normally I would expect to see ->GetFileAddress() here, as that's the thing which says how the sections are laid out in memory. The way you create these sections, the two values are the same, but it still seems more correct to call GetFileAddress()

paolosev marked 10 inline comments as done.Jan 3 2020, 10:41 AM

A bunch more comments from me. :)

A higher level question I have is whether there's something suitable within llvm for parsing wasm object files that could be reused. I know this can be tricky with files loaded from memory etc., so it's fine if it isn't possible to do that, but I am wondering if you have considered this option.

I have considered this option, there is indeed some code duplication because there is already a Wasm parser in class WasmObjectFile (llvm/include/llvm/Object/Wasm.h).
However class WasmObjectFile is initialized with a memory buffer that contains the whole module, and this is not ideal. LLDB might create an ObjectFileWasm either from a file or by retrieving specific module chunks from a remote, but it doesn't often need to load the whole module in memory.
I think this is the reason why other kind of object files (like ObjectfileELF) are re-implemented in LLDB rather than reusing existing code in LLVM (like ELFFile in llvm/include/llvm/Object/ELF.h).

lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
40 ↗(On Diff #234920)

Good point, modified to use read32le.

252 ↗(On Diff #234920)

Not yet. There is a proposal to add a uuid to wasm modules in a custom section, but it's not part of the standard yet.

lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.h
116 ↗(On Diff #234920)

Yes, this was copied from old code :). Removed.

paolosev updated this revision to Diff 236237.Jan 5 2020, 5:39 AM
paolosev marked 8 inline comments as done.

Addressed more review comments:

  • removed code to manage "external_debug_info" sections; logic for this will be implemented in the symbol vendor code.
  • modified test code, from unittests to be Shell "lit" tests.
lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
46–63 ↗(On Diff #234920)

I'd keep the functions separate, it's better if they return different sized integers.

342–343 ↗(On Diff #234920)

Yes, I was thinking that the debug sections should be loaded into memory; not sure how this works for ELF, how does the debugger find the debug info in that case?

382 ↗(On Diff #234920)

There is a reason for this: DecodeNextSection() calls:

ReadImageData(offset, size)

and when the debug info sections are embedded in the wasm module (not loaded from a separated symbol file), ReadImageData() calls Process::ReadMemory() that, for a GDB remote connection, goes to ProcessGDBRemote::DoReadMemory(). Here I pass the offset because it also represents the specific id of the module loaded in the target debuggee process, as explained in the comment above.

I would suggest removing GetVaruint7 and GetVaruint32 and adding "llvm::Optional<uint8_t> DataExtractor::GetULEB128(uint64_t *offset_ptr, uint64_t max_value);" as mentioned in inlined comments.

lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
49 ↗(On Diff #234474)

Is it ok if we consume more than 1 byte here? What is the offset points to a larger ULEB, are we ok with advancing the offset by multiple bytes or should we back it up and return llvm::None?

This might be a good candidate to add to DataExtractor directly as:

/// Extract a ULEB128 number with a specified max value. If the extracted value exceeds 
/// "max_value" the offset will be left unchanged and llvm::None will be returned.
llvm::Optional<uint8_t> DataExtractor::GetULEB128(uint64_t *offset_ptr, uint64_t max_value);

There are many places where we extract a uint64_t, but only need a uint16_t (like in the DWARF parser where all DW_TAG_XXXX, DW_AT_XXX and DW_FORM_XXX values must only be uint16_t values but are encoded as ULEB128 values. So this could be used elsewhere if we do put it into DataExtractor.

162–163 ↗(On Diff #234474)

Is this a one byte section ID or is it a ULEB? Not sure why it would be encoded as a ULEB if it is always one byte? IF this really is just a one byte value, then replace with:

uint8_t section_id = data.GetU8(&offset);
59 ↗(On Diff #236237)

remove if we add:

llvm::Optional<uint8_t> DataExtractor::GetULEB128(uint64_t *offset_ptr, uint64_t max_value);
clayborg added inline comments.Jan 6 2020, 1:12 PM
lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
34 ↗(On Diff #236237)

I typically would put "data_sp" into a DataExtractor and extract a uint32_t and then check the decoded value with something like:

static bool ValidateModuleHeader(DataExtractor &data, uint64_t *offset_ptr) {
  auto magic = data.GetU32(offset_ptr);
  if (magic == WASM_MAGIC)
    return true;
  if (magic == WASM_CIGAM) {
    // Set byte order in DataExtractor
    data.SetByteOrder(data.GetByteOrder() == eByteOrderBig ? eByteOrderLittle : eByteOrderBig);
    return true;
  }
  return false;
}

This function expects a DataExtractor to be passed in that has "data_sp" inside of it with the host endian set as the byte order. It will set the byte order correctly. It also expects to have two uint32_t macros defined: WASM_MAGIC and WASM_CIGAM. These contain the non byte swapped and the byte swapped magic values. Easy to replace those with real definitions from else where (llvm::file_magic::wasm_object? Not sure of the type of this though, seemed like a StringRef).

44 ↗(On Diff #236237)

use DataExtractor::GetU32()? Or is the byte order always little endian for wasm object files?

187–195 ↗(On Diff #236237)

All these lines can use the GetCStr with a length:

ConstString sect_name(data.GetCStr(&offset, *name_len));
if (!sect_name)
  return false;
199 ↗(On Diff #236237)

remove ConstString constructor here if we switch code above as suggested.

labath added a comment.Jan 7 2020, 4:17 AM

A bunch more comments from me. :)

A higher level question I have is whether there's something suitable within llvm for parsing wasm object files that could be reused. I know this can be tricky with files loaded from memory etc., so it's fine if it isn't possible to do that, but I am wondering if you have considered this option.

I have considered this option, there is indeed some code duplication because there is already a Wasm parser in class WasmObjectFile (llvm/include/llvm/Object/Wasm.h).
However class WasmObjectFile is initialized with a memory buffer that contains the whole module, and this is not ideal. LLDB might create an ObjectFileWasm either from a file or by retrieving specific module chunks from a remote, but it doesn't often need to load the whole module in memory.
I think this is the reason why other kind of object files (like ObjectfileELF) are re-implemented in LLDB rather than reusing existing code in LLVM (like ELFFile in llvm/include/llvm/Object/ELF.h).

Thanks for sharing your thoughts. I think the history of ObjectFileELF is a bit more complicated, but yes, being able to load from memory is a good reason for not using the llvm reader right now (also, the parsing code seems to be quite simple).

lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
252 ↗(On Diff #234920)

Ok. Thanks for clarifying. I think somethink like that will be useful, as otherwise you can't tell if you're using the correct separate debug info file.

342–343 ↗(On Diff #234920)

Are you referring to the load-from-memory, or load-from-file scenario? Normally, the debug info is not loaded into memory, and if lldb is loading the module from a file, then the debug info is loaded from the file (and we use the "file offset" field to locate them). Loading files from memory does not work in general (because we can't find the whole file there -- e.g., section headers are missing). The only case it does work is in case of jitted files. I'm not 100% sure how that works, but given that there is no if(memory) block in the section creation code, those sections must too get vm size = 0.

In practice, I don't think it does not matter that much what you put here -- I expect things will mostly work regardless. I am just trying to make this consistent in some way. If these sections are not found in the memory at the address which you set here (possibly adjusted by the load bias) then I think it makes sense to set vm_size=vm_addr=0. If they are indeed present there, then setting it to those values is perfectly fine.

382 ↗(On Diff #234920)

What you say about ReadMemory makes sense, but it's not clear to me why you couldn't use the value in m_memory_addr for this. For in-memory object file this value should contain the actual address the object file was loaded from (which, I would expect, will include the module_id business), and so you wouldn't need the dynamic loader address in order to locate it. I believe this is how the other object file plugins do their in-memory loading..

lldb/source/Utility/ArchSpec.cpp
226

add a trailing comma to avoid subsequent needs to touch this line when adding new entries..

lldb/test/Shell/ObjectFile/wasm/basic.yaml
21–79 ↗(On Diff #236237)

Could you remove the sections which are not relevant for this test? This makes it easier to see the correspondence between the yaml and the test expectations..

lldb/test/Shell/ObjectFile/wasm/debug-sections.yaml
42 ↗(On Diff #236237)

Same here (and you don't even have to put actual valid debug info in the debug info sections -- just a couple of random bytes is sufficient).

paolosev updated this revision to Diff 236868.Jan 8 2020, 11:04 AM
paolosev marked 20 inline comments as done.
paolosev added inline comments.
lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
49 ↗(On Diff #234474)

Modified in DataExtractor, as suggested. However we need to return a llvm::Optional<uint64_t>, which can be not ideal because the result could need to be casted to another integer type.
We could also male DataExtractor::GetULEB128 templatized on the integer type, getting max_value as std::numeric_limits<int>::max(), but I don't want to overly complicate the code.

162–163 ↗(On Diff #234474)

You are right, it is a one-bye section. I cannot use data.GetU8(&offset); though, because it returns 0 on failure.

342–343 ↗(On Diff #234920)

Modified, but I am not sure I completely understood the difference of file addresses and vm addresses.

In the case of Wasm, we can have two cases:
a) Wasm module contains executable code and also DWARF data embedded in DWARF sections
b) Wasm module contains code and has a custom section that points to a separated Wasm module that only contains DWARF sections

The file of Wasm modules that contains code should never be loaded directly by LLDB; LLDB should use GDB-remote to connect to the Wasm engine that loaded the module and retrieve the module content from the engine.
But when the DWARF data has been stripped into a separate Wasm file, that file should be directly loaded by LLDB in order to load the debug sections.
So, if I am not wrong, only in the first case we should assume that the debug info is loaded into memory, and have vm_size != 0?

382 ↗(On Diff #234920)

Good point! Changed.

34 ↗(On Diff #236237)

Not sure about this... WebAssembly is always little-endian and a DataExtractor is not really needed here.
I made sure to set the byte order where the DataExtractor is created in ReadImageData.

59 ↗(On Diff #236237)

Actually, section_id is encoded as a byte not as a varuint7, so I modified the code accordingly, with llvm::Optional<uint8_t> GetByte(DataExtractor &, lldb::offset_t *).
Even this could be moved to DataExtractor, maybe?

187–195 ↗(On Diff #236237)

I had tried to use DataExtractor::GetCStr() but it doesn't work because it wants null-terminated strings, and this is not the case in Wasm, where strings are encoded as len (varuint32) followed by an array of len bytes that represent UTF8 chars (https://webassembly.github.io/spec/core/binary/values.html#names).

lldb/test/Shell/ObjectFile/wasm/basic.yaml
21–79 ↗(On Diff #236237)

Ok. What can be confusing is that Wasm modules (almost) always have at least a few standard sections (MEMORY, GLOBAL, CODE, ...) but these sections can be ignored by LLDB, so also by these tests.

labath added inline comments.Jan 9 2020, 3:55 AM
lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
342–343 ↗(On Diff #234920)

You're right, this is getting pretty confusing, as a lot of the concepts we're talking about here are overloaded (and not even consistently used within lldb). Let me try to elaborate here.

I'll start by describing various Section fields (I'll use ELF as a reference):

  • file offset (m_file_offset) - where the section is physically located in the file. It does not matter if the file is loaded from inferior memory or not (in the former case, this is an offset from location of the first byte of the file). In elf this corresponds to the sh_offset field.
  • file size (m_file_size) - size of the section in the file. Some sections don't take up any space in the file (.bss), so they have this zero. This is roughly the sh_size field in elf (but it is adjusted for SHT_NOBITS sections like .bss)
  • file address (m_file_addr) - The address where the object file says this section should be loaded to. Note that this may not be the final address due to ASLR or such. It is the job of the SetLoadAddress function to compute the actual final address (which is then called the "load address"). This corresponds to the sh_addr field in elf, but it is also sometimes called the "vm address", because of the p_vaddr program header field
  • vm size (m_byte_size) size of the section when it gets loaded into memory. Sections which don't get loaded into memory have this as 0. This is also "rougly" sh_size`, but only for SHF_ALLOC sections.

All of these fields are really geared for the case where lldb opens a file from disk, and then uses that to understand the contents of process memory. They become pretty redundant if you have loaded the file from memory, but it should still be possible to assign meaningful values to each of them.

The file offset+file size combo should reflect the locations of the sections within the file (regardless of whether it's a "real" file or just a chunk of memory). And the file address+vm size combo should reflect the locations of the sections in memory (modulo ASLR, and again independently of where lldb happened to read the file from).

Looking at the patch, I think you've got most of these right. The only question is, what is the actual memory layout of a wasm debug info in the inferior process. I originally highlighted this because I was assuming the typical elf model where the debug info is not loaded into memory and the debugger loads it from a file. (jitted files are an afterthought in elf, and not that well supported/tested). However, now I get the impression that the wasm debug info is actually always loaded into memory (assuming it is present, that is), and so setting the vm size to non zero might actually be correct here. You'll need to make the call there, as I don't know how wasm actually works. (Note that you needn't concern yourself much with separate debug files here -- the vm size only really matters once you call SetLoadAddress to actually mark the object file as loaded into the process, and we never call SetLoadAddress on a separate debug file).

paolosev marked 2 inline comments as done.Jan 10 2020, 9:56 AM
paolosev added inline comments.
lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
342–343 ↗(On Diff #234920)

Thanks for the clarification!

WebAssembly modules are not really loaded into the memory of the inferior process like native executables are. The Wasm/JavaScript engine loads the module and executes it, either interpreting the Wasm bytecode or jit-compiling it into native code. Furthermore, in WebAssembly the code is distinct from the addressable memory.
All this poses some interesting problems to enable debugging with LLDB.

My idea is that a Wasm engine that wants to support LLDB debugging should implement a GDB remote stub and LLDB would not directly access the inferior memory but only access it by talking with the engine through the remote protocol. From what concerns debug info, by default Clang embeds it in a few custom sections in the Wasm module itself. But normally these debug sections should be stripped into a separate Wasm file (we should modify llvm-objcopy for this) because they are not useful in the engine and take a lot of space.

Usually an engine loads and keeps the whole Wasm module into memory, so when the debug info is embedded in the module, we can say that the debug info is loaded into the inferior process but it’s only accessible through the engine. Here SetLoadAddress will be used to specify at which ‘virtual address’ the engine loaded the module; for example a module with id==4 will be loaded at address 0x00000004`00000000 (logic for this will be implemented in a new class DynamicLoaderWasm). LLDB will use gdb-remote commands like ‘m’ to read chunks of the debug sections from a loaded Wasm modules, and the remote stub identifies the module from the id encoded in the address.

When the debug info is in a separate file, as you say we don’t call SetLoadAddress on that file, and instead LLDB will use m_file_offset, m_file_size to read the debug directly from the file (m_file_offset, m_file_size can be 0). The code above should be consistent with this logic.

paolosev marked an inline comment as done.Jan 10 2020, 10:34 AM

I apologize for the noob question, but how do I schedule a build for this diff with Harbormaster?

Sorry for the delay. I was trying to figure out whether I want to get into the whole DataExtractor discussion or not -- I eventually did... :/

Besides that bit, I think this is looking good..

lldb/source/Plugins/ObjectFile/wasm/ObjectFileWasm.cpp
342–343 ↗(On Diff #234920)

Cool, thanks for expaining. I think we've both learned a lot here.

lldb/source/Utility/DataExtractor.cpp
914 ↗(On Diff #236868)

It doesn't look like this actually happens, does it? (If max_value is exceeded, the offset will still be updated, right?).

And overall, I am not very happy with backdooring an api inconsistent with the rest of the DataExtractor (I am aware it was clayborg's idea). Overall, it would probably be better to use the llvm DataExtractor class, which has the Cursor interface designed to solve some of the problems you have here (it can handle EOF, it cannot check the uleb magnitude). And it tries to minimize the number of times you need to error check everything. The usage of it could be something like:

llvm::DataExtractor llvm_data = lldb_data.GetAsLLVM();
llvm::DataExtractor::Cursor c(0);
unsigned id = llvm_data.GetU8(c);
unsigned payload_len = llvm_data.GetULEB128(c);
if (!c)
  return c.takeError();
// id and payload_len are valid here
if (id == 0) {
  unsigned name_len = llvm_data.GetULEB128(c);
  SmallVector<uint8_t, 32> name_storage;
  llvm_data.GetU8(c, name_storage, name_len);
  if (!c)
    return c.takeError();
  // name_len and name valid here
  StringRef name = toStringRef(makeArrayRef(name_storage));
  unsigned section_length = ...;
  m_sect_infos.push_back(...)
}

This won't handle the uleb magnitude check, but these checks seem irrelevant and/or subsumable by other, more useful checks: a) Checking the name length is not necessary, as the code will fail for any names longer 1024 anyway (as that's the amount of data you read); b) instead of section_len < 2^32 it seems more useful to check that *offset_ptr + section_len is less than 2^32, to make sure we don't wrap the module_id part of the "address".

I apologize for the noob question, but how do I schedule a build for this diff with Harbormaster?

Harbormaster is a red herring. There's no automated pre-commit testing in llvm (TBE, there's an experimental @merge_guards_bot, which you can opt into, but it doesn't test or build lldb yet, so it's not very useful for you now...).

paolosev updated this revision to Diff 238168.Jan 14 2020, 10:34 PM
paolosev marked 2 inline comments as done.
paolosev added inline comments.
lldb/source/Utility/DataExtractor.cpp
914 ↗(On Diff #236868)

Good points! Changed.

labath accepted this revision.Jan 15 2020, 12:56 AM

Thanks. I think this is looking very good now. Excited to have this ready.

Do you have commit access?

This revision is now accepted and ready to land.Jan 15 2020, 12:56 AM

Thanks. I think this is looking very good now. Excited to have this ready.

Do you have commit access?

No, I certainly don't have commit access, this would be my first accepted patch. :)

This revision was automatically updated to reflect the committed changes.

BTW, I had to fix this patch (cd9e5c32302cd3b34b796683eedb072c6a1cfdc1) to build on macOS. uint64_t and size_t are differently spelled (though I think otherwise equivalent.) One is "long long unsigned int", the other "long unsigned int". I have no idea why that's true, but std::min refuses to compare a size_t and a unit64_t. Anyway, I fixed this by casting one of the two sides of the comparison. But this was causing problems because we have an api (ReadImageData) that takes a uint64_t for the offset and a size_t for the size. That seems a little weird to me, why are these different types?

BTW, I had to fix this patch (cd9e5c32302cd3b34b796683eedb072c6a1cfdc1) to build on macOS. uint64_t and size_t are differently spelled (though I think otherwise equivalent.) One is "long long unsigned int", the other "long unsigned int". I have no idea why that's true, but std::min refuses to compare a size_t and a unit64_t. Anyway, I fixed this by casting one of the two sides of the comparison. But this was causing problems because we have an api (ReadImageData) that takes a uint64_t for the offset and a size_t for the size. That seems a little weird to me, why are these different types?

I am sorry for this problem, thank you for the fix! Evidently size_t can also be 32 bit, like in this case
To be more precise ReadImageData should take an lldb::offset_t as first argument (which is indeed an uint64_t) and it should use the same type for the size; I'll clean up this in a separate patch.

Do you have commit access?

No, I certainly don't have commit access, this would be my first accepted patch. :)

Well.. congratulations. :) I was making sure you are able to commit this, but it looks like you already have that covered.