This is an archive of the discontinued LLVM Phabricator instance.

[WebAssembly] Fixed wasm64 DWARF using 64-bit code pointer sizes
AbandonedPublic

Authored by aardappel on Nov 10 2020, 1:25 PM.

Details

Summary

As per https://reviews.llvm.org/D90503, a "code pointer" is always 32-bit in Wasm.

The emitter for DW_FORM_addr used AsmPrinter::getPointerSize (which calls TargetMachine::getPointerSize and then DataLayout::getPointerSize) which is 64-bit. Since those last calls are non-virtual, there is no easy way to fix this path for Wasm.

Thus I added a AsmPrinter::getCodePointerSize which uses MCAsmInfo::getCodePointerSize instead, which hopefully should be correct for other platforms as well.

Diff Detail

Event Timeline

aardappel created this revision.Nov 10 2020, 1:25 PM
Herald added a project: Restricted Project. · View Herald TranscriptNov 10 2020, 1:25 PM
aardappel requested review of this revision.Nov 10 2020, 1:25 PM

Note this has not been tested on non-Wasm platforms yet. Wanted to first get feedback if this is a legit way to solve the problem, or if there are better ways.

Does this mean, for instance, that sizeof(int*) is different from sizeof(void(*)())?

I'm not sure how DWARF handles that - for instance the address of a global variable (in the DW_AT_location of a DW_AT_variable) compared to the starting address of a function (in the DW_AT_low_pc in a DW_TAG_subprogram) use the same concept of address encoding - so I'm not sure if/how DWARF handles having different data/code pointer sizes, perhaps it has to use the larger of the two for both? You might need to find another architecture that has such a feature and, say, already has GCC support for DWARF to see how it's handled there?

@dblaikie Confusingly, these are all different: function pointers at runtime (in a Wasm VM) are 32-bit indices. LLVM function pointers are 64-bit in wasm64 for consistency, but get truncated when lowered in Isel. Then here we have a 3rd kind of code pointer, just for DWARF, since Wasm doesn't have the concept of a pointer to an instruction inside a function (which DWARF needs for DW_AT_low_pc, and we need to relocate).

And yes, if this is also used for globals, and there is an architecture for which code pointers would be smaller than global pointers, then this patch would break them.

An alternative solution for Wasm would be to just go along with the expected 64-bit size for DW_FORM_addr, but that would mean wasted space, and a new reloc type for us. Also means DWARF data between wasm32 and wasm64 is different.

As for how to make a test for this.. The code before this patch would (I assume) write 4 extra bytes, which would cause dwarfdump to show incorrect information. Our existing test however wasn't affected. So to cover this patch against regression would require some larger dwarfdump example I guess..

@dblaikie Confusingly, these are all different: function pointers at runtime (in a Wasm VM) are 32-bit indices. LLVM function pointers are 64-bit in wasm64 for consistency, but get truncated when lowered in Isel. Then here we have a 3rd kind of code pointer, just for DWARF, since Wasm doesn't have the concept of a pointer to an instruction inside a function (which DWARF needs for DW_AT_low_pc, and we need to relocate).

And yes, if this is also used for globals, and there is an architecture for which code pointers would be smaller than global pointers, then this patch would break them.

An alternative solution for Wasm would be to just go along with the expected 64-bit size for DW_FORM_addr, but that would mean wasted space, and a new reloc type for us. Also means DWARF data between wasm32 and wasm64 is different.

Not sure we're perhaps talking about the same thing or there's some overlap/confusion.

What's the actual sizeof(int*) and sizeof(int(*)()) if I printed those out in some C++ code compiled to wasm? The size has to be known by the frontend/at the language level, so it can't change due to lowering.
If I have an array of int*, that array needs elements of a size known by the frontend/language-level, similarly if I have an array of int(*)() - and in both cases I'd need to use some relocation to fill in my array initializer that initializes that array?

I'd suggest doing whatever happens for that case in DWARF too.

@dblaikie Confusingly, these are all different: function pointers at runtime (in a Wasm VM) are 32-bit indices. LLVM function pointers are 64-bit in wasm64 for consistency, but get truncated when lowered in Isel. Then here we have a 3rd kind of code pointer, just for DWARF, since Wasm doesn't have the concept of a pointer to an instruction inside a function (which DWARF needs for DW_AT_low_pc, and we need to relocate).

And yes, if this is also used for globals, and there is an architecture for which code pointers would be smaller than global pointers, then this patch would break them.

An alternative solution for Wasm would be to just go along with the expected 64-bit size for DW_FORM_addr, but that would mean wasted space, and a new reloc type for us. Also means DWARF data between wasm32 and wasm64 is different.

Not sure we're perhaps talking about the same thing or there's some overlap/confusion.

What's the actual sizeof(int*) and sizeof(int(*)()) if I printed those out in some C++ code compiled to wasm? The size has to be known by the frontend/at the language level, so it can't change due to lowering.
If I have an array of int*, that array needs elements of a size known by the frontend/language-level, similarly if I have an array of int(*)() - and in both cases I'd need to use some relocation to fill in my array initializer that initializes that array?

At the C/C++/llvm layers all pointer are the same size (i.e. 64bit on wasm64), including function pointers. The difference with wasm is that those functions pointers are not in the same address space as the bytecode offsets used in the DWARF information.

Function pointers in C/C++llvm are indexes into the indirect function pointer table that wasm maintains.

IIUC code offsets in DWARF are byte offsets into the wasm bytecode file (which it completely inaccessible from C/C++/llvm).

@dblaikie Confusingly, these are all different: function pointers at runtime (in a Wasm VM) are 32-bit indices. LLVM function pointers are 64-bit in wasm64 for consistency, but get truncated when lowered in Isel. Then here we have a 3rd kind of code pointer, just for DWARF, since Wasm doesn't have the concept of a pointer to an instruction inside a function (which DWARF needs for DW_AT_low_pc, and we need to relocate).

And yes, if this is also used for globals, and there is an architecture for which code pointers would be smaller than global pointers, then this patch would break them.

An alternative solution for Wasm would be to just go along with the expected 64-bit size for DW_FORM_addr, but that would mean wasted space, and a new reloc type for us. Also means DWARF data between wasm32 and wasm64 is different.

Not sure we're perhaps talking about the same thing or there's some overlap/confusion.

What's the actual sizeof(int*) and sizeof(int(*)()) if I printed those out in some C++ code compiled to wasm? The size has to be known by the frontend/at the language level, so it can't change due to lowering.
If I have an array of int*, that array needs elements of a size known by the frontend/language-level, similarly if I have an array of int(*)() - and in both cases I'd need to use some relocation to fill in my array initializer that initializes that array?

At the C/C++/llvm layers all pointer are the same size (i.e. 64bit on wasm64), including function pointers. The difference with wasm is that those functions pointers are not in the same address space as the bytecode offsets used in the DWARF information.

Ah, so the language level doesn't support the ability to point to sub-function granularity. Yeah... that might be pretty esoteric/weird for DWARF to understand that its code pointers are vastly different from language code pointers. I doubt DWARF consumers are written with support for that concept - I expect they'd intend to read a code pointer from program memory (eg: a void (*)()) and expect it means the same thing as reading a code pointer from the DWARF data itself (eg: low_pc of a function, etc).

I think probably before this patch/direction is pursued it'd be reasonable to have some data on what existing DWARF consumers do here/whether any are being made to handle wasm in some way, and if so what the plan is for that sort of possible divergence between code pointers in the program/process and those in the DWARF.

I expect they'd intend to read a code pointer from program memory (eg: a void (*)()) and expect it means the same thing as reading a code pointer from the DWARF data itself (eg: low_pc of a function, etc).

There is simply no way to do this in WebAssembly. The code pointers used by the user code are table indexes they will never match bytecode offsets (which is what low_pc is). Given that we already have DWARF consumers of WebAssembly DWARF pointers I wonder how they handle this today?

For example 1, 2, and 3 are perfectly valid function addresses in user code but make no sense as bytecode offsets. @yurydelendik will know more about this is handled.

Given that we already have DWARF consumers of WebAssembly DWARF pointers I wonder how they handle this today?
For example 1, 2, and 3 are perfectly valid function addresses in user code but make no sense as bytecode offsets.

LLVM outputs function references as a pointer type in DWARF, though there are not. We were lucky that a function "pointer" matches size of memory pointer. The answer to above question is that they are handled as memory pointers -- though they have to be a case of unsigned int / index.

IMHO it is nice to fix how function references type expressed in DWARF. Also make "bytecode offsets" / low_pc match size of architecture, e.g. for wasm64 make it 64bit. On different (memory) note, there is https://github.com/yurydelendik/webassembly-dwarf/issues/10 to open conversations about addressing multiple memories -- this could be a way to address 64bit memory in wasm32.

@yurydelendik note that wasm64 is an LLVM-level construct, in general Wasm we may one day have multiple memories, which means that "being 64-bit" is a property of each load/store instruction individually. So assuming we have a DWARF section associated with a possibly mixed mode Wasm module, there's no way to determine the size of DW_AT_low_pc or DW_FORM_addr from the module.

Granted, this is a little hypothetical, since we don't know if wasm-ld will ever support such mixed linking (or another linker will), or wether DWARF will be output by non-LLVM producers, but the fact remains that Wasm really doesn't have an "architecture" in the same way as other ISAs.

Thus, it seems easier to me to hard-code DW_FORM_addr to always be 32 or 64-bit for Wasm. Either choice is going to give the problem this patch is trying to address, though.

@dblaikie As @sbc100 indicated, this particular value is purely for use by DWARF, and has no relation to any other values in the Wasm ecosystem. So I am not sure there is a comparable situation in another target. The question remains thus how to plumb this to DIEInteger::SizeOf, and if indeed this is so exceptional, maybe a direct check against the Wasm triple there is better than trying to make the concept of "code pointer" generic for other architectures.

@yurydelendik note that wasm64 is an LLVM-level construct, in general Wasm we may one day have multiple memories, which means that "being 64-bit" is a property of each load/store instruction individually. So assuming we have a DWARF section associated with a possibly mixed mode Wasm module, there's no way to determine the size of DW_AT_low_pc or DW_FORM_addr from the module.

Granted, this is a little hypothetical, since we don't know if wasm-ld will ever support such mixed linking (or another linker will), or wether DWARF will be output by non-LLVM producers, but the fact remains that Wasm really doesn't have an "architecture" in the same way as other ISAs.

Thus, it seems easier to me to hard-code DW_FORM_addr to always be 32 or 64-bit for Wasm. Either choice is going to give the problem this patch is trying to address, though.

Agreed. My previous comment was more of a wish, and I have no opinion for this particular patch.

aardappel abandoned this revision.Nov 12 2020, 1:54 PM

After discussion with more Wasm+DWARF interested parties, it became clear that asking producers and consumers of DWARF to treat DW_FORM_addr as anything other than architecture dependent is going to be problematic, so the way forward for now is to make this field 64-bit on wasm64 as intended, and leave it 32-bit on wasm32.

I will thus abandon this patch (and revert the changes in the previous patch) and instead implement a 64-bit code pointer reloc for Wasm for the sake of DWARF, which overal will be easier to work with.

Consumers will need to either infer 64-bitness from the DWARF header or by checking memory section/import of the Wasm module, as other tools processing wasm64 (Memory64) do. If the current code relies on the DWARF header this may already work correctly transparently.

Memory64 as currently specced allows a single Wasm module to access multiple memories with a different bitness each. There are currently no tools that do this, but if they exist in the future, presumably they would have to write out the DWARF belonging to this mixed module as being wasm64.

My own wasm32 lowering pass in Binaryen is going to have to change the "architecture" of the associated DWARF data for example, not sure yet how difficult that will be.