This is an archive of the discontinued LLVM Phabricator instance.

[BitcodeReader] Allow reading pointer types from old IR
ClosedPublic

Authored by sebastian-ne on Jun 14 2022, 2:09 AM.

Details

Summary

When opaque pointers are enabled and old IR with typed pointers is read,
the BitcodeReader automatically upgrades all typed pointers to opaque
pointers. This is a lossy conversion, i.e. when a function argument is a
pointer and unused, it’s impossible to reconstruct the original type
behind the pointer.

There are cases where the type information of pointers is needed. One is
reading DXIL, which is bitcode of old LLVM IR and makes a lot of use of
pointers in function signatures.
We’d like to keep using up-to-date llvm to read in and process DXIL, so
in the face of opaque pointers, we need some way to access the type
information of pointers from the read bitcode.

This patch allows extracting type information by supplying a function to
parseBitcodeFile that gets called for each function signature. This
function can access the type information via the reader’s type IDs and
the getTypeByID and getContainedTypeID functions.
The AccessBitcodeTypeInfo test exemplarily shows how type info from
pointers can be stored in metadata for use after the BitcodeReader
finished.

Diff Detail

Event Timeline

sebastian-ne created this revision.Jun 14 2022, 2:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 14 2022, 2:09 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
sebastian-ne requested review of this revision.Jun 14 2022, 2:09 AM
Herald added a project: Restricted Project. · View Herald TranscriptJun 14 2022, 2:09 AM
nikic added a comment.Jun 15 2022, 4:21 AM

I think exposing something like this is reasonable. My high-level question would be whether function signatures are the only place where this is needed or whether we would benefit from a more generic mechanism -- for example, does a similar use-case for calls exist?

beanz added a comment.Jun 15 2022, 7:11 AM

This approach seems reasonable to me, but I'm curious about a meta question here.

DXIL and SPIR-V (and I'm sure other targets as well), require some degree of support for typed pointers.

Might a more general annotation in the IR make sense, then the bitcode reader could annotate the module with type information which would persist after the module is created.

We have a guarantee that we can read older bitcode, but it seems weird to amend the bitcode reader to support a specific version of older LLVM IR.
What's the use case for wanting to read and preserve everything in the DXIL?

I think exposing something like this is reasonable.

That’s encouraging, thanks for the feedback.

My high-level question would be whether function signatures are the only place where this is needed or whether we would benefit from a more generic mechanism -- for example, does a similar use-case for calls exist?

Good point. As far as I know, dxil does not use indirect calls, so one can always do CallInst->getCalledFunction() and get types from the function signature.
I’ll try to get some opaque pointer support into our existing software to see if there’s more needed.

To allow extending the callback (potentially later), we could give the lambda a Value instead of a Function?

Might a more general annotation in the IR make sense, then the bitcode reader could annotate the module with type information which would persist after the module is created.

The elementtype(<ty>) attribute on function arguments sounded sounds quite fitting as a general annotation. It’s explicitly restricted to intrinsics though.
I think there’s more problems than preserving type information. As far as I understand, the SPIR-V backend wants to compile things like C code, which allows pointers and union casts, but I think SPIR-V and DXIL both don’t allow to re-interpret memory, so not sure how that’s handled at all.

I found one more place where type information from dxil needs to be extracted and that’s metadata.
The patch now adds a second callback, which allows changing metadata while it’s read in.
In the test, that’s used to replace a pointer value metadata with a tuple of the original value and metadata that stores its type information.
I wasn’t able to store type info about metadata without replacing it because the values get indistinguishable after being read (e.g. an i8* and an i32* both end up as a ptr).

The callback for function types now takes a Value, which should make it easy to extend for CallInstructions or others if that’s needed.

Rebased. Any concerns with the current patch?

High-level comment regarding how to pass callbacks to parser functions: We have quite a few different parser routines, both for textual IR and bitcode, and special modes (e.g. lazy loaders).
This leads to many places where we would add essentially the same arguments, often with default values. Some functions miss these (e.g. getLazyBitcodeModule), because nobody bothered so far to add callback(s) for them.

Maybe we could instead add a ParserCallbackCollection that is passed to parser methods and contains the various callbacks available for parsing?

llvm/lib/Bitcode/Reader/BitcodeReader.cpp
1330–1333

The assumptions here seem to be based on the existing calls from within the parser. If we expose getTypeByID and getContainedTypeID to callers, these might no longer hold.

Maybe use an additional argument that flags external callbacks, and just return nullptr in that case?

llvm/unittests/Bitcode/BitReaderTest.cpp
308

static

sebastian-ne marked an inline comment as done.Jan 12 2023, 9:51 AM
sebastian-ne added inline comments.
llvm/lib/Bitcode/Reader/BitcodeReader.cpp
1330–1333

I’m not completely sure in which case this could break something, but I think the callback added in this patch is fine anyway.
It only gets called on values that are already created in the IR Module, so I assume their type is completely known at this point.

I guess a case where getTypeByID can’t return anything meaningful is for cycles is structs, i.e.

struct A { B*; }
struct B { A*; }

The callbacks added here don’t get called on (potentially incomplete) types and even if they would, I think the current behavior of getTypeByID is the most sensible thing to do.

Assemble callbacks in a struct, as suggested.

Thanks for the refactoring. One more thought inline.

llvm/include/llvm/Bitcode/BitcodeReader.h
73

Now that the callback itself is optional, and the callback receives the data layout that would be used if not for the callback,
I'd like to point out that we *could* let the callback return a string instead of an optional string.

I'd find that slightly better, but I'm not sure it's worth the trouble.

jsilvanus accepted this revision.Jan 13 2023, 12:08 AM
jsilvanus added inline comments.
llvm/include/llvm/Bitcode/BitcodeReader.h
73

On second thought, returning an optional has the advantage of being explicit about whether the DL str was changed or not, making life easier for the parser. For example, the textual parser keeps track of the source location of the DL string for diagnostics, and discards that location if the callback replaces the DL string. That would be less convenient to do with strings returned.

This revision is now accepted and ready to land.Jan 13 2023, 12:08 AM
This revision was landed with ongoing or failed builds.Jan 17 2023, 4:20 AM
This revision was automatically updated to reflect the committed changes.
sebastian-ne reopened this revision.Jan 18 2023, 2:40 AM
This revision is now accepted and ready to land.Jan 18 2023, 2:40 AM

I would have prefered a comment instead of a straight revert without giving me a chance to fix it.
Anyway, here’s a version that embeds bitcode instead of textual IR.

nikic accepted this revision.Jan 18 2023, 4:10 AM

LGTM, thanks

llvm/unittests/Bitcode/BitReaderTestCode.h
146

Add a newline before !1?

nikic added inline comments.Jan 18 2023, 4:11 AM
llvm/unittests/Bitcode/BitReaderTest.cpp
352

nit: These calls aren't necessary anymore.

This revision was landed with ongoing or failed builds.Jan 18 2023, 4:20 AM
This revision was automatically updated to reflect the committed changes.
sebastian-ne marked 2 inline comments as done.