This is an archive of the discontinued LLVM Phabricator instance.

[Bitcode] Add BITCODE_SIZE_BLOCK_ID to encode the size of the bitcode
AbandonedPublic

Authored by MaskRay on Aug 29 2020, 9:35 PM.

Details

Summary

Currently a bitcode file has no terminator. If several bitcode files are
concatenated, there is no suitable tool splitting them. For example:

clang -fembed-bitcode=all -c a.c b.c
ld a.o b.o
objcopy --dump-section=.llvmbc=a.bc a.out /dev/null
llvm-dis a.bc  # LLVM ERROR: Invalid encoding

This is because llvm-dis does not terminate after reaching the end of the first
stream, and decodes the start of the second bitcode file as the continuation
instead, and fail.

This patch introduces an optional BITCODE_SIZE_BLOCK_ID block to encode the size
of the stream. The block can be placed anywhere in the stream. For a streaming
writer, BITCODE_SIZE_BLOCK_ID can be placed at the end as a terminator. If
random access is supported, BITCODE_SIZE_BLOCK_ID can be placed earlier to allow
fast filtering (by parsing some metadata records like MODULE_CODE_SOURCE_FILENAME).

llvm-dis is taught to handle concatenated streams.

Diff Detail

Event Timeline

MaskRay created this revision.Aug 29 2020, 9:35 PM
MaskRay requested review of this revision.Aug 29 2020, 9:35 PM
MaskRay edited the summary of this revision. (Show Details)Aug 29 2020, 9:49 PM
dougpuob added inline comments.
llvm/lib/Bitcode/Reader/BitcodeAnalyzer.cpp
84

Seems like a typo.

llvm/lib/Bitcode/Reader/BitcodeReader.cpp
6504–6505

About the boundary checking issue, unsigned integer wrapping. Is it possible that the BCBegin is greater than Size? if so, is it possible that the `return error() path will be executed?

I'm a naive to compiler, just curious about it.

MaskRay marked an inline comment as done.Aug 30 2020, 12:09 AM
MaskRay added inline comments.
llvm/lib/Bitcode/Reader/BitcodeReader.cpp
6504–6505

The check is to catch erroneous cases (if Size in BITCODE_SIZE_BLOCK_ID block is incorrectly set). Otherwise it is difficult to catch such errors.

dougpuob added inline comments.Aug 30 2020, 10:29 PM
llvm/lib/Bitcode/Reader/BitcodeReader.cpp
6504–6505

Got it, thank you :)

I am a bit worry that linker might concatenate bitcode file with padding to achieve alignment requirement, etc. I guess you can create a termination block to mark the end but it is hard to seek the next start.

MaskRay marked an inline comment as done.Aug 31 2020, 4:22 PM

I am a bit worry that linker might concatenate bitcode file with padding to achieve alignment requirement, etc. I guess you can create a termination block to mark the end but it is hard to seek the next start.

We can set the section alignment to 1 to avoid padding: (I rushed a bit, sorry: 6ae7b403c3e1aebcb825d3dd4777d3c1149d6d67)

@bartell mentioned something about a bitcode wrapper that does provide the length - how does this patch's strategy compare to that approach? Are they redundant? Should the wrapper be replaced with this patch's approach? (or should the bitcode wrapper approach be used instead of adding this patch?)

I am a bit worry that linker might concatenate bitcode file with padding to achieve alignment requirement, etc. I guess you can create a termination block to mark the end but it is hard to seek the next start.

We can set the section alignment to 1 to avoid padding: (I rushed a bit, sorry: 6ae7b403c3e1aebcb825d3dd4777d3c1149d6d67)

I didn't see that. I don't really have concerns then.

@bartell mentioned something about a bitcode wrapper that does provide the length - how does this patch's strategy compare to that approach? Are they redundant? Should the wrapper be replaced with this patch's approach? (or should the bitcode wrapper approach be used instead of adding this patch?)

I guess that is an implementation choice. This approach prefers a simple linker implementation, while the bitcode wrapper requires linker know to treat this section differently and write bitcode wrapper into section.
@MaskRay Do you have any specific use case in mind for this? If you are tied to a linker like lld, it might be better just teach lld to treat this section differently so we don't need worry about concatenated bitcode file.

@bartell mentioned something about a bitcode wrapper that does provide the length - how does this patch's strategy compare to that approach? Are they redundant? Should the wrapper be replaced with this patch's approach? (or should the bitcode wrapper approach be used instead of adding this patch?)

I guess that is an implementation choice. This approach prefers a simple linker implementation, while the bitcode wrapper requires linker know to treat this section differently and write bitcode wrapper into section.
@MaskRay Do you have any specific use case in mind for this? If you are tied to a linker like lld, it might be better just teach lld to treat this section differently so we don't need worry about concatenated bitcode file.

The linker wouldn't need to know about the wrapper. We would just change Clang to always emit the wrapper in .llvmbc.

The "bitcode wrapper" I'm referring to is the simple header written by emitDarwinBCHeaderAndTrailer and read by SkipBitcodeWrapperHeader. It's currently only written for bitcode files that use Darwin/Mach-O targets, but it should be read successfully regardless of target.

Rather than adding new fields to the bitcode itself as this patch does, I would recommend using the "bitcode wrapper" instead, just for the sake of reusing code. We could either enable it for all uses of WriteBitcodeToFile, or just use it in Clang when generating the .llvmbc section.

I am a bit worry that linker might concatenate bitcode file with padding to achieve alignment requirement, etc. I guess you can create a termination block to mark the end but it is hard to seek the next start.

We can set the section alignment to 1 to avoid padding: (I rushed a bit, sorry: 6ae7b403c3e1aebcb825d3dd4777d3c1149d6d67)

I didn't see that. I don't really have concerns then.

@bartell mentioned something about a bitcode wrapper that does provide the length - how does this patch's strategy compare to that approach? Are they redundant? Should the wrapper be replaced with this patch's approach? (or should the bitcode wrapper approach be used instead of adding this patch?)

I guess that is an implementation choice. This approach prefers a simple linker implementation, while the bitcode wrapper requires linker know to treat this section differently and write bitcode wrapper into section.
@MaskRay Do you have any specific use case in mind for this? If you are tied to a linker like lld, it might be better just teach lld to treat this section differently so we don't need worry about concatenated bitcode file.

It is not tied to a linker like lld. I believe in most binary formats, if you specify an alignment of 1, their linkers will concatenate input sections in the output without padding. Actually, the section content of .llvmbc is entirely opaque to lld/GNU ld/gold. ("dumb linker, smart format" design).

If the bitcode wrapper requires the linker to understand its format, I'd vote against that solution. (It will not work with GNU ld/gold.)

I am a bit worry that linker might concatenate bitcode file with padding to achieve alignment requirement, etc. I guess you can create a termination block to mark the end but it is hard to seek the next start.

We can set the section alignment to 1 to avoid padding: (I rushed a bit, sorry: 6ae7b403c3e1aebcb825d3dd4777d3c1149d6d67)

I didn't see that. I don't really have concerns then.

@bartell mentioned something about a bitcode wrapper that does provide the length - how does this patch's strategy compare to that approach? Are they redundant? Should the wrapper be replaced with this patch's approach? (or should the bitcode wrapper approach be used instead of adding this patch?)

I guess that is an implementation choice. This approach prefers a simple linker implementation, while the bitcode wrapper requires linker know to treat this section differently and write bitcode wrapper into section.
@MaskRay Do you have any specific use case in mind for this? If you are tied to a linker like lld, it might be better just teach lld to treat this section differently so we don't need worry about concatenated bitcode file.

It is not tied to a linker like lld. I believe in most binary formats, if you specify an alignment of 1, their linkers will concatenate input sections in the output without padding. Actually, the section content of .llvmbc is entirely opaque to lld/GNU ld/gold. ("dumb linker, smart format" design).

If the bitcode wrapper requires the linker to understand its format, I'd vote against that solution. (It will not work with GNU ld/gold.)

Per @bartell 's last comment, it sounds like it doesn't require the linker to understand the format (the handling is on the bitcode reader/writer side). Do we know whether that would work here instead?

The idea is to add a terminator block, but a terminator carries less information so this patch intends to add a size block instead. With a terminator, the following problem can be better addressed

// We may be consuming bitcode from a client that leaves garbage at the end
// of the bitcode stream (e.g. Apple's ar tool). If we are close enough to
// the end that there cannot possibly be another module, stop looking.

If the container does not expect more than one bitcode stream. Then after passing one stream, it can bail out.
For users who want to have multiple streams, they can strip any padding. The scheme will automatically work with most binary formats (alignment=1).

The idea is to add a terminator block, but a terminator carries less information so this patch intends to add a size block instead. With a terminator, the following problem can be better addressed

// We may be consuming bitcode from a client that leaves garbage at the end
// of the bitcode stream (e.g. Apple's ar tool). If we are close enough to
// the end that there cannot possibly be another module, stop looking.

If the container does not expect more than one bitcode stream. Then after passing one stream, it can bail out.
For users who want to have multiple streams, they can strip any padding. The scheme will automatically work with most binary formats (alignment=1).

It'd be nice to not have two different tools to solve what seem to be fairly similar problems. Perhaps the MachO/bitcode wrapper could be simplified if the alignment issue was fixed there? (not sure if it needs the full header, etc) & be a simple prefix length, perhaps. I'm probably a bit biased by DWARF, but it seems nice to be able to read a length fairly quickly/easily and skip the rest, rather than having to parse the contents to find out where it ends.

MaskRay abandoned this revision.Oct 20 2020, 10:02 AM

BitcodeReader.h has the following code

/// SkipBitcodeWrapperHeader - Some systems wrap bc files with a special
/// header for padding or other reasons.  The format of this header is:
///
/// struct bc_header {
///   uint32_t Magic;         // 0x0B17C0DE
///   uint32_t Version;       // Version, currently always 0.
///   uint32_t BitcodeOffset; // Offset to traditional bitcode file.
///   uint32_t BitcodeSize;   // Size of traditional bitcode file.
///   ... potentially other gunk ...
/// };

The header contains the size so I assume this patch is not needed.