This is an archive of the discontinued LLVM Phabricator instance.

Flush bitcode incrementally for LTO output
ClosedPublic

Authored by stephan.yichao.zhao on Aug 31 2020, 8:48 PM.

Details

Summary

The bitcode writer does not flush its buffer until the end by default. This is
fine for small bitcode files. When -flto,--plugin-opt=emit-llvm,-gmlt are
used, the final bitcode file is large, for example >8G. Keeping all the
data in memory consumes a lot of memory.

This change allows the bitcode writer to flush data to disk early when the
buffered data size exceeds a threshold. This is only enabled when lld emits
LLVM bitcode.

One issue to address is backpatching the bitcode: subblock lengths, function
body indexes, and metadata indexes need to be backfilled. Because the buffer
can now be flushed partially, we introduce raw_fd_stream, which supports
read/seek/write and enables backpatching bitcode already flushed to disk.

Diff Detail

Event Timeline

stephan.yichao.zhao requested review of this revision.Aug 31 2020, 8:48 PM
MaskRay added a comment.EditedAug 31 2020, 9:08 PM

I can understand the read-write stream requirement, but the changes to lib/Support may require an additional set of reviewers. You will need some unit tests (see llvm/unittests/Support/raw_ostream_test.cpp for an example). Probably consider splitting the patch into two.

lld/ELF/LTO.cpp
69
71

return {};

openFile would likely fail for the same reason.

llvm/include/llvm/Support/raw_ostream.h
318
llvm/lib/Support/raw_ostream.cpp
32

Can this be avoided?

stephan.yichao.zhao marked 4 inline comments as done.

addressed comments

I can understand the read-write stream requirement, but the changes to lib/Support may require an additional set of reviewers. You will need some unit tests (see llvm/unittests/Support/raw_ostream_test.cpp for an example). Probably consider splitting the patch into two.

created D86913 with unit tests.

lld/ELF/LTO.cpp
71

openFile opens a raw_fd_ostream (write-only) but not a raw_fd_stream (read-write-seek). So openFile may work on a platform that does not support seek or read-write mode.

llvm/include/llvm/Support/raw_ostream.h
318

Thank you.
I implemented classof, and then dyn_cast works!
But I did not make Kind a class member initialized in the constructor like that link does. Making that change would touch a lot of code in the various child classes. Is the way Kind is defined in the link a requirement in the LLVM codebase?

evgeny777 added inline comments.
llvm/include/llvm/Bitstream/BitstreamWriter.h
160

Can we use memory mapped I/O and avoid backpatching on disk?

llvm/include/llvm/Support/raw_ostream.h
562

Comment is misleading

MaskRay added inline comments.Sep 1 2020, 12:16 PM
llvm/include/llvm/Support/raw_ostream.h
318

Can Kind use a default member initializer?

stephan.yichao.zhao marked 2 inline comments as done.Sep 1 2020, 2:53 PM
stephan.yichao.zhao added inline comments.
llvm/include/llvm/Bitstream/BitstreamWriter.h
160

Our use case is likely not what mmap is good at.

I assume mmap on Linux loads pages on demand. If code reads/writes data on pages that are already loaded, the access has no I/O cost; for example, code that randomly accesses a chunk of contiguous addresses, or addresses within the same page. Although the memory copy and page fault cost are still paid the first time a page is loaded, that cost is asymptotically negligible.

Our case is a bit different. Given a 512M incremental flush threshold, I tested an LTO build that outputs a 5G bitcode file. BackpatchWord is called 16,613,927 times, among which only 12 calls need a disk seek. Also, each access touches only 4-8 bytes on a page, and all the visited pages are far away from each other. It is likely that the pages are not cached and need to be loaded anyway, and after a load our code does not access enough data on the page to amortize the page fault cost. So its cost could be very similar to a seek.

Note that if a BackpatchWord needs to access disk, we need one seek to load the existing data, one seek to overwrite the data, and one seek to jump back. The first two seek addresses are very close, so hopefully the disk cache can handle them. Although the final jump back is a very long seek, if the page cache is based on time or frequency, the page it jumps back to may not have been evicted yet. Overall, the ratio of disk accesses introduced is very small, so hopefully the additional cost is small. I also did a perf profile; no observable latency showed up (because LTO takes so much time).

Given the above, and that mmap support differs across systems, the seek-based approach seems fine.

llvm/include/llvm/Support/raw_ostream.h
562

Will be updating this at https://reviews.llvm.org/D86913

stephan.yichao.zhao marked 3 inline comments as done.Sep 1 2020, 4:11 PM
stephan.yichao.zhao added inline comments.
llvm/include/llvm/Support/raw_ostream.h
318

yes. will be updating https://reviews.llvm.org/D86913

stephan.yichao.zhao marked an inline comment as done.Sep 1 2020, 4:12 PM

Our case is a bit different. Given a 512M incremental flush threshold, I tested an LTO build that outputs a 5G bitcode file. BackpatchWord is called 16,613,927 times, among which only 12 calls need a disk seek. Also, each access touches only 4-8 bytes on a page, and all the visited pages are far away from each other. It is likely that the pages are not cached and need to be loaded anyway, and after a load our code does not access enough data on the page to amortize the page fault cost. So its cost could be very similar to a seek.

It seems that you're trying to implement your own I/O caching. I don't understand why you're not letting the OS do this for you. For instance, on systems with a larger amount of memory (I have 64 GB on my home PC; a typical build server may have even more), mmap will buffer your whole 5G bc file in memory and then write it back to disk without any seek operations (which are costly on a traditional HDD).

Given the above, and that mmap support differs across systems, the seek-based approach seems fine.

LLVM has a FileOutputBuffer class which abstracts the underlying OS differences. The lld linker uses it for output file generation.

Our case is a bit different. Given a 512M incremental flush threshold, I tested an LTO build that outputs a 5G bitcode file. BackpatchWord is called 16,613,927 times, among which only 12 calls need a disk seek. Also, each access touches only 4-8 bytes on a page, and all the visited pages are far away from each other. It is likely that the pages are not cached and need to be loaded anyway, and after a load our code does not access enough data on the page to amortize the page fault cost. So its cost could be very similar to a seek.

It seems that you're trying to implement your own I/O caching. I don't understand why you're not letting the OS do this for you. For instance, on systems with a larger amount of memory (I have 64 GB on my home PC; a typical build server may have even more), mmap will buffer your whole 5G bc file in memory and then write it back to disk without any seek operations (which are costly on a traditional HDD).

My local machine also has a lot of memory, which makes this work. :) The problem is that when LTO-ing thousands of such targets, the build server I am using throttles memory usage.
ThinLTO (https://www.youtube.com/watch?v=9OIEZAj243g) has a similar motivation: build services do not allow memory consumption above some threshold of gigabytes.

Although a disk seek has a cost, it happens fewer than 1 in a million times, and only when generating the large bitcode files merged by LTO. A bitcode file generated from a single compilation unit is much smaller and does not need any disk seeks.
So in practice the disk-seek overhead occurs with a very small probability, in a trade-off that saves a lot of memory when a build service builds thousands of large targets.

Given the above, and that mmap support differs across systems, the seek-based approach seems fine.

LLVM has a FileOutputBuffer class which abstracts the underlying OS differences. The lld linker uses it for output file generation.

Thank you for sharing FileOutputBuffer. It is a useful platform-independent wrapper over mmap.
Since it uses mmap, it may not need to buffer the entire file contents in memory, while still leveraging the OS page management.
But then it still needs to reload evicted pages, which is similar to a seek; and if it buffers all 5G in memory, the memory issue still exists.

The current WriteBitcodeToFile API assumes a raw_ostream argument. If we used FileOutputBuffer, we might want to encapsulate it in a subclass of raw_ostream like raw_fd_stream.
So raw_fd_stream could be extended to use FileOutputBuffer internally when necessary in the future.

MaskRay added a comment.EditedSep 2 2020, 3:23 PM

WriteBitcodeToFile does not know the size beforehand - this makes FileOutputBuffer impractical. I think raw_fd_stream is still needed.

Can you rebase this on top of D86913 so that it doesn't include those changes?

tejohnson added inline comments.Sep 11 2020, 7:06 PM
llvm/include/llvm/Bitstream/BitstreamWriter.h
98

Would it be valuable to make this configurable? How sensitive is performance to the value chosen here?

156

Why is this guarded by NDEBUG? I'm not convinced there is much value in doing this code even when StartBit is 0 in the debug case.

stephan.yichao.zhao marked 2 inline comments as done.

addressed comments

llvm/include/llvm/Bitstream/BitstreamWriter.h
98

Added a flag to plugin-opt. Is this the right way to do it?

156

This relates to the assert at line 165.

In debug mode, the assert at line 165 checks that the value to backpatch is 0, so the code needs to read the data from disk.

In non-debug mode, the assert at line 165 does not verify the existing data, so the code does not need to read data from disk for that reason.
But if StartBit is nonzero, the code still needs to read the existing data because the backpatched value is not aligned.
For example, when backpatching with StartBit = 2, the aligned data on disk are:

c0 00 00 00 3f

So we need to read them out, fill in those 0s, then write back. Although the code could always read these bytes from disk, I wanted to save some I/O overhead.

Added comments.

Please mention in the summary somewhere that this is only enabled for lld right now.

lld/ELF/LTO.cpp
172

Nit, document constant parameters with /*parameter_name=*/

llvm/include/llvm/Bitstream/BitstreamWriter.h
98

Oh sorry, I just meant an llvm internal option (cl::opt<int>) in this file. Will let @MaskRay comment on whether they want it as an lld option.

152

s/path/patch/. But I think the whole comment block would be clearer if written in the affirmative sense, e.g. something like:

// When unaligned, copy existing data into Bytes from the file FS and the buffer Out so
// that it can be updated before writing. For debug builds read bytes unconditionally 
// in order to check that the existing value is 0 as expected.

Also as noted below suggest moving comment just above the #ifdef check below.

156

Ah ok. I suggest moving that assert up under the #ifdef too then just for clarity, since they go together logically. And as suggested above, move that whole comment about filling in Bytes to here just above the #ifdef.

165

Why is this line added? Oh I see, presumably to avoid an unused variable warning in the NDEBUG case. Maybe just add a comment to that effect.

stephan.yichao.zhao marked 2 inline comments as done.

addressed comments

stephan.yichao.zhao marked 3 inline comments as done.Sep 12 2020, 1:34 PM
stephan.yichao.zhao added inline comments.
lld/ELF/LTO.cpp
172

reverted the change at lld options.

llvm/include/llvm/Bitstream/BitstreamWriter.h
98

Switched to cl::opt. Thank you for the suggestion; I did not know about this option.

What is the proper way to pass these options to clang or lld? For example, I tried clang/ld.lld with --bitcode-mdindex-threshold=1 or -bitcode-mdindex-threshold=1, but they are not accepted. Do we need a prefix before them?

stephan.yichao.zhao marked an inline comment as done.Sep 12 2020, 1:37 PM
stephan.yichao.zhao added inline comments.
llvm/include/llvm/Bitstream/BitstreamWriter.h
98

Added the cl::opt in BitcodeWriter.cpp instead of here. BitstreamWriter.h does not have a corresponding .cpp file; if we added one, we would also need to add a library, and much code would need to be updated to use the new library, like this (https://reviews.llvm.org/D63899).

stephan.yichao.zhao edited the summary of this revision. (Show Details)Sep 12 2020, 6:27 PM

There are a number of single-statement if and for bodies in the patch that have braces but should not, per LLVM coding style.

llvm/include/llvm/Bitstream/BitstreamWriter.h
156

I just realized the first part of this suggestion doesn't make sense. What I should have said is to move the assert up within the braces, since it goes with that code. I.e., if the code within the braces isn't executed, Bytes isn't filled in, so there is no point asserting whether Bytes is 0.

stephan.yichao.zhao marked an inline comment as done.

updated

Fixed the if and for clause style issues.

tejohnson accepted this revision.Sep 13 2020, 8:02 AM

lgtm but please wait to see if @MaskRay or @evgeny777 have more comments

This revision is now accepted and ready to land.Sep 13 2020, 8:02 AM
This revision was landed with ongoing or failed builds.Sep 16 2020, 8:33 PM
This revision was automatically updated to reflect the committed changes.

See comment about a VC++ warning being generated.

llvm/include/llvm/Bitstream/BitstreamWriter.h
188

Microsoft Visual C++ is warning that this needs to be char Bytes[9].

Can this be updated (at least to silence the warning; we treat warnings as errors in our builds)?

llvm/include/llvm/Bitstream/BitstreamWriter.h
188

Yes, I will fix it.
What is the name of this warning from Visual C++?

stephan.yichao.zhao marked an inline comment as done.Sep 18 2020, 9:44 AM
stephan.yichao.zhao added inline comments.
dstuttard added inline comments.Sep 21 2020, 1:28 AM
llvm/include/llvm/Bitstream/BitstreamWriter.h
188

Sorry - should have included that in the original report. It's warning C6386:

\llvm-project\llvm\include\llvm\Bitstream\BitstreamWriter.h(177) : warning C6386: Buffer overrun while writing to 'Bytes': the writable size
is '8' bytes, but '9' bytes might be written.

MTC added a subscriber: MTC.Oct 9 2021, 12:51 AM

This looks like a nice idea: can you include perf numbers in the description?

llvm/include/llvm/Bitcode/BitcodeWriter.h
50

I think this is the main public API entry point change?
Likely worth updating the doc clearly here.

This looks like a nice idea: can you include perf numbers in the description?

I didn't notice this was already committed, but numbers are still interesting to go with the change, can you post some here?

I didn't notice this was already committed, but numbers are still interesting to go with the change, can you post some here?

I don't have statistical results.
This change mainly helps large binaries. In my case, the binary is ~8G and the other RAM used for LTO is ~40G, so this change reduces memory overhead by 17%.
This 17% reduction is important to me because otherwise I would have to find a machine with more than the normal RAM limit to build this binary.
For small applications, this change does not help much.

I didn't notice this was already committed, but numbers are still interesting to go with the change, can you post some here?

I don't have statistical results.
This change mainly helps large binaries. In my case, the binary is ~8G and the other RAM used for LTO is ~40G, so this change reduces memory overhead by 17%.
This 17% reduction is important to me because otherwise I would have to find a machine with more than the normal RAM limit to build this binary.

Thanks, this is exactly the kind of "numbers" I was looking for actually.