Page MenuHomePhabricator

[ELF][LTO] Call madvise(MADV_DONTNEED) on MemoryBuffer instances
ClosedPublic

Authored by MaskRay on Dec 29 2021, 1:15 AM.

Details

Summary

@tejohnson noticed that freeing MemoryBuffer instances right before
lto->compile can save RSS, likely because the memory can be reused by
LTO indexing (e.g. ThinLTO import/export lists).).

For ELFFileBase instances, symbol and section names are backed by MemoryBuffer,
so destroying MemoryBuffer would make some infrequent passes (parseSymbolVersion,
reportBackrefs) crash and make debugging difficult.
For a BitcodeFile, its content is completely unused, but destroying its
MemoryBuffer makes the buffer identifier inaccessible and may introduce
constraints for future changes.
This patch leverages madvise(MADV_DONTNEED) which achieves the major gain
without the latent issues.

Maximum resident set size (kbytes): for a large --thinlto-index-only link:

  • current behavior: 10146104KiB
  • destroy MemoryBuffer instances: 8555240KiB
  • madvise(MADV_DONTNEED) just bitcodeFiles and lazyBitcodeFiles: 8737372KiB
  • madvise(MADV_DONTNEED) all MemoryBuffers: 8739796KiB

Depends on D116366

Diff Detail

Event Timeline

MaskRay created this revision.Dec 29 2021, 1:15 AM
MaskRay requested review of this revision.Dec 29 2021, 1:15 AM
Herald added a project: Restricted Project. · View Herald TranscriptDec 29 2021, 1:15 AM

This can be used by in-process LTO. I haven't tested it yet.

MaskRay edited the summary of this revision. (Show Details)Dec 29 2021, 1:18 AM
MaskRay edited the summary of this revision. (Show Details)

This can be used by in-process LTO. I haven't tested it yet.

Would it be better to try unconditionally marking all/only the bitcode file memory buffers as MADV_DONTNEED?

For the large thin link this change is very close to the same peak RSS reduction as with my change to destroy the buffer objects completely:
Head: 10.7G
Destroy: 8.9G
Don'tNeed: 9.1G

However, when I tried my earlier version which unconditionally freed all the memory buffers, I found that the memory buffer name is used when creating the output indices, which I fixed by saving the name in the InputFile object. Would not doing so with this change cause a time degradation because we will need to pull all these pages back in to get the name from the buffers?

lld/ELF/Driver.cpp
2416–2417

Comment from the second sentence onward should be moved back to the early return, as it relates to that part, or otherwise updated accordingly.

MaskRay updated this revision to Diff 396550.Dec 29 2021, 10:25 AM

update comments

MaskRay edited the summary of this revision. (Show Details)Dec 29 2021, 10:29 AM
MaskRay added reviewers: tejohnson, ikudrin, peter.smith.
MaskRay added a comment.EditedDec 29 2021, 10:44 AM

This can be used by in-process LTO. I haven't tested it yet.

Would it be better to try unconditionally marking all/only the bitcode file memory buffers as MADV_DONTNEED?

(Uploaded the previous diff before seeing your comment.)

Agree. Bitcode files can be completely MADV_DONTNEED even in non --thinlto-index-only mode.
For ELFFileBase files, their section and symbol names are used in non --thinlto-index-only mode and nearly unused in --thinlto-index-only mode.

For the large thin link this change is very close to the same peak RSS reduction as with my change to destroy the buffer objects completely:
Head: 10.7G
Destroy: 8.9G
Don'tNeed: 9.1G

However, when I tried my earlier version which unconditionally freed all the memory buffers, I found that the memory buffer name is used when creating the output indices, which I fixed by saving the name in the InputFile object. Would not doing so with this change cause a time degradation because we will need to pull all these pages back in to get the name from the buffers?

The MemoryBuffer identifier is stored separated from the buffer content, so technically there should be no issue. In my coarse estimate yesterday, I haven't found any time degradation.

I will benchmark this by running more times.

MaskRay added a comment.EditedDec 29 2021, 10:49 AM

BTW: I have refactored how --start-lib is handled recently. LazyObjFile has been removed.

The --start-lib style parsing for BitcodeFile is now in BitcodeFile::parseLazy.
It still uses saver.save(sym.getName()). There are two further improvements:

  • BitcodeFile::parseLazy and BitcodeFile::parse should use the same string (backed by saver) and perhaps share the symbols array as the current non-lazy and lazy ObjFile share.
  • In createBitcodeSymbol, if objSym.getName() is a stable string, we can remove saver.save. This will avoid waste for symbol names, but means that a MemoryBuffer backing a BitcodeFile may still be needed in non --thinlto-index-only mode.

That said, re-reading the string table (STRTAB_BLOCK) pages for BitcodeFile instances may unlikely cause a performance difference. This is a --time-trace dump for a large --thinlto-index-only. Most time is spend in lib/LTO. ld.lld's passes take a tiny portion of total time.

164.864697 Total ExecuteLinker
163.609803 Total Link
158.919291 Total LTO
53.818647 Total ThinLink
4.384058 Total Parse input files
2.107437 Total LowerTypeTestsPass
1.245371 Total Load input files
0.304628 Total Process symbol versions
0.001661 Total Read linker script
0.000595 Total Locate library
0.00054 Total OptModule
0.000366 Total Create output files
0.000227 Total IPSCCPPass
0.00016 Total VerifierPass
0.000128 Total CalledValuePropagationPass
6.9e-05 Total WholeProgramDevirtPass
1.4e-05 Total GlobalDCEPass
9e-06 Total ModuleToPostOrderCGSCCPassAdaptor
9e-06 Total CrossDSOCFIPass
6e-06 Total PGOIndirectCallPromotion
3e-06 Total ReversePostOrderFunctionAttrsPass
2e-06 Total ModuleToFunctionPassAdaptor
2e-06 Total ModuleInlinerWrapperPass
2e-06 Total RequireAnalysisPass<llvm::GlobalsAA, llvm::Module>
1e-06 Total GlobalOptPass
1e-06 Total InferFunctionAttrsPass
1e-06 Total Annotation2MetadataPass
MaskRay updated this revision to Diff 396562.Dec 29 2021, 11:52 AM
MaskRay retitled this revision from [ELF][LTO] --thinlto-index-only: call madvise(MADV_DONTNEED) to [ELF][LTO] Call madvise(MADV_DONTNEED) on BitcodeFile buffers.

decrease memory for in-process LTO as well.

This can be used by in-process LTO. I haven't tested it yet.

Would it be better to try unconditionally marking all/only the bitcode file memory buffers as MADV_DONTNEED?

(Uploaded the previous diff before seeing your comment.)

Agree. Bitcode files can be completely MADV_DONTNEED even in non --thinlto-index-only mode.
For ELFFileBase files, their section and symbol names are used in non --thinlto-index-only mode and nearly unused in --thinlto-index-only mode.

For the large thin link this change is very close to the same peak RSS reduction as with my change to destroy the buffer objects completely:
Head: 10.7G
Destroy: 8.9G
Don'tNeed: 9.1G

However, when I tried my earlier version which unconditionally freed all the memory buffers, I found that the memory buffer name is used when creating the output indices, which I fixed by saving the name in the InputFile object. Would not doing so with this change cause a time degradation because we will need to pull all these pages back in to get the name from the buffers?

The MemoryBuffer identifier is stored separated from the buffer content, so technically there should be no issue. In my coarse estimate yesterday, I haven't found any time degradation.

I will benchmark this by running more times.

I tried a few times and didn't see a slowdown.

However, with the latest version to only free the bitcode files, I am seeing less memory reduction (9.3G vs 9.1G before), and also a 30-40s slowdown (maybe the DenseSet lookups?). Something could probably be done to avoid the lookups, but do you have any idea why the memory reduction would be less? Hmm, probably there are some other non-bitcode files that are being read for symbol resolution but not being used further in index only mode?

MaskRay updated this revision to Diff 396571.EditedDec 29 2021, 2:00 PM

Mark lazyBitcodeFiles as well. Mark just bitcodeFiles and lazyBitcodeFiles should achieve the original saving for distributed ThinLTO usage.

While here, try to support other cases:

  • When ELF object files are significant
  • Or when archives are used instead of --start-lib and many archive members are not extracted
MaskRay marked an inline comment as done.Dec 29 2021, 2:02 PM

I tried a few times and didn't see a slowdown.

However, with the latest version to only free the bitcode files, I am seeing less memory reduction (9.3G vs 9.1G before), and also a 30-40s slowdown (maybe the DenseSet lookups?). Something could probably be done to avoid the lookups, but do you have any idea why the memory reduction would be less? Hmm, probably there are some other non-bitcode files that are being read for symbol resolution but not being used further in index only mode?

It was my negligence. --start-lib bitcode files may not be extracted. They reside in the lazyBitcodeFiles list. The latest diff marked them as well.
I think this is sufficient for our use case. The diff tried a bit more to support other use cases.

lld/ELF/Driver.cpp
1993

typo: passes

tejohnson accepted this revision.Dec 29 2021, 2:21 PM

LGTM with a couple of comment suggestions below.

I tried a few times and didn't see a slowdown.

However, with the latest version to only free the bitcode files, I am seeing less memory reduction (9.3G vs 9.1G before), and also a 30-40s slowdown (maybe the DenseSet lookups?). Something could probably be done to avoid the lookups, but do you have any idea why the memory reduction would be less? Hmm, probably there are some other non-bitcode files that are being read for symbol resolution but not being used further in index only mode?

It was my negligence. --start-lib bitcode files may not be extracted. They reside in the lazyBitcodeFiles list. The latest diff marked them as well.
I think this is sufficient for our use case. The diff tried a bit more to support other use cases.

Confirmed that the memory savings recovered, and the time reduced too, but in my case skipLinkedOutput is true so it is not doing the dense set lookups at all.

lld/ELF/Driver.cpp
2396

Suggest making this something like "Index file creation is performed in compileBitcodeFiles, so we are done afterwards in that case." Or move this sentence below where we take the early return.

2399

"When only certain thinLTO modules are specified for compilation." isn't a complete sentence. Maybe, "This is also true when only ..." ?

This revision is now accepted and ready to land.Dec 29 2021, 2:21 PM
MaskRay edited the summary of this revision. (Show Details)Dec 29 2021, 2:32 PM
MaskRay marked 2 inline comments as done.Dec 29 2021, 4:23 PM
MaskRay retitled this revision from [ELF][LTO] Call madvise(MADV_DONTNEED) on BitcodeFile buffers to [ELF][LTO] Call madvise(MADV_DONTNEED) on MemoryBuffer instances.Dec 30 2021, 11:28 AM
This revision was landed with ongoing or failed builds.Dec 30 2021, 11:37 AM
This revision was automatically updated to reflect the committed changes.