This is an archive of the discontinued LLVM Phabricator instance.

[CMake] Add clang-bolt target
Closed · Public

Authored by Amir on Aug 30 2022, 2:23 PM.

Details

Summary

This patch adds the CLANG_BOLT_INSTRUMENT option, which applies BOLT instrumentation
to Clang, performs a bootstrap build with the resulting Clang, merges the resulting
fdata files into a single profile file, and uses it to perform BOLT optimization
on the original Clang binary.
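
The steps in this summary can be sketched as a dry-run Python script that only assembles the command lines. This is a minimal sketch, not the CMake logic from the patch: the helper names and file names are hypothetical, and the llvm-bolt and merge-fdata options shown are assumptions that should be checked against the installed BOLT version.

```python
from __future__ import annotations

from pathlib import Path


def instrument_cmd(clang: Path, profile_prefix: Path) -> list[str]:
    """Step 1: write an instrumented copy of clang that emits fdata files."""
    return [
        "llvm-bolt", str(clang),
        "-o", f"{clang}-bolt.inst",           # hypothetical output name
        "-instrument",
        "--instrumentation-file-append-pid",  # one profile file per process
        f"--instrumentation-file={profile_prefix}",
    ]


# Step 2 (not modelled here): configure and build the profiling project
# with the instrumented clang as the C and C++ compiler.


def merge_cmd(fdata_files: list[str], merged: Path) -> list[str]:
    """Step 3: merge the per-process profiles into a single file."""
    return ["merge-fdata", "-o", str(merged), *fdata_files]


def optimize_cmd(clang: Path, merged: Path) -> list[str]:
    """Step 4: optimize the original (uninstrumented) binary with the profile."""
    return [
        "llvm-bolt", str(clang),
        "-o", f"{clang}-bolt",                # hypothetical output name
        f"-data={merged}",
        "-reorder-blocks=ext-tsp",            # assumed optimization options
        "-reorder-functions=hfsort+",
        "-split-functions",
    ]


if __name__ == "__main__":
    clang, profile = Path("clang"), Path("prof.fdata")
    for cmd in (
        instrument_cmd(clang, profile),
        merge_cmd(["prof.fdata.1.fdata", "prof.fdata.2.fdata"], profile),
        optimize_cmd(clang, profile),
    ):
        print(" ".join(cmd))
```

Keeping the commands as argument lists rather than shell strings means they could be passed directly to subprocess without any quoting concerns.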

The projects and targets used for bootstrap/profile collection are configurable via
CLANG_BOLT_INSTRUMENT_PROJECTS and CLANG_BOLT_INSTRUMENT_TARGETS.
The defaults are "llvm" and "count" respectively, which results in a profile with
~5.3B dynamically executed instructions.

The intended use of the functionality is through the BOLT CMake cache file, similar
to the PGO 2-stage build:

cmake <llvm-project>/llvm -C <llvm-project>/clang/cmake/caches/BOLT.cmake
ninja clang++-bolt # pulls clang-bolt

Stats with a recent checkout (clang-16), pre-built BOLT and Clang, 72vCPU/224G

CMake configure with host Clang + BOLT.cmake    1m6.592s
Instrumenting Clang with BOLT                   2m50.508s
CMake configure llvm with instrumented Clang    5m46.364s (~5x slowdown)
CMake build "not" with instrumented Clang       0m6.456s
Merging fdata files                             0m9.439s
Optimizing Clang with BOLT                      0m39.201s

Building Clang:

cmake ../llvm-project/llvm -DCMAKE_C_COMPILER=... -DCMAKE_CXX_COMPILER=... \
  -DCMAKE_BUILD_TYPE=Release -DLLVM_ENABLE_PROJECTS=clang \
  -DLLVM_TARGETS_TO_BUILD=Native -GNinja

               Release      BOLT-optimized
cmake          0m24.016s    0m22.333s
ninja clang    5m55.692s    4m35.122s

I know it's not rigorous, but it shows a ballpark figure.
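
To put the ballpark in relative terms, the Release and BOLT-optimized timings above can be converted to seconds and compared. This is a small sketch added for illustration; `to_seconds` and `saving_percent` are hypothetical helpers, and the inputs are single-run timings from the summary, not a benchmark.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time`-style stamp such as '5m55.692s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time stamp: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def saving_percent(baseline: str, optimized: str) -> float:
    """Wall-clock time saved relative to the baseline, in percent."""
    base = to_seconds(baseline)
    return (base - to_seconds(optimized)) / base * 100


# Timings copied from the summary: (Release, BOLT-optimized).
timings = {
    "cmake": ("0m24.016s", "0m22.333s"),
    "ninja clang": ("5m55.692s", "4m35.122s"),
}

for step, (release, bolt) in timings.items():
    print(f"{step}: {saving_percent(release, bolt):.1f}% less wall-clock time")
```

For these numbers the `ninja clang` step comes out at roughly 23% less wall-clock time and the `cmake` step at roughly 7%.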

Event Timeline

Amir created this revision. Aug 30 2022, 2:23 PM
Herald added a project: Restricted Project. Aug 30 2022, 2:23 PM
Amir requested review of this revision. Aug 30 2022, 2:23 PM
Herald added a project: Restricted Project. Aug 30 2022, 2:23 PM
Herald added a subscriber: cfe-commits.
Amir retitled this revision from [clang][BOLT] Add clangbolt target (WIP) to [clang][BOLT] Add clang-bolt target (WIP). Aug 30 2022, 2:24 PM
Amir added reviewers: beanz, MaskRay.
Amir updated this revision to Diff 456799. Aug 30 2022, 2:33 PM

CMAKE_CURRENT_BINARY_DIR already contains bin/

Amir updated this revision to Diff 457102. Aug 31 2022, 3:04 PM

Succeeded instrumenting Clang with BOLT

Amir updated this revision to Diff 457172. Aug 31 2022, 9:54 PM

Successfully invoke the bootstrap/profiling build

This was already on my list of build system features I'd like to implement and I'm glad someone else is already looking into it, thank you! I have two high level comments about your approach.

The first one is related to the use of Clang build as the training data. I think that Clang build is both unnecessarily heavyweight, but also not particularly representative of typical workloads (most Clang users don't use it to build Clang). Ideally, we would give vendors the flexibility to supply their own training data. I'd prefer reusing the existing perf-training setup to do so. In fact, I'd imagine most vendors would likely use the same training data for both PGO and BOLT and that use case should be supported.

The second one is related to applicability. I don't think this mechanism should be limited only to Clang. Ideally, it should be possible to instrument and optimize other tools in the toolchain distribution as well; LLD is likely going to be the most common one after Clang.

Amir updated this revision to Diff 457400. Sep 1 2022, 2:18 PM

Succeeded in producing optimized Clang. Switch the default profiling target
from lld to count, which produces a sufficient Clang coverage of 5.3B exec
insns (along with configure-stage Clang invocations).

Amir retitled this revision from [clang][BOLT] Add clang-bolt target (WIP) to [clang][BOLT] Add clang-bolt target. Sep 1 2022, 2:21 PM
Amir edited the summary of this revision.
Amir added a comment (edited). Sep 1 2022, 3:49 PM

Hi Petr, thank you for your comments!

This was already on my list of build system features I'd like to implement and I'm glad someone else is already looking into it, thank you! I have two high level comments about your approach.

The first one is related to the use of Clang build as the training data. I think that Clang build is both unnecessarily heavyweight, but also not particularly representative of typical workloads (most Clang users don't use it to build Clang). Ideally, we would give vendors the flexibility to supply their own training data. I'd prefer reusing the existing perf-training setup to do so. In fact, I'd imagine most vendors would likely use the same training data for both PGO and BOLT and that use case should be supported.

Agree that perf-training might be useful for vendors. I'll try to enable it in a follow-up diff.

Please note that the target for profile collection is not hardcoded to clang, it's configurable via CLANG_BOLT_INSTRUMENT_PROJECTS and CLANG_BOLT_INSTRUMENT_TARGETS. Right now it's the llvm/not tool (the smallest possible).

The second one is related to applicability. I don't think this mechanism should be limited only to Clang. Ideally, it should be possible to instrument and optimize other tools in the toolchain distribution as well; LLD is likely going to be the most common one after Clang.

I thought about it, and I think we can accommodate optimizing arbitrary targets by providing an interface to instrument specified target(s) via -DBOLT_INSTRUMENT_TARGETS. For each of the target binaries, CMake would create targets like bolt-instrument-$TARGET and bolt-optimize-$TARGET.
For bolt-instrument-$TARGET, BOLT would instrument the target binary, placing instrumented binary next to the original one (e.g. $target-bolt.inst). End users would use those instrumented binaries on representative workloads to collect the profile. For bolt-optimize-$TARGET, BOLT would post-process the profiles and create optimized binary ($target-bolt).

I appreciate your suggestions. Do you think we can move incrementally from this diff towards more general uses in follow-up diffs?
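
The naming scheme proposed in this comment can be written down as a small helper. This is only a sketch of the proposal, which was not part of this diff; `bolt_names` is a hypothetical function, and the names follow the patterns quoted above.

```python
from __future__ import annotations


def bolt_names(target: str) -> dict[str, str]:
    """Map a binary target to the build targets and artifacts proposed for it."""
    return {
        "instrument_target": f"bolt-instrument-{target}",
        "optimize_target": f"bolt-optimize-{target}",
        "instrumented_binary": f"{target}-bolt.inst",
        "optimized_binary": f"{target}-bolt",
    }


# Example: the two tools mentioned in the discussion.
for tool in ("clang", "lld"):
    print(tool, bolt_names(tool))
```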

Amir edited the summary of this revision. Sep 1 2022, 5:41 PM
Amir updated this revision to Diff 457467. Sep 1 2022, 5:49 PM

Fix up paths

Amir edited the summary of this revision. Sep 1 2022, 6:11 PM
srhines added a subscriber: srhines. Sep 1 2022, 9:42 PM
n-omer added a subscriber: n-omer. Sep 2 2022, 3:15 AM

That's fine with me. Do you envision replacing the use of LLVM build for training with perf-training or supporting both? I'd lean towards the former for simplicity but would be curious to hear about your use cases and plans.

clang/CMakeLists.txt
884

We could consider moving this block to a separate file that would then be included here, since this file is already getting pretty large and the logic in this block is self-contained. That could be done in a follow-up change though.

958

I'd like to avoid dependency on shell to make this compatible with Windows. Can we move this logic into a Python script akin to https://github.com/llvm/llvm-project/blob/607f14d9605da801034e7119c297c3f58ebce603/clang/utils/perf-training/perf-helper.py?
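
A shell-free merge step along the lines suggested here might look like the following. It is a minimal sketch rather than the code that landed in perf-helper.py: the function names are hypothetical, the profiles are assumed to end in `.fdata`, and the `-o` option of merge-fdata is an assumption to verify against the installed tool.

```python
from __future__ import annotations

import subprocess
import sys
from pathlib import Path


def find_fdata(directory: Path) -> list[str]:
    """Collect profile files under `directory`, sorted for reproducible commands."""
    return sorted(str(p) for p in directory.rglob("*.fdata") if p.is_file())


def build_merge_cmd(merge_tool: str, output: str, directory: str) -> list[str]:
    """Assemble the merge command in Python instead of relying on shell globbing."""
    return [merge_tool, "-o", output, *find_fdata(Path(directory))]


if __name__ == "__main__" and len(sys.argv) == 4:
    # Hypothetical usage: merge_fdata.py <merge-fdata> <output> <profile-dir>
    subprocess.check_call(build_merge_cmd(*sys.argv[1:]))
```

Because the file list is expanded by Python and passed as separate arguments, the same invocation works on Windows, where no POSIX shell is available to expand wildcards. The output file should live outside the profile directory (or use a different extension) so that a rerun does not pick it up as an input.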

phosek added inline comments. Sep 2 2022, 11:58 AM
clang/CMakeLists.txt
933–940

I don't think this is sufficient in the general case; we would need to pass additional variables like CMAKE_AR the same way we do for the existing bootstrap logic, see https://github.com/llvm/llvm-project/blob/dc549bf0013e11e8fcccba8a8d59c3a4bb052a3b/clang/CMakeLists.txt#L825.

For example, on Fuchsia builders we don't have any system-wide toolchain installation, instead we manually set all necessary CMAKE_<TOOL> variables for the first stage, so this call will fail for us because it won't be able to find tools like the archiver.

Since handling this properly would likely duplicate much of the existing bootstrap logic, I'm wondering if we should instead try to refactor that logic and break it up into macros/functions which could then be reused here as well.

Will there be eventually a way to build a fully optimised clang/lld with ThinLTO, PGO, and Bolt?

Amir added a comment. Sep 3 2022, 11:03 PM

Will there be eventually a way to build a fully optimised clang/lld with ThinLTO, PGO, and Bolt?

Short answer is likely yes.
For clang, I think this diff should be compatible with PGO, with a caveat that BOLT should be applied to stage-2 clang built with PGO, which means that BOOTSTRAP_ options should be set carefully. And for sure it's compatible with ThinLTO - this one is completely orthogonal.
For lld, I can envision a similar fully automated optimized build, but likely in a future separate diff.

Amir retitled this revision from [clang][BOLT] Add clang-bolt target to [CMake] Add clang-bolt target. Sep 6 2022, 3:39 PM
MaskRay added inline comments. Sep 8 2022, 10:29 PM
clang/CMakeLists.txt
933–940

Supporting other cmake variables will be awesome. I use something like -DCMAKE_CXX_ARCHIVE_CREATE="$HOME/llvm/out/stable/bin/llvm-ar qcS --thin <TARGET> <OBJECTS>" -DCMAKE_CXX_ARCHIVE_FINISH=: to make my build smaller.

Amir updated this revision to Diff 459539. Sep 12 2022, 1:08 PM

Address @phosek's comment about dependency on shell

Amir marked an inline comment as done. Sep 12 2022, 1:08 PM
Amir marked an inline comment as not done.
Amir added inline comments.
clang/CMakeLists.txt
933–940

Addressed in D133633

Amir updated this revision to Diff 459588. Sep 12 2022, 4:28 PM

Add an ability to pass extra cmake flags

Amir marked an inline comment as done. Sep 12 2022, 4:32 PM
Amir added inline comments.
clang/CMakeLists.txt
933–940

Done: -DCLANG_BOLT_INSTRUMENT_EXTRA_CMAKE_FLAGS is passed to the cmake step of bolt-instrumentation-profile target.
I tested it with

-DCLANG_BOLT_INSTRUMENT_EXTRA_CMAKE_FLAGS='-DCMAKE_CXX_ARCHIVE_CREATE="<path/to/llvm/bin>/llvm-ar qcS --thin <TARGET> <OBJECTS>" -DCMAKE_CXX_ARCHIVE_FINISH=:'

and that appeared to work.
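
The quoting in the flag string above matters because the archive command itself contains spaces. The sketch below uses Python's shlex purely to illustrate how such a string separates into two -D arguments under POSIX shell-style quoting rules; it is not a claim about how CMake splits CLANG_BOLT_INSTRUMENT_EXTRA_CMAKE_FLAGS internally.

```python
import shlex

# The value tested in the comment above, with its placeholders kept as-is.
extra_flags = (
    '-DCMAKE_CXX_ARCHIVE_CREATE="<path/to/llvm/bin>/llvm-ar qcS --thin '
    '<TARGET> <OBJECTS>" -DCMAKE_CXX_ARCHIVE_FINISH=:'
)

# Shell-style splitting keeps the quoted archive command in one argument.
args = shlex.split(extra_flags)
for arg in args:
    print(arg)
```

Without the inner double quotes, the same string would split into six separate words, and only the first would carry the -DCMAKE_CXX_ARCHIVE_CREATE= prefix.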

phosek accepted this revision. Sep 21 2022, 1:19 AM

LGTM

This revision is now accepted and ready to land. Sep 21 2022, 1:19 AM
This revision was automatically updated to reflect the committed changes.
Amir added a comment. Oct 15 2022, 11:08 PM

This was already on my list of build system features I'd like to implement and I'm glad someone else is already looking into it, thank you! I have two high level comments about your approach.

The first one is related to the use of Clang build as the training data. I think that Clang build is both unnecessarily heavyweight, but also not particularly representative of typical workloads (most Clang users don't use it to build Clang). Ideally, we would give vendors the flexibility to supply their own training data. I'd prefer reusing the existing perf-training setup to do so. In fact, I'd imagine most vendors would likely use the same training data for both PGO and BOLT and that use case should be supported.

Do you happen to know any existing perf-training sets? Or is there a simple way to create one?

I'm working on a script for generating perf-training sets from Ninja-based build systems, I can contribute it to LLVM if you think it'd be useful.

Amir added a comment. Oct 16 2022, 8:33 PM

Yes, that would be super useful. BOLT should then also leverage that.