This is an archive of the discontinued LLVM Phabricator instance.

[Support] Introduce the BLAKE3 hashing function implementation
ClosedPublic

Authored by akyrtzi on Mar 11 2022, 9:20 PM.

Details

Summary

The C implementation originates from https://github.com/BLAKE3-team/BLAKE3/tree/1.3.1/c
License is at https://github.com/BLAKE3-team/BLAKE3/blob/1.3.1/LICENSE

This patch adds:

  • llvm/include/llvm-c/blake3.h: The BLAKE3 C API with certain changes (details below)
  • llvm/include/llvm/Support/BLAKE3.h: C++ wrapper of the C API
  • llvm/lib/Support/BLAKE3: Directory containing the BLAKE3 C implementation files, including the LICENSE file
  • llvm/unittests/Support/BLAKE3Test.cpp: unit tests for the BLAKE3 C++ wrapper

Changes from original BLAKE3 sources:

  • README.md: added a note about where the sources originated from
  • blake.h:
    • Changes to avoid conflicts if a client also links with its own BLAKE3 version:
      • Renamed the header macro guard with LLVM_C_ prefix
      • Renamed the C symbols to add the llvm_ prefix
    • Added a top header comment that references the CC0 license
  • blake3_impl.h: Added #defines to remove some of llvm_ prefixes for the rest of the internal implementation.
  • Implementation files:
    • Used llvm_ prefix for the C public API functions
    • Used LLVM_LIBRARY_VISIBILITY for internal implementation functions
    • Added .private_extern/.hidden in assembly files to reduce visibility of the internal implementation functions

And here's some timings comparing BLAKE3 with LLVM's SHA1/SHA256/MD5.
The table shows the speed-up multiplier of BLAKE3 for hashing 100 MBs:

ProcessorSHA1SHA256MD5
:------------::----::------::----:
Intel Xeon W6.5x17x5.9x
M1Pro2.1x4.7x2.8x

Diff Detail

Event Timeline

akyrtzi created this revision.Mar 11 2022, 9:20 PM
akyrtzi requested review of this revision.Mar 11 2022, 9:20 PM
Herald added a project: Restricted Project. · View Herald Transcript
akyrtzi updated this revision to Diff 414803.Mar 11 2022, 9:52 PM

Fix windows build, use the right filename extensions in CMakeLists.txt

@compnerd the windows bot says

[110/8000] Building ASM object lib\Support\CMakeFiles\LLVMSupport.dir\BLAKE3\blake3_avx512_x86-64_windows_msvc.asm.obj
Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30136 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

cl : Command line warning D9035 : option 'o' has been deprecated and will be removed in a future release
cl : Command line warning D9024 : unrecognized source file type 'C:\ws\w32-1\llvm-project\premerge-checks\llvm\lib\Support\BLAKE3\blake3_avx512_x86-64_windows_msvc.asm', object file assumed
cl : Command line warning D9027 : source file 'C:\ws\w32-1\llvm-project\premerge-checks\llvm\lib\Support\BLAKE3\blake3_avx512_x86-64_windows_msvc.asm' ignored
cl : Command line warning D9021 : no action performed

Do you have a hint what is going wrong, is the .asm file extension not correct for passing them to the compiler?

The patch's description what this does, but now why it does what it does.

MaskRay added inline comments.
llvm/include/llvm/Support/BLAKE3.h
79

Otherwise we can't use BLAKE3::hash<16> to replace md5.

MaskRay added a comment.EditedMar 12 2022, 11:41 AM

I am in favor of adding BLAKE3.

The patch's description what this does, but now why it does what it does.

I can provide one rationale: speed up ld.lld --build-id: D121531 :)

BLAKE3 can replace several places llvm-project uses llvm::SHA1, e.g. Clang PCH serialization, bolt.
It's possible that we can just drop llvm/Support/SHA1.h after these call sites migrate away.

@compnerd the windows bot says

[110/8000] Building ASM object lib\Support\CMakeFiles\LLVMSupport.dir\BLAKE3\blake3_avx512_x86-64_windows_msvc.asm.obj
Microsoft (R) C/C++ Optimizing Compiler Version 19.29.30136 for x64
Copyright (C) Microsoft Corporation.  All rights reserved.

cl : Command line warning D9035 : option 'o' has been deprecated and will be removed in a future release
cl : Command line warning D9024 : unrecognized source file type 'C:\ws\w32-1\llvm-project\premerge-checks\llvm\lib\Support\BLAKE3\blake3_avx512_x86-64_windows_msvc.asm', object file assumed
cl : Command line warning D9027 : source file 'C:\ws\w32-1\llvm-project\premerge-checks\llvm\lib\Support\BLAKE3\blake3_avx512_x86-64_windows_msvc.asm' ignored
cl : Command line warning D9021 : no action performed

Do you have a hint what is going wrong, is the .asm file extension not correct for passing them to the compiler?

I think that you would need to do an enable_language(ASM) to have that be built properly, I don't think that we currently have any assembly routines in LLVM/clang and so that language has never been enabled.

akyrtzi updated this revision to Diff 414903.Mar 12 2022, 9:34 PM

Fix compilation of BLAKE3::hash<T>() for a non-default hash size.

akyrtzi marked an inline comment as done.Mar 12 2022, 9:37 PM
akyrtzi added inline comments.
llvm/include/llvm/Support/BLAKE3.h
79

Thank you for catching this! 🙇🏻‍♂️

akyrtzi updated this revision to Diff 414905.Mar 12 2022, 10:18 PM
akyrtzi marked an inline comment as done.

Use enable_language(ASM_MASM) for MSVC

MaskRay added a subscriber: joerg.Mar 12 2022, 10:46 PM

I think that you would need to do an enable_language(ASM) to have that be built properly, I don't think that we currently have any assembly routines in LLVM/clang and so that language has never been enabled.

Thank you! 🙇🏻‍♂️ enable_language(ASM_MASM) got the windows build working.

MaskRay accepted this revision.Mar 14 2022, 11:01 PM

LGTM.

And here's some timings comparing BLAKE3 with LLVM's SHA1/SHA256/MD5. The table shows the speed-up multiplier of BLAKE3 for hashing 100 MBs:

It will be worth mentioning which optimized implementation is used for Intel Xeon W and M1Pro, respectively.

The performance of using the generic implementation is also interesting. None of LLVM's SHA1/SHA256/MD5 libraries is well optimized.

This revision is now accepted and ready to land.Mar 14 2022, 11:01 PM
rnk accepted this revision.Mar 15 2022, 10:32 AM

Thank you for taking this on, as mentioned previously, we use SHA1 for PDB type merging, and this will speed that up.

What do you think about landing this as two patches, one which is a pristine source import, and the second which applies all the llvm_ prefixing changes, which I agree are necessary? This may make it easier to merge in future updates, should there ever be any.

It will be worth mentioning which optimized implementation is used for Intel Xeon W and M1Pro, respectively.

The performance of using the generic implementation is also interesting. None of LLVM's SHA1/SHA256/MD5 libraries is well optimized.

Due to an issue with detecting AVX512 on macOS the numbers for Intel I posted initially were for AVX2. Here are new speed-up multiplies that include AVX512 and the generic/portable implementations:

ProcessorSHA1SHA256MD5
Intel Xeon W (AVX512)10.4x27x9.4x
Intel Xeon W (AVX2)6.5x17x5.9x
Intel Xeon W (portable)1.3x3.3x1.1x
M1Pro (neon)2.1x4.7x2.8x
M1Pro (portable)1.1x2.4x1.5x

What do you think about landing this as two patches, one which is a pristine source import, and the second which applies all the llvm_ prefixing changes, which I agree are necessary? This may make it easier to merge in future updates, should there ever be any.

Seems like a good idea to me, I'll change this patch to make it pristine.

akyrtzi updated this revision to Diff 415630.Mar 15 2022, 4:29 PM

Keep the BLAKE3 source pristine for the initial patch, and hold the LLVM-specific changes for a follow-up.

MaskRay accepted this revision.Mar 21 2022, 10:22 AM

This patch has waited for an extended period of time. Perhaps proceed with it?

This patch has waited for an extended period of time. Perhaps proceed with it?

I'm waiting for a confirmation by @lattner (or someone from LLVM Foundation?) that it is ok to commit the BLAKE3 sources in llvm repo.
Unless someone else has another suggestion? I'm not sure what is the common procedure for bringing third-party sources in.

lattner requested changes to this revision.Mar 23 2022, 11:57 AM

thank you for working on integrating this! Please check out how the regex stuff got integrated, I think there are learnings in it that aren't obvious (we should also document this somewhere).

llvm/include/llvm-c/blake3.h
2

Please add the correct license boilerplate to this file. Plz point to the license file in the lib directory.

42

How is this going to compose with apps that use LLVM and also use blake3 for their own things? can we really take these global symbol names? You'll note that in regex_impl.h we had to prefix the C symbols, I expect that we'll have the same challenges here.

llvm/lib/Support/BLAKE3/README.md
2

I'd recommend dropping this file or significantly changing it, this isn't showing the preferred LLVM way to use this.

llvm/lib/Support/BLAKE3/blake3.c
3

Same

llvm/lib/Support/BLAKE3/blake3.h
1 ↗(On Diff #415630)

Needs a file header, point to the license file etc.

llvm/lib/Support/BLAKE3/blake3_sse41_x86-64_windows_gnu.S
2

Are these files generated from something? Should we integrate the generator into the build instead of checking in massive .s files?

This revision now requires changes to proceed.Mar 23 2022, 11:57 AM
akyrtzi added inline comments.Mar 23 2022, 1:26 PM
llvm/include/llvm-c/blake3.h
42

The original version of the patch had prefixes to avoid conflicts with clients having their own versions. The list of the changes in the original version are in the description, for reference:

blake.h:

  • Changes to avoid conflicts if a client also links with its own BLAKE3 version:
    • Renamed the header macro guard with LLVM_C_ prefix
    • Renamed the C symbols to add the llvm_ prefix
    • Added a top header comment that references the CC0 license

blake3_impl.h: Added #defines to remove some of llvm_ prefixes for the rest of the internal implementation.
Implementation files:

  • Used llvm_ prefix for the C public API functions
  • Used LLVM_LIBRARY_VISIBILITY for internal implementation functions
  • Added .private_extern/.hidden in assembly files to reduce visibility of the internal implementation functions

During review @rnk suggested to land this with two patches:

What do you think about landing this as two patches, one which is a pristine source import, and the second which applies all the llvm_ prefixing changes, which I agree are necessary? This may make it easier to merge in future updates, should there ever be any.

Per the suggestion I modified the patch to keep the originating BLAKE3 sources pristine, with the intention to have a follow-up patch with LLVM-specific changes.

Does this make sense? Should I go back to the patch that embeds LLVM-specific changes, or do you have another suggestion?

llvm/lib/Support/BLAKE3/blake3_sse41_x86-64_windows_gnu.S
2

There is no generator, AFAICT.

dexonsmith added inline comments.Mar 23 2022, 3:22 PM
llvm/include/llvm-c/blake3.h
42

I suggest:

  • For this (the Phabricator review), squash into a single commit as it was before.
  • Locally, keep it as two commits:
    1. git add for all the C implementation / README in the right place.
    2. Everything else (delete README, add license headers, add C++ wrapper, fix namespacing, etc.).
  • Push both commits at once.

That makes it clear about what is being LGTM'ed here, while also providing the update feature @rnk was looking for.

rnk added inline comments.Mar 23 2022, 3:43 PM
llvm/include/llvm-c/blake3.h
42

+1, thanks

+1, thank you again!

akyrtzi updated this revision to Diff 417803.Mar 23 2022, 6:33 PM

Introduce the LLVM-specific changes over the original BLAKE3 sources, to show the final result for reviewing.
When doing the actual committing there will be 2 separate commits (one with pristine BLAKE3 sources, another with LLVM-specific changes).

akyrtzi updated this revision to Diff 417813.Mar 23 2022, 7:10 PM
akyrtzi marked 7 inline comments as done.

Add file header for blake3.c

@lattner thank you for your feedback! Could you take another look with the latest changes? 🙏

lattner accepted this revision.Mar 24 2022, 9:49 AM

Super awesome, thank you for iterating on this!

This revision is now accepted and ready to land.Mar 24 2022, 9:49 AM
This revision was landed with ongoing or failed builds.Mar 24 2022, 10:26 AM
This revision was automatically updated to reflect the committed changes.
thakis added a subscriber: thakis.Mar 24 2022, 11:47 AM
thakis added inline comments.
llvm/lib/Support/BLAKE3/CMakeLists.txt
48

Out of interest, why is there a dedicated CMakeLists.txt file that does nothing but set a variable in the cmake file one level further up? It'd be more usual to either have this in the CMakeLists.txt file on further up, or make this declare an object library in this CMakeLists.txt file here.

thakis added inline comments.Mar 24 2022, 12:42 PM
llvm/lib/Support/BLAKE3/CMakeLists.txt
48

https://reviews.llvm.org/D122428 for your consideration.

dexonsmith added inline comments.Mar 30 2022, 6:42 PM
llvm/include/llvm/Support/BLAKE3.h
17

Might be nice to bury this #include in the source file to avoid polluting the global namespace with the C stuff. Inlining would still be available for release builds (assuming they use ThinLTO).

67–71

Other hashers (SHA1/MD5/SHA256) return a StringRef from the final() function. Generic APIs (such as llvm/Support/HashBuilder.h) rely on this being the same everywhere.

Is it reasonable to update this to return a StringRef, or can/should we change the other things to return a std::array to match this?

(If returning a StringRef here, we could add a std::array<uint8_t, 32> Hash field to BLAKE3, fill it using blake3_hasher_finalize(), and then call toStringRef(makeArrayRef(Hash)).take_front(NumBytes).)

akyrtzi added inline comments.Mar 30 2022, 9:15 PM
llvm/include/llvm/Support/BLAKE3.h
17

Including the header is useful to avoid needing to allocate the C structures on the heap.

67–71

I'll take a look to see if it would work well to change all the hashers to std::array, otherwise I think it'd be reasonable to add the field so we can return StringRef here.

akyrtzi added inline comments.Apr 4 2022, 5:19 PM
llvm/include/llvm/Support/BLAKE3.h
67–71

or can/should we change the other things to return a std::array to match this?

This was an excellent idea, I think the APIs are more ergonomic like this, see https://reviews.llvm.org/D123100

dexonsmith added inline comments.Apr 4 2022, 5:55 PM
llvm/include/llvm/Support/BLAKE3.h
17

Good point.

Could separate it out (put a big enough aligned char buffer in the header), but maybe not worth it.

Also might be worth considering making the stack allocation optional. It's clocking in at 1912B, which is pretty big. Some users might want it on the heap.

That'd result in something like:

// Heap by default.
class BLAKE3 {
public:
  BLAKE3() = default;

protected:
  struct StorageT { alignas(alignof(uint64_t)) char Impl[1912]; };
  BLAKE3(StorageT &Storage); // for SmallBLAKE3.

private:
  // Allocate and init Hasher/OwnedHasher on first use.
  llvm_blake3_hasher &getOrCreate();

  llvm_blake3_hasher *Hasher = nullptr;
  std::unique_ptr<StorageT> OwnedHasher;
};

// Stack.
class SmallBLAKE3 : public BLAKE3 {
public:
  SmallBLAKE3() : BLAKE3(Storage) {}

private:
  StorageT Storage;
};

// BLAKE3.cpp
#include <llvm-c/blake3.h>
BLAKE3::BLAKE3(StorageT &Storage) {
  static_assert(sizeof(Storage.Impl) >= sizeof(llvm_blake3_hasher));
  static_assert(alignof(Storage.Impl) >= alignof(llvm_blake3_hasher));
  Hasher = reinterpret_cast<llvm_blake3_hasher *>(Storage.Impl);
  llvm_blake3_init(Hasher);
}

But we can do one/both of those later if it's important.

akyrtzi added inline comments.Apr 4 2022, 6:12 PM
llvm/include/llvm/Support/BLAKE3.h
17

Also might be worth considering making the stack allocation optional. It's clocking in at 1912B, which is pretty big. Some users might want it on the heap.

If you want it on the heap you can do new BLAKE3(). I think it's better to have it on the stack by default and keep it simple with just one class that you need to be aware of, and, if necessary, put it on heap explicitly.

nikic added a subscriber: nikic.Apr 13 2022, 7:01 AM

It looks like https://github.com/llvm/llvm-project/commit/5426da8ffa4a6d55adab21026ce6ebe8f1cc6ef2 breaks building LLVM with LLD 13 + ThinLTO in some way I don't quite understand yet. The involved cmake line is:

"cmake" "/checkout/src/llvm-project/llvm" "-DLLVM_ENABLE_ASSERTIONS=ON" "-DLLVM_ENABLE_PLUGINS=OFF" "-DLLVM_TARGETS_TO_BUILD=AArch64;ARM;BPF;Hexagon;MSP430;Mips;NVPTX;PowerPC;RISCV;Sparc;SystemZ;WebAssembly;X86" "-DLLVM_EXPERIMENTAL_TARGETS_TO_BUILD=AVR;M68k" "-DLLVM_INCLUDE_EXAMPLES=OFF" "-DLLVM_INCLUDE_DOCS=OFF" "-DLLVM_INCLUDE_BENCHMARKS=OFF" "-DLLVM_INCLUDE_TESTS=OFF" "-DLLVM_ENABLE_TERMINFO=OFF" "-DLLVM_ENABLE_LIBEDIT=OFF" "-DLLVM_ENABLE_BINDINGS=OFF" "-DLLVM_ENABLE_Z3_SOLVER=OFF" "-DLLVM_PARALLEL_COMPILE_JOBS=32" "-DLLVM_TARGET_ARCH=x86_64" "-DLLVM_DEFAULT_TARGET_TRIPLE=x86_64-unknown-linux-gnu" "-DLLVM_INSTALL_UTILS=ON" "-DLLVM_BUILD_INSTRUMENTED=IR" "-DLLVM_BUILD_RUNTIME=No" "-DLLVM_ENABLE_ZLIB=ON" "-DLLVM_ENABLE_LTO=Thin" "-DLLVM_ENABLE_LLD=ON" "-DLLVM_LINK_LLVM_DYLIB=ON" "-DLLVM_ENABLE_LIBXML2=OFF" "-DLLVM_VERSION_SUFFIX=-rust-1.62.0-nightly" "-DCMAKE_INSTALL_MESSAGE=LAZY" "-DCMAKE_C_COMPILER_LAUNCHER=sccache" "-DCMAKE_CXX_COMPILER_LAUNCHER=sccache" "-DCMAKE_C_COMPILER=clang" "-DCMAKE_CXX_COMPILER=clang++" "-DCMAKE_ASM_COMPILER=clang" "-DCMAKE_C_FLAGS=-ffunction-sections -fdata-sections -fPIC --target=x86_64-unknown-linux-gnu" "-DCMAKE_CXX_FLAGS=-ffunction-sections -fdata-sections -fPIC --target=x86_64-unknown-linux-gnu" "-DCMAKE_AR=/rustroot/bin/llvm-ar" "-DCMAKE_SHARED_LINKER_FLAGS= -Wl,-Bsymbolic -static-libstdc++" "-DCMAKE_MODULE_LINKER_FLAGS= -Wl,-Bsymbolic -static-libstdc++" "-DCMAKE_EXE_LINKER_FLAGS= -Wl,-Bsymbolic -static-libstdc++" "-DCMAKE_INSTALL_PREFIX=/checkout/obj/build/x86_64-unknown-linux-gnu/llvm" "-DCMAKE_ASM_FLAGS= -ffunction-sections -fdata-sections -fPIC --target=x86_64-unknown-linux-gnu" "-DCMAKE_BUILD_TYPE=Release"

And then there are lots and lots of errors about undefined symbols from libLLVMSupport.a:

2022-04-07T16:05:15.3430029Z ld.lld: error: undefined symbol: llvm::StringMapImpl::LookupBucketFor(llvm::StringRef)
2022-04-07T16:05:15.3430575Z >>> referenced by llvm-config.cpp
2022-04-07T16:05:15.3431724Z >>>               /checkout/obj/build/x86_64-unknown-linux-gnu/llvm/build/lto.cache/Thin-b6b9e0.tmp.o:(std::pair<llvm::StringMapIterator<AvailableComponent*>, bool> llvm::StringMap<AvailableComponent*, llvm::MallocAllocator>::try_emplace<>(llvm::StringRef))
2022-04-07T16:05:15.3432297Z 
2022-04-07T16:05:15.3432688Z ld.lld: error: too many errors emitted, stopping now (use -error-limit=0 to see all errors)
2022-04-07T16:05:15.3459873Z clang-13: error: linker command failed with exit code 1 (use -v to see invocation)
2022-04-07T16:05:15.3475933Z make[2]: *** [bin/llvm-config] Error 1

It kind of seems like symbols from the bitcode files in the archive are no longer found after additional non-bitcode objects were added.

Anyone have an idea what the issue here is?

It looks like https://github.com/llvm/llvm-project/commit/5426da8ffa4a6d55adab21026ce6ebe8f1cc6ef2 breaks building LLVM with LLD 13 + ThinLTO in some way I don't quite understand yet.
[...]
And then there are lots and lots of errors about undefined symbols from libLLVMSupport.a:

2022-04-07T16:05:15.3430029Z ld.lld: error: undefined symbol: llvm::StringMapImpl::LookupBucketFor(llvm::StringRef)
2022-04-07T16:05:15.3430575Z >>> referenced by llvm-config.cpp
2022-04-07T16:05:15.3431724Z >>>               /checkout/obj/build/x86_64-unknown-linux-gnu/llvm/build/lto.cache/Thin-b6b9e0.tmp.o:(std::pair<llvm::StringMapIterator<AvailableComponent*>, bool> llvm::StringMap<AvailableComponent*, llvm::MallocAllocator>::try_emplace<>(llvm::StringRef))
2022-04-07T16:05:15.3432297Z 
2022-04-07T16:05:15.3432688Z ld.lld: error: too many errors emitted, stopping now (use -error-limit=0 to see all errors)
2022-04-07T16:05:15.3459873Z clang-13: error: linker command failed with exit code 1 (use -v to see invocation)
2022-04-07T16:05:15.3475933Z make[2]: *** [bin/llvm-config] Error 1

It kind of seems like symbols from the bitcode files in the archive are no longer found after additional non-bitcode objects were added.

Anyone have an idea what the issue here is?

I can't reproduce this with LLD 13.0.1 + the CMake arguments you posted that enable thinLTO. But given the nature of the error I would focus my attention on "-DCMAKE_AR=/rustroot/bin/llvm-ar", as not emitting a proper archive. I used the llvm-ar from TOT llvm and I didn't see an issue, do you see a difference if you use the latest one?

In any case, if you pass -DLLVM_DISABLE_ASSEMBLY_FILES=ON it will avoid including the assembly files and likely avoid the issue you are hitting.

nikic added a comment.Apr 19 2022, 2:44 AM

It looks like https://github.com/llvm/llvm-project/commit/5426da8ffa4a6d55adab21026ce6ebe8f1cc6ef2 breaks building LLVM with LLD 13 + ThinLTO in some way I don't quite understand yet.
[...]
And then there are lots and lots of errors about undefined symbols from libLLVMSupport.a:

2022-04-07T16:05:15.3430029Z ld.lld: error: undefined symbol: llvm::StringMapImpl::LookupBucketFor(llvm::StringRef)
2022-04-07T16:05:15.3430575Z >>> referenced by llvm-config.cpp
2022-04-07T16:05:15.3431724Z >>>               /checkout/obj/build/x86_64-unknown-linux-gnu/llvm/build/lto.cache/Thin-b6b9e0.tmp.o:(std::pair<llvm::StringMapIterator<AvailableComponent*>, bool> llvm::StringMap<AvailableComponent*, llvm::MallocAllocator>::try_emplace<>(llvm::StringRef))
2022-04-07T16:05:15.3432297Z 
2022-04-07T16:05:15.3432688Z ld.lld: error: too many errors emitted, stopping now (use -error-limit=0 to see all errors)
2022-04-07T16:05:15.3459873Z clang-13: error: linker command failed with exit code 1 (use -v to see invocation)
2022-04-07T16:05:15.3475933Z make[2]: *** [bin/llvm-config] Error 1

It kind of seems like symbols from the bitcode files in the archive are no longer found after additional non-bitcode objects were added.

Anyone have an idea what the issue here is?

I can't reproduce this with LLD 13.0.1 + the CMake arguments you posted that enable thinLTO. But given the nature of the error I would focus my attention on "-DCMAKE_AR=/rustroot/bin/llvm-ar", as not emitting a proper archive. I used the llvm-ar from TOT llvm and I didn't see an issue, do you see a difference if you use the latest one?

This was a pretty good guess -- it turns out that due to a build system bug, we ended up using ranlib rather than llvm-ranlib for archive finalization, which means that the bitcode symbols were not included in the archive map. Presumably there is some kind of fallback if the archive map is empty, so we didn't run into this before the archive started including both bitcode and ELF object files. Setting the correct ranlib via CMAKE_RANLIB fixes the issue.

Sorry for the noise!

It looks like https://github.com/llvm/llvm-project/commit/5426da8ffa4a6d55adab21026ce6ebe8f1cc6ef2 breaks building LLVM with LLD 13 + ThinLTO in some way I don't quite understand yet.
[...]
And then there are lots and lots of errors about undefined symbols from libLLVMSupport.a:

2022-04-07T16:05:15.3430029Z ld.lld: error: undefined symbol: llvm::StringMapImpl::LookupBucketFor(llvm::StringRef)
2022-04-07T16:05:15.3430575Z >>> referenced by llvm-config.cpp
2022-04-07T16:05:15.3431724Z >>>               /checkout/obj/build/x86_64-unknown-linux-gnu/llvm/build/lto.cache/Thin-b6b9e0.tmp.o:(std::pair<llvm::StringMapIterator<AvailableComponent*>, bool> llvm::StringMap<AvailableComponent*, llvm::MallocAllocator>::try_emplace<>(llvm::StringRef))
2022-04-07T16:05:15.3432297Z 
2022-04-07T16:05:15.3432688Z ld.lld: error: too many errors emitted, stopping now (use -error-limit=0 to see all errors)
2022-04-07T16:05:15.3459873Z clang-13: error: linker command failed with exit code 1 (use -v to see invocation)
2022-04-07T16:05:15.3475933Z make[2]: *** [bin/llvm-config] Error 1

It kind of seems like symbols from the bitcode files in the archive are no longer found after additional non-bitcode objects were added.

Anyone have an idea what the issue here is?

I can't reproduce this with LLD 13.0.1 + the CMake arguments you posted that enable thinLTO. But given the nature of the error I would focus my attention on "-DCMAKE_AR=/rustroot/bin/llvm-ar", as not emitting a proper archive. I used the llvm-ar from TOT llvm and I didn't see an issue, do you see a difference if you use the latest one?

This was a pretty good guess -- it turns out that due to a build system bug, we ended up using ranlib rather than llvm-ranlib for archive finalization, which means that the bitcode symbols were not included in the archive map. Presumably there is some kind of fallback if the archive map is empty, so we didn't run into this before the archive started including both bitcode and ELF object files. Setting the correct ranlib via CMAKE_RANLIB fixes the issue.

Sorry for the noise!

With 15.0.0 (D119074), the archive symbol table will be ignored. Using ranlib will be fine (though not recommended), too.