This patch adds binary id into profiles to easily associate binaries
with the corresponding profiles. There is also an RFC that
discusses the motivation, design and implementation in more detail:
https://lists.llvm.org/pipermail/llvm-dev/2021-June/151154.html
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
compiler-rt/include/profile/InstrProfData.inc | ||
---|---|---|
132 | Instead of HasBuildId as a boolean, I think it might be better to use BuildIdSize to allow multiple ids. While raw profiles should have either 0 or 1 in practice, indexed profiles created by merging multiple raw profiles could have multiple ids, so using BuildIdSize would allow using the same header format for both raw and indexed format. | |
compiler-rt/lib/profile/InstrProfilingWriter.c | ||
36 | I'm not sure if we should be using the term BuildId since it can be interpreted as ELF-specific. We want this to generalize to other binary formats like Mach-O and COFF. In Mach-O, it's called LC_UUID. In COFF, it's usually referred to as GUID. When searching online, I noticed that lld also build ID for COFF, see D36758, but I'm not sure if that's official name or if they just copied the name from the ELF implementation. So one option would be to just use BuildId but make it clear that it's a generic term. Another option would be a new name, for example BinaryId. | |
263 | This function implements ELF-specific logic so it cannot be in InstrProfilingWriter.c which is platform agnostic. It should be InstrProfilingPlatformLinux.c and we'll need COFF and Mach-O equivalents in InstrProfilingPlatformWindows.c and InstrProfilingPlatformDarwin.c respectively (which could be empty in the initial implementation). | |
291 | I think that we should avoid allocating the BuildId struct on the heap which could cause an issue when allocator itself is instrumented. Two solutions I can think of is to either allocate the struct as static (it's 8-16 bytes which is not so bad) or alternatively this function could use ProfDataWriter to directly write out the struct in which case it could be allocated on stack. |
- Move elf specific details into InstrProfilingPlatformLinux.c
- Remove build id struct
- Use binary id instead of build id
compiler-rt/include/profile/InstrProfData.inc | ||
---|---|---|
132 | Can a raw profile with build id and another raw profile without a build id be merged and create indexed profiled? | |
compiler-rt/lib/profile/InstrProfilingWriter.c | ||
36 | I renamed it to binary id, and I will briefly explain other ids used in other platforms in the RFC. | |
291 | It seems like ValueProfiling already uses heap allocation compiler-rt/lib/profile/InstrProfilingValue.c, but we don't really need heap allocation for build id. |
compiler-rt/include/profile/InstrProfData.inc | ||
---|---|---|
132 | @phosek, it seems to me like raw and indexed profile do not share the same profile format. Whereas, indexed profile header is defined in: I think, we might need to extend both formats with binary id then. |
compiler-rt/include/profile/InstrProfData.inc | ||
---|---|---|
132 | Yes, looks like we'll need to extend both formats but I still think it'd be preferable to support arbitrary number of binary id's in both formats since at least in ELF, you can have more than one build ID. The format of should also be the same in both the raw and indexed format, that is a sequence of tuples where each tuple is a length and a sequence of bytes of that length. | |
compiler-rt/lib/profile/InstrProfiling.h | ||
101 ↗ | (On Diff #344474) | I'd slightly prefer returning both the data and the size as output parameters for consistency. We might also consider having a boolean/int return argument to signal an error. |
compiler-rt/lib/profile/InstrProfilingWriter.c | ||
287–288 | You shouldn't need an extra variable for the pointer. | |
291 | That's true but it's also possible to allocate counter for value profiling statically if you use -mllvm -vp-static-alloc in which case you can avoid malloc which may be desirable on some platforms (this is what we would likely use on Fuchsia for example). | |
llvm/tools/llvm-profdata/llvm-profdata.cpp | ||
2256 | I think we'll want this functionality even in the final version but we should introduce additional flag, for example -show-binary-id, and only print the binary id if that flag is set. |
- Support multiple binary ids
- Extend indexed profile format to include binary ids
- Add tests
compiler-rt/lib/profile/InstrProfilingPlatformLinux.c | ||
---|---|---|
102 | This is insufficient. The note "name" determines the space of n_type values. You must verify that the n_namesz=4 and the name bytes are "GNU\0" as well, or else this is some unrelated note that happens to use n_type=3 where 3 does not mean NT_GNU_BUILD_ID. | |
105 | This arithmetic is not right since Note has a type with sizeof>1. An example of correct arithmetic is: Note = (const ElfW(Nhdr)*)((const char *)(Note + 1) + RoundUp(Note->n_namesz, 4) + RoundUp(Note->n_descsz, 4)); | |
120 | Can you abstract this logic into a subroutine so the code is not repeated in the two functions? | |
compiler-rt/test/profile/binary-id.c | ||
2 | You may need to pass explicit -Wl,--build-id=none through the compiler here since it's sometimes on by default. | |
llvm/include/llvm/ProfileData/InstrProfReader.h | ||
85 | Is this the number of distinct IDs, or the byte size of the total block? (AIUI the total block is a sequence of {uint64_t n, uint8_t[n]} blocks.) If it's the number of distinct IDs, then I don't understand how a single uint8_t* from getBinaryIds() is meant to be used. For a number like that, I think Count is a clearer name to use than Size. | |
91 | Should these really be two separate methods? Or should it just be one method that returns both the size and the data in a span-style data structure? | |
484 | I don't understand the commented-out declaration. | |
llvm/lib/ProfileData/InstrProfReader.cpp | ||
541 | This is wildly susceptible to bad data. |
compiler-rt/lib/profile/InstrProfilingPlatformLinux.c | ||
---|---|---|
105 | My understanding is that each note in notes section starts with a struct which includes n_namesz, n_descsz, and n_type members. | |
llvm/include/llvm/ProfileData/InstrProfReader.h | ||
85 |
It is the number of binary IDs. That file consistently uses Size to refer to number of elements like DataSize, CounterSize, NamesSize, etc. | |
llvm/lib/ProfileData/InstrProfReader.cpp | ||
541 |
| |
541 | How can I improve that? I need to somehow increment the pointer by binary id length. |
compiler-rt/include/profile/InstrProfData.inc | ||
---|---|---|
132 | I'd prefer to put this field at the end of the header to match the position of binary ID in the profile. | |
compiler-rt/lib/profile/InstrProfilingInternal.h | ||
202 | This name is a bit of a mouthful. I'd consider calling it just __llvm_write_binary_ids and say in the comment that the function always returns the number of binary ids even if writer is NULL. | |
compiler-rt/lib/profile/InstrProfilingPlatformLinux.c | ||
105 | I think we should be returning the number of binary ids in every case, even if writer != NULL. | |
105 | When you increment a pointer, the address is moved by the size of the underlying type: char* c = ...; c++; // c increments by 1 == sizeof(char) uint64_t* l = ...; l++; // l increments by 8 == sizeof(uint64_t) In this case sizeof(Note) > 1 so you first need to cast it to const char *. | |
llvm/lib/ProfileData/InstrProfReader.cpp | ||
517 | Where is this deallocated? It seems like this memory gets leaked. BinaryIdTy is just a plain struct with just two 8 byte fields (on a 64-bit machine), can we avoid allocating it on the heap and return/pass it by value to simplify lifetime management? | |
541 | I think at minimum, you need to check that you aren't reading past the end of the file in the case the length is incorrect. |
llvm/lib/ProfileData/InstrProfReader.cpp | ||
---|---|---|
517 | I used unique_ptr, so memory will be released after BinaryIdTy goes out of scope. |
llvm/lib/ProfileData/InstrProfReader.cpp | ||
---|---|---|
517 | I'd still prefer passing these by value, heap allocation in this case seems unnecessary and is going to be less efficient. |
Here's the link to the RFC:
https://lists.llvm.org/pipermail/llvm-dev/2021-June/151154.html
Repeating the same question asked in RFC: the profile data has builtin fine grained matching mechanism. For the top level stamping, is it enough to embed the build id in the file name of the profile data?
Just to clarify: we are suggesting to embed build in the profile file itself, not in the profile file name.
This will also allow us to embed multiple build ids if profiles are merged together.
Can you please elaborate on the builtin fine grained matching mechanism and why do you think it will be a problem?
We need to be able to match the profile back to the corresponding binary when generating coverage. Embedding build ID in filename doesn't work for us because on Fuchsia, most processes don't have access to filesystem (it's a capability that most processes shouldn't need). Rather the profile is exported via IPC as a memory object that doesn't have any name.
There might be a Fuchsia-specific solution, like wrapping the profile in an envelope that contains he build ID, but that solution would be Fuchsia-specific and we thought that making profiles more self-descriptive could be generally useful.
One idea we would like to pursue in the future is implementing support for the debuginfod protocol in LLVM, including llvm-cov, you could then use llvm-cov --instr-profile default.profraw and llvm-cov would fetch both the binary and the source from debuginfod server using the build ID inside the profile. In our case, this would not only simplify our infrastructure but also make it easy for developers to generate coverage reports from profiles fetched directly from our CI to reproduce results.
I am less concerned about raw profile format change (the version still needs to be bumped), but for the indexed format, the version needs to be bumped and it needs to guarantee that format is backward compatible. The patch does not seem to handle it.
Is it necessary to have binary id support in the indexed format?
Yes, we need to bump the version. There is still some work that needs to be done in this patch like bumping the version, adjusting some of the tests, and adding more tests, etc.
We would like to get early feedback to see whether there is any fundamental issue about the approach.
@phosek can correct me, but I think we need to have binary id support in the indexed format.
We merge raw profiles into indexed profiles before generating source code coverage reports, and in order to associate binaries while generating coverage, we need to have binary id in the indexed profiles.
Is the merging done in process or offline? Assuming it is offline, it seems possible to use name base approach.
It's done offline but since the profiles published via IPC don't have any names, we cannot rely or filenames, we would need to introduce a custom Fuchsia wrapper format (see my previous comment about the "envelope" idea).
We don't necessarily need binary ID in the indexed format, what we could do is:
llvm-profdata show --binary-id a.profraw >a.buildid # fetch a binary with a.buildid as a.out llvm-profdata show --binary-id b.profraw >b.buildid # fetch a binary with b.buildid as b.out ... llvm-profdata merge -o merged.profdata a.profraw b.profraw ... llvm-cov show --instr-profile=merged.profdata a.out b.out ...
If we also stored binary IDs inside the indexed format, we could simplify this and do:
llvm-profdata show --binary-ids merged.profdata >merged.buildids # fetch all binaries
This is more efficient but either would be fine with us for now.
Storing binary IDs inside the indexed profile would become really valuable if/once we have support debuginfod support at which point we should be able to just do:
llvm-profdata merge -o merged.profdata a.profraw b.profraw ... llvm-cov show --instr-profile=merged.profdata
and llvm-cov would fetch binaries directly from the debuginfod server using binary IDs stored inside the indexed profile. It'll take a while before we have debuginfod support available in LLVM though so this is not critical for now, we're just trying to plan ahead.
Is extending the indexed format a problem? Is there some way to make it less of an issue?
I suppose this workflow can also be done with
llvm-cov show --instr-profile=merged.profdata --buildids=<file of list of ids> without the need to fetch the binaries.
Is extending the indexed format a problem? Is there some way to make it less of an issue?
It is not an issue, but I think it is preferable minimize version bumps for indexed format to reduce longer term churns. Given this restriction, it is better to make version change tied to functionalities that can do without a format change. Scripts, wrappers etc. are better for non-essential changes -- especially for workflows that can be mostly automated (which minimizes human inconveniences).
- Add binary id between profile header and data
- Raw profile version bump
- Only add binary id into raw profiles
@aeubanks I just added you because I modified a test file (corrupted-profile.c), and you are the author of that test file.
that test change seems fine
I remember I copied the structure from another existing test though, somewhere in llvm/test/tools/llvm-profdata IIRC
@davidxl we split the indexed profile change out of this change since it's not necessary for us right now, would it be possible to take a look again?
compiler-rt/include/profile/InstrProfData.inc | ||
---|---|---|
140 | I'd prefer to move this field just after Version before DataSize to match the order of data inside the profile. |
compiler-rt/lib/profile/InstrProfilingPlatformLinux.c | ||
---|---|---|
92 | Is it possible to extract this into a common helper function like void ForEachNote(int note_type, .. note_start, .. note_end, ... call_back ) { ... } |
compiler-rt/lib/profile/InstrProfilingPlatformLinux.c | ||
---|---|---|
92 | I think your concern was about the readability of that function, so I broke that function into multiple functions to increase readability. |
LGTM
llvm/lib/ProfileData/InstrProfReader.cpp | ||
---|---|---|
537 | When there is no binary ids to be print, perhaps output a line with 'None' instead of empty (which looks like the output is truncated. |
llvm/lib/ProfileData/InstrProfReader.cpp | ||
---|---|---|
537 | When there is no binary id, we don't print anything at all. if (BinaryIdsSize == 0) return success(); |
Modified some raw profile test files to include the new profile version (version 6) and new header element (BinaryIdsSize).
compiler-rt/test/profile/binary-id.c | ||
---|---|---|
12 | Hi @gulfem, |
compiler-rt/test/profile/binary-id.c | ||
---|---|---|
12 | Hi @vitalybuka, |
Yes, looks like both LLVM servers are down https://lab.llvm.org/
FYI The special about that staging bot is that it is Ubuntu with recent
glibc 2.34. Our primary bot runs exactly the same build on Debian with
glibc 2.28 and the test passes there. It can be glibc or other dependencies.
I tried to reproduce it on my machine, but I it uses Debian (not Ubuntu), so the issue did not reproduce.
With that patch, we started embedding build id (a unique identifier) into llvm profiles.
That test basically enables build id in the binary by using -Wl,--build-id -O2 option.
It then generates a llvm profile, and tries to read embedded build id from the profile.
What is the output of the following command?
/b/sanitizer-x86_64-linux/build/llvm_build64/bin/clang -m64 -ldl -fprofile-instr-generate -Wl,--build-id -O2 -o /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/profile/Linux/binary-id.c readelf -n /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp
If we have build id in the binary, readelf (you can also use llvm-readelf instead) should have some output like that:
Displaying notes found in: .note.gnu.build-id Owner Data size Description GNU 0x00000014 NT_GNU_BUILD_ID (unique build ID bitstring) Build ID: 8699f6e0c4e12b872aae7e1f37fa6ba2564e9702
If we have build id in the binary, we should then check whether build id is embedded in the profile:
env LLVM_PROFILE_FILE=/b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp.profraw /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp llvm-profdata show --binary-ids /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp.profraw
What is the output of the above command?
buildbot@sanitizer-buildbot-vb-wlz1:/b/sanitizer-x86_64-linux/build/compiler_rt_build$ /b/sanitizer-x86_64-linux/build/llvm_build64/bin/clang -m64 -ldl -fprofile-instr-generate -Wl,--build-id=none -O2 -o /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/profile/Linux/binary-id.c buildbot@sanitizer-buildbot-vb-wlz1:/b/sanitizer-x86_64-linux/build/compiler_rt_build$ env LLVM_PROFILE_FILE=/b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp.profraw /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp buildbot@sanitizer-buildbot-vb-wlz1:/b/sanitizer-x86_64-linux/build/compiler_rt_build$ ../llvm_build64/bin/llvm-profdata show --binary-ids /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp.profraw > /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp.out buildbot@sanitizer-buildbot-vb-wlz1:/b/sanitizer-x86_64-linux/build/compiler_rt_build$ ../llvm_build64/bin/llvm-profdata show --binary-ids /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp.profraw Instrumentation level: Front-end Total functions: 3 Maximum function count: 1 Maximum internal block count: 0 buildbot@sanitizer-buildbot-vb-wlz1:/b/sanitizer-x86_64-linux/build/compiler_rt_build$ ../llvm_build64/bin/llvm-profdata merge -o /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp.profdata /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp.profraw buildbot@sanitizer-buildbot-vb-wlz1:/b/sanitizer-x86_64-linux/build/compiler_rt_build$ /b/sanitizer-x86_64-linux/build/llvm_build64/bin/clang -m64 -ldl -fprofile-instr-generate -Wl,--build-id -O2 -o /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp /b/sanitizer-x86_64-linux/build/llvm-project/compiler-rt/test/profile/Linux/binary-id.c buildbot@sanitizer-buildbot-vb-wlz1:/b/sanitizer-x86_64-linux/build/compiler_rt_build$ env LLVM_PROFILE_FILE=/b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp.profraw /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp buildbot@sanitizer-buildbot-vb-wlz1:/b/sanitizer-x86_64-linux/build/compiler_rt_build$ ../llvm_build64/bin/llvm-profdata show --binary-ids /b/sanitizer-x86_64-linux/build/compiler_rt_build/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp.profraw Instrumentation level: Front-end Total functions: 3 Maximum function count: 1 Maximum internal block count: 0
And no Binary IDs
../llvm_build64/bin/llvm-readelf -n test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp Displaying notes found in: .note.ABI-tag Owner Data size Description GNU 0x00000010 NT_GNU_ABI_TAG (ABI version tag) OS: Linux, ABI: 3.2.0 Displaying notes found in: .note.gnu.property Owner Data size Description GNU 0x00000010 NT_GNU_PROPERTY_TYPE_0 (property note) Properties: x86 ISA needed: x86-64-baseline Displaying notes found in: .note.gnu.build-id Owner Data size Description GNU 0x00000014 NT_GNU_BUILD_ID (unique build ID bitstring) Build ID: 3dff2793618e5a04cbfef83483bc82d7d63b807d Displaying notes found in: .note.gnu.gold-version Owner Data size Description GNU 0x00000009 NT_GNU_GOLD_VERSION (gold version) Version: gold 1.16
../llvm_build64/bin/llvm-profdata show --binary-ids test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp.profraw Instrumentation level: Front-end Total functions: 3 Maximum function count: 1 Maximum internal block count: 0
compiler-rt/lib/profile/InstrProfilingPlatformLinux.c | ||
---|---|---|
178 | Here is the problem |
compiler-rt/lib/profile/InstrProfilingPlatformLinux.c | ||
---|---|---|
178 | PT_NOTE is the note segment in the program header, and that segment can have multiple notes (https://man7.org/linux/man-pages/man5/elf.5.html). gulfem@gulfem:~/llvm-release-build$ bin/llvm-readelf -n /usr/local/google/home/gulfem/llvm-release-build/projects/compiler-rt/test/profile/Profile-x86_64/Linux/Output/binary-id.c.tmp Displaying notes found in: .note.gnu.build-id Owner Data size Description GNU 0x00000014 NT_GNU_BUILD_ID (unique build ID bitstring) Build ID: 56559dcdb2c5b111e7e875c183ca2e0a1f433c96 Displaying notes found in: .note.ABI-tag Owner Data size Description GNU 0x00000010 NT_GNU_ABI_TAG (ABI version tag) OS: Linux, ABI: 3.2.0 Here, we iterate through all the notes. |
@vitalybuka, it looks like we can have have multiple note segments in the program header, and each note segment can contain multiple notes.
Thank you @mcgrathr for pointing that out.
I think your case is like that, and I'll push a fix for that.
Instead of HasBuildId as a boolean, I think it might be better to use BuildIdSize to allow multiple ids. While raw profiles should have either 0 or 1 in practice, indexed profiles created by merging multiple raw profiles could have multiple ids, so using BuildIdSize would allow using the same header format for both raw and indexed format.