This is an archive of the discontinued LLVM Phabricator instance.

[LLD][LLVM] CG Graph profile using relocations
ClosedPublic

Authored by ayermolo on Jun 10 2021, 5:17 PM.

Details

Summary

Currently when .llvm.call-graph-profile is created by llvm it explicitly encodes the symbol indices. This section is basically a black box for post processing tools. For example, if we run strip -s on the object files the symbol table changes, but indices in that section do not. In non-visible behavior indices point to wrong symbols. The visible behavior indices point outside of Symbol table: "invalid symbol index".

This patch changes the format by using R_*_NONE relocations to indicate the from/to symbols. The Frequency (Weight) will still be in the .llvm.call-graph-profile, but symbol information will be in relocation section. In LLD information from both sections is used to reconstruct call graph profile. Relocations themselves will never be applied.

With this approach post processing tools that handle relocations correctly work for this section also. Tools can add/remove symbols and as long as they handle relocation sections with this approach information stays correct.

Doing a quick experiment with clang-13.
The size went up from 107KB to 322KB, aggregate of all the input sections. Size of clang-13 binary is ~118MB. For users of -fprofile-use/-fprofile-sample-use the size of object files will go up slightly, it will not impact final binary size.

Diff Detail

Event Timeline

ayermolo created this revision.Jun 10 2021, 5:17 PM
ayermolo requested review of this revision.Jun 10 2021, 5:17 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 10 2021, 5:17 PM
ayermolo updated this revision to Diff 351333.EditedJun 10 2021, 7:09 PM

Fixed up some build and clang-tidy issues.

The size went up from 107KB to 322KB, aggregate of all the input sections.

That sounds like a lot (3x increase). How does this impact link times?

Is this patch intended to be reviewed as-is, or is this still illustrative?

The size went up from 107KB to 322KB, aggregate of all the input sections.

That sounds like a lot (3x increase). How does this impact link times?

Is this patch intended to be reviewed as-is, or is this still illustrative?

Sorry, yes reviewed as-is.
The increase is driven by couple of things. This is for ELF64 so to/from total goes from 8 bytes to 16 bytes. I am also using RELA which adds another 8 bytes for addend. @MaskRay suggested using REL, but looking at the code where it's chooses, it seems like there is just one global flag for the writer. Not sure if this is on purpose or just for ease of implementation. If anyone has insights would love to hear it.
I did timings on clang.
Old:
0:01.16 real, 1.07 user, 6.63 sys, 0 amem, 685040 mmem
0:01.23 real, 1.17 user, 8.87 sys, 0 amem, 689600 mmem
0:01.17 real, 1.15 user, 7.19 sys, 0 amem, 690520 mmem

New
0:01.10 real, 1.11 user, 6.46 sys, 0 amem, 690772 mmem
0:01.10 real, 1.10 user, 7.25 sys, 0 amem, 685932 mmem
0:01.09 real, 1.09 user, 6.28 sys, 0 amem, 685900 mmem

user time went up negligibly, real time and sys time are noise.
Relocation sections do not participate in actual relocations, and it's just a linear scan through memory to process them. I doubt it will have significant impact on linking times with bigger binaries.

MaskRay added inline comments.Jun 15 2021, 3:51 PM
lld/ELF/Driver.cpp
870

uint32_t

871

uint32_t

lld/ELF/InputFiles.cpp
674

remove braces around one-line simple statement

llvm/lib/MC/MCELFStreamer.cpp
498

it's ugly, but you can use .reloc xxx, BFD_RELOC_NONE, sym

llvm/lib/Target/AVR/MCTargetDesc/AVRAsmBackend.h
53

unreachable doesn't need return

Are call graph sections expected in fully linked output?

llvm/lib/MC/MCELFStreamer.cpp
508
llvm/test/tools/llvm-readobj/ELF/call-graph-profile.test
41–51

We usually line up the values in a block to make it a little more readable.

llvm/test/tools/obj2yaml/ELF/call-graph-profile-section.yaml
34

obj2yaml doesn't care about the relocation section. The code for converting the section to yaml should be independent of the relocations as far as I understand it. I'd remove the relocations to avoid confusing the test. This should allow you then to remove most of the other changes in this test too.

57–58

We're not matching whitespace exactly, so I'd minimise the extra, much like it was before, so that the key and value are closer. This tends to be a little bit more readable. Same goes throughout.

llvm/test/tools/yaml2obj/ELF/call-graph-profile-section.yaml
48

Don't check the SectionData of the relocation section. In fact, don't bother having a relocation section at all - this isn't relevant to how yaml2obj generates the call graph section.

llvm/tools/llvm-readobj/ELFDumper.cpp
1783
1853–1860

This code will only allow one .cg_profile section per object, which may or may not be appropriate (it's true that this was the case before, but was it correct then?). Either way, we already have printRelocatableStackSizes which finds all .stack_sizes sections and their corresponding relocation sections. Maybe rather than having two similar pieces of code that do basically the same thing, but in different ways, we should combine them?

6703–6704

Seems like it should be some sort of warning and print a partial profile section when the relocations can't be found.

6718–6719

Error and warning messages shouldn't end with . as per the coding standards.

I'd actually rephrase this: "unable to load relocations for SHT_LLVM_CALL_GRAPH_PROFILE section: " + toString(CGProfileOrErr.takeError())

This error case needs a test case.

In a similar vein to my earlier comment, we might want to include the section index of the call graph section in this message, so that if there is more than one such section, we can easily identify the problematic one.

6722–6723

Seems like this shouldn't be an assert, but a proper warning? If you create a broken object with the wrong number of relocations, this assertion will fire.

6730

Match the return type of size().

ayermolo added inline comments.Jun 16 2021, 1:55 PM
llvm/lib/MC/MCELFStreamer.cpp
498

This will reduce the scope of changes in this patch, but then new backend creator will need to know of this trick. I don't know the history of BFD_RELOC_NONE. Is it expected for architectures that have _NONE to have the mapping?
. .Case("BFD_RELOC_NONE", ELF::R_<arch>_NONE)

llvm/lib/Target/AVR/MCTargetDesc/AVRAsmBackend.h
53

I was following example of fixupNeedsRelaxation. Will remove.

llvm/tools/llvm-readobj/ELFDumper.cpp
1853–1860

Hmm, well from my understanding if it comes from clang side it should be one per object file. I guess someone can construct object with multiple ones, using say yaml2obj, but what would be practical point of it? LLD right now doesn't support that scenario.
I looked in to printRelocatableStackSizes, not sure there is enough code worth combining, but seems like a good way to keep all the code related to printing cg profile in one place. Let me refactor and see what comes out of it.

6722–6723

I can change it to warning, but then will need to add a check not to execute a loop.

ayermolo added inline comments.Jun 16 2021, 3:34 PM
llvm/test/tools/yaml2obj/ELF/call-graph-profile-section.yaml
48

I think relocation is needed in this case. Check is done with llvm-readobj --cg-profile. Unless that flag can be dropped?

modimo added a subscriber: modimo.Jun 16 2021, 4:50 PM
jhenderson added inline comments.Jun 17 2021, 12:16 AM
llvm/test/tools/yaml2obj/ELF/call-graph-profile-section.yaml
48

The llvm-readobj check of the cg-profile data is a nice-to-have to confirm that our tools are self-consistent. However, I'm not sure it's really necessary, since we have a dedicated llvm-readobj test that uses the yaml output, which covers this sort of case.

llvm/tools/llvm-readobj/ELFDumper.cpp
1853–1860

I was thinking the factored out code would be something like:

std::vector<std::pair<Elf_Shdr*, Elf_Shdr*>> getSectionAndRelocations(std::function<bool(Elf_Shdr*)> IsMatch) {/*...*/}

where each of the different users would provide their own predicate (i.e. name or type check in the cg profile and stack sizes cases). Types in the example are just a rough idea. In the cgprofile case, we can just pick one pair from the vector.

6722–6723

This is what I had in mind, and is similar to the pattern elsewhere:

if (CGProfileRelaOrError->size() != (CGProfileOrErr->size() * 2)) {
  reportUniqueWarning(...);
  return;
}
ayermolo added inline comments.Jun 17 2021, 10:40 AM
llvm/tools/llvm-readobj/ELFDumper.cpp
1853–1860

Ah yeah, after I re-wrote printCGProfile yesterday evening, I see your point. The loop ended up being basically same as in printRelocatableStackSizes, except for checks. Will refactor it in to helper function. Also made it so that relocation section is optional. Will print warning if it's missing, but continue output weights. So some of the yaml tests can be cleaned up.

ayermolo updated this revision to Diff 352863.Jun 17 2021, 3:35 PM

Incorporated feedback. Notable chagnes is refactor of printCGProfile to be more self contained, and support potential multiple call graph sections. printRelocatableStackSizes was chagned to use same helper function.

ayermolo marked 15 inline comments as done.Jun 17 2021, 3:41 PM
ayermolo updated this revision to Diff 352865.Jun 17 2021, 3:52 PM

small description test fix

ayermolo updated this revision to Diff 352874.Jun 17 2021, 4:44 PM

small fix

ayermolo updated this revision to Diff 352881.Jun 17 2021, 5:22 PM

somehow missed a test.

jhenderson added inline comments.Jun 21 2021, 1:09 AM
llvm/lib/MC/MCELFStreamer.cpp
508

This hasn't been addressed properly: the = at the end of the comment in my inline edit is important, as clang-format interprets such a pattern differently (note that if you add the equals sign and then reformat, the space between the comment and 8 will disappear.

llvm/test/tools/llvm-readobj/ELF/call-graph-profile.test
130
161
211
llvm/test/tools/yaml2obj/ELF/call-graph-profile-section.yaml
84

I think you can drop this entire test case, since the YAML no longer refers to symbols for the call graph section.

134

Ditto: we no longer refer to symbols directly in the YAML for this section, so drop this case.

llvm/tools/llvm-readobj/ELFDumper.cpp
6732–6733

Get rid of the unnecessary "with relocations".

6740

The style is no capital for the first letter of warnings.

6745
6747

The return type of Rel.getSymbol(...) is already uint32_t, so no need for the trailing return type to be explicitly specified.

ayermolo updated this revision to Diff 353451.Jun 21 2021, 12:16 PM

Addressed comments.

ayermolo marked 8 inline comments as done.Jun 21 2021, 12:32 PM
ayermolo added inline comments.
llvm/lib/MC/MCELFStreamer.cpp
508

ah, sorry missed the =.

llvm/tools/llvm-readobj/ELFDumper.cpp
6747

Also modified lambda in lld driver to follow the same style.

jhenderson accepted this revision.Jun 22 2021, 12:29 AM

No more comments from me - I'm happy with the changes outside LLD, although haven't looked in depth at those within the LLD code. I also don't use the CG profile stuff or pretend to really follow it, so @MaskRay should sign off on that side of things.

This revision is now accepted and ready to land.Jun 22 2021, 12:29 AM
ayermolo marked 2 inline comments as done.Jun 22 2021, 10:06 AM

No more comments from me - I'm happy with the changes outside LLD, although haven't looked in depth at those within the LLD code. I also don't use the CG profile stuff or pretend to really follow it, so @MaskRay should sign off on that side of things.

Thanks! @MaskRay any further feedback?

Mostly looks good.

lld/ELF/Driver.cpp
866–873

This can happen so assert is not appropriate. You can use if (...) fatal

llvm/include/llvm/MC/MCAsmBackend.h
88

Delete

llvm/lib/MC/MCELFStreamer.cpp
498

Is it expected for architectures that have _NONE to have the mapping? . .Case("BFD_RELOC_NONE", ELF::R_<arch>_NONE)

Yes. In GNU as, BFD_RELOC_NONE is generic and can be used on all targets. I have added BFD_RELOC_NONE to many LLVM targets. Note: .llvm.call-graph-profile is used by -fprofile-use=. The targets which don't support BFD_RELOC_NONE are not used by people for PGO. (If they do use PGO, adding the support should be trivial anyway.)

It probably makes sense to save the return value in a variable and assert it is None. I find that an unknown BFD_RELOC_NONE does not cause an error.

llvm/lib/Target/X86/MCTargetDesc/X86AsmBackend.cpp
677

This function and its friends can all be deleted.

llvm/tools/llvm-readobj/ELFDumper.cpp
6752

I != Size is probably more common in LLVM code

MaskRay added a comment.EditedJun 22 2021, 10:33 PM

You can mention that this does increase -fprofile-use= object file sizes a bit but probably don't matter.

For instrumentation PGO users, -fprofile-generate= object files are much larger than -fprofile-use=, so -fprofile-use= slightly size increases don't matter.
For sample PGO users, this may increase the peak object file sizes a bit, but doesn't matter as your measurement for clang-13 shows.

Test Plan:
ninja check all, manual rtesting

This should be deleted when committing.

The size went up from 107KB to 322KB, aggregate of all the input sections.

How large is your clang-13? ~100MiB? Then this increase looks neligible.

ayermolo added inline comments.Jun 23 2021, 10:37 AM
llvm/lib/MC/MCELFStreamer.cpp
498

ok, will remove getNoneRelcationName(), and use BFD_RELOC_NONE here. Thanks for the background.

ayermolo updated this revision to Diff 354059.Jun 23 2021, 1:07 PM
ayermolo marked 5 inline comments as done.

addressign feedback

ayermolo edited the summary of this revision. (Show Details)Jun 23 2021, 1:08 PM
ayermolo marked an inline comment as done.
ayermolo added inline comments.
lld/ELF/Driver.cpp
866–873

What is general guideline for fatal vs assert in llvm? I looked at Coding Standard, and didn't see anything (might have missed it)

ayermolo updated this revision to Diff 354062.Jun 23 2021, 1:16 PM
ayermolo marked an inline comment as done.

somehow random file made it in to diff, probably clang format related.

MaskRay added inline comments.Jun 23 2021, 1:29 PM
lld/ELF/Driver.cpp
866–873

My personal guideline is:

assert is for ensuring an invariant doesn't break - it shouldn't break even for malicious user input.

If you know some user input would crash, prefer a proper error reporting mechanism, in lld errorOrWarn/fatal/error. In LLVM internal code you can find many report_fatal_error(similar to assert, just with diagnostic even when assertions is disabled). Many are lazy examples which are not good references.

867

drop trailing period

Don't capitalize

MaskRay accepted this revision.Jun 23 2021, 1:31 PM

LG baring the comment

ayermolo updated this revision to Diff 354069.Jun 23 2021, 1:35 PM

Addressed last comment.

jrtc27 added inline comments.Jun 23 2021, 1:36 PM
lld/ELF/InputFiles.cpp
674

What if it's SHT_REL? This won't fail nicely, it'll give an error about sh_entsize not matching. You should either handle SHT_REL properly (i.e. support it) or not even attempt to get an Elf_Rela for it.

ayermolo updated this revision to Diff 354087.Jun 23 2021, 3:18 PM

Changed lld to handle call graph as RELA only, and emits warning for REL.

ayermolo added inline comments.Jun 23 2021, 3:20 PM
lld/ELF/InputFiles.cpp
674

Something like this?

Patch description is non-descriptive, does not provide any motivation as to why this is wanted e.g.

jrtc27 added inline comments.Jun 23 2021, 3:21 PM
lld/ELF/InputFiles.cpp
674

Why not error?

ayermolo added inline comments.Jun 23 2021, 3:27 PM
lld/ELF/InputFiles.cpp
674

With error it will prevent binary from being linked. I thought warning would be more appropriate so that result will still be a valid binary minus this optimization. I can change to error if you and @MaskRay think that is more appropriate.

ayermolo added inline comments.Jun 23 2021, 3:31 PM
lld/ELF/InputFiles.cpp
674

With error it will prevent binary from being linked. I thought warning would be more appropriate so that result will still be a valid binary minus this optimization. I can change to error if you and @MaskRay think that is more appropriate.

readCallGraphsFromObjectFiles will also need to be modified, otherwise it will fatal there due to obj->cgProfileRela.size() != obj->cgProfile.size() * 2

ayermolo edited the summary of this revision. (Show Details)Jun 23 2021, 3:33 PM
MaskRay requested changes to this revision.Jun 23 2021, 3:36 PM
MaskRay added inline comments.
lld/ELF/InputFiles.cpp
674

Why doesn't SHT_REL work?

A warning is appropriate, as it is an optional optimization anyway.

This revision now requires changes to proceed.Jun 23 2021, 3:36 PM

For example, if we run strip -s on the object files the symbol table changes, but indices in that section do not.

The description needs to mention what would happen with the old format.

We propose to change this section to use relocations. The Frequency will still be in the .llvm.call-graph-profile, but symbol information will be in relocation section. In LLD information from both sections is used to reconstruct call graph profile. Relocations themselves will never be applied.

We propose => This patch changes the format by using R_*_NONE relocations to indicate the from/to symbols.

With this approach post processing tools that handle relocations correctly work for this section also.

I think you might want to say adding/removing symbols?

ayermolo added inline comments.Jun 23 2021, 4:12 PM
lld/ELF/InputFiles.cpp
674

I didn't think SHT_REL is a common case/used. For X86 only I386 and IAMCU use REL, looking at some of the other architectures it's either ON by default or in case of MIPS enabled for 64 bit.

bool HasRelocationAddend = TT.isArch64Bit();
  return std::make_unique<MipsELFObjectWriter>(OSABI, HasRelocationAddend,
                                                IsN64);

I didn't want to over-complicate the implementation un-necessary, but might have been wrong about it. Should I add support?

MaskRay added inline comments.Jun 23 2021, 4:17 PM
lld/ELF/InputFiles.cpp
674

32-bit arm uses REL so it is not that uncommon. x86-32 does matter less these days and I don't know who want to use -fprofile-use for it. You can leave REL as-is. I can fix it afterwards.

The description needs to be fixed, though.

ayermolo edited the summary of this revision. (Show Details)Jun 23 2021, 4:52 PM
ayermolo edited the summary of this revision. (Show Details)
ayermolo updated this revision to Diff 354110.Jun 23 2021, 4:55 PM

Modified readCallGraphsFromObjectFiles to continue if cgProfileRela is empty. In line with just printing warning if relocaiton is REL.

MaskRay accepted this revision.Jun 23 2021, 7:21 PM
This revision is now accepted and ready to land.Jun 23 2021, 7:21 PM
ayermolo updated this revision to Diff 354150.Jun 23 2021, 9:02 PM

rebased on latest.

@ayermolo is oof for a few days and asked me to commit this for him. I will do that shortly.

This revision was automatically updated to reflect the committed changes.

You can mention that this does increase -fprofile-use= object file sizes a bit but probably don't matter.

For instrumentation PGO users, -fprofile-generate= object files are much larger than -fprofile-use=, so -fprofile-use= slightly size increases don't matter.
For sample PGO users, this may increase the peak object file sizes a bit, but doesn't matter as your measurement for clang-13 shows.

Test Plan:
ninja check all, manual rtesting

This should be deleted when committing.

The size went up from 107KB to 322KB, aggregate of all the input sections.

How large is your clang-13? ~100MiB? Then this increase looks neligible.

I've noticed some increases in object size over here. What's the percentage increase you're expecting here in object file sizes?

You can mention that this does increase -fprofile-use= object file sizes a bit but probably don't matter.

For instrumentation PGO users, -fprofile-generate= object files are much larger than -fprofile-use=, so -fprofile-use= slightly size increases don't matter.
For sample PGO users, this may increase the peak object file sizes a bit, but doesn't matter as your measurement for clang-13 shows.

Test Plan:
ninja check all, manual rtesting

This should be deleted when committing.

The size went up from 107KB to 322KB, aggregate of all the input sections.

How large is your clang-13? ~100MiB? Then this increase looks neligible.

I've noticed some increases in object size over here. What's the percentage increase you're expecting here in object file sizes?

So original implementation was 16-bytes (2x4 bytes for index, 8 bytes for Weight). After this @MaskRay changed it to REL implementation. So that's 40 bytes (2x16 bytes for index in relocation form + 8 bytes for Weight).
When I measured with RELA approach on clang-13 it went from 107KB to 322KB in 118MB binary. With REL, current implementation, that's around 230KB out of 118MB. Now this is final binary size so proportion, in aggregate, of object files will be smaller.