This is an archive of the discontinued LLVM Phabricator instance.

[llvm-profdata] ProfileReader cleanup - preparation for MD5 refactoring
ClosedPublic

Authored by huangjd on Apr 20 2023, 6:54 PM.

Details

Summary

Cleanup profile reader classes to prepare for complex refactoring as propsed in D147740 (Use MD5 as key for profile map). Change is too complicated so I am cleaning up the reader implementation first with these goals.

  • Reduce duplicated/similar logic
  • Reduce virtual functions, changing them to non-virtual
  • Reduce unnecessry checks, indirections, and dead writes.

This is patch 1/n. This patch refactors NameTable

Explaining several decisions here

  1. useMD5() means whether names of the profiles (the ProfileMap) are represented as MD5. It is NOT whether the input profile format is MD5. This function is an interface for IPO passes to decide whether to match function names or function MD5. There are two motives here:

(a) Eventually we want to use MD5 to represent all function contexts because it is much faster to use it as a key for lookup tables (prototype implementation D147740), so in compilation mode we call setProfileUseMD5() to force use MD5. While in tools mode (llvm-profdata) we want to keep the function name info if it's in the input profile.
(b) We also propose to allow multiple name tables and profile sections in ExtBinary format, and it could consist of name tables with or without using MD5, in this case MD5 prevails and other name tables are converted to MD5.

  1. MD5 handling logic is pushed up to BinaryReader base class, because this trades a non-devirtualized virtual function call with a predictable branch. ReadStringFromTable() accounts for >5% time when loading a full 1 GB profile, it should not be virtual.

Diff Detail

Event Timeline

huangjd created this revision.Apr 20 2023, 6:54 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 20 2023, 6:54 PM
huangjd requested review of this revision.Apr 20 2023, 6:54 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 20 2023, 6:54 PM
davidxl added inline comments.Apr 24 2023, 10:49 AM
llvm/include/llvm/ProfileData/SampleProfReader.h
500

Is there a need to make it virtual? Even though text format does not support it, it is harmless to keep the method (it does nothing).

590

no need for this override.

llvm/lib/ProfileData/SampleProfReader.cpp
538

Can you explain the cleanup done here?

huangjd marked an inline comment as done.Apr 25 2023, 12:16 PM
huangjd added inline comments.
llvm/include/llvm/ProfileData/SampleProfReader.h
500

Yes this is needed (for now, until MD5 refactoring patch is in place). If this is set, the context uses MD5 as function name, and for the case with Binary format or ExtBinary with function names, these names are converted to MD5 upon read. ProfileIsMD5 is referred by FunctionSamples::UseMD5, which is referred by IPO passes for profile matching. The transform pass needs to know if the profiles are Actually in MD5 because it behaves differently.

llvm/lib/ProfileData/SampleProfReader.cpp
538

This corresponds to

Data = MD5NameMemStart + ((*Idx) * sizeof(uint64_t));
End = reinterpret_cast<const uint8_t *>(
        std::numeric_limits<uintptr_t>::max());
auto FID = readUnencodedNumber<uint64_t>();
if (std::error_code EC = FID.getError())
  return EC;

However the check is not necessary because when reading the name table we already ensured the name table contains correct number of entries, and readStringIndex in this function ensures the index is in range, so the fixed length MD5 can be directly accessed with an index into the base address

huangjd updated this revision to Diff 516925.Apr 25 2023, 2:39 PM

Check endianess when loading fixed length MD5

huangjd added inline comments.Apr 25 2023, 2:41 PM
llvm/lib/ProfileData/SampleProfReader.cpp
538

Modified code to use endian::read because it is necessary for correctness on big endian platform

huangjd updated this revision to Diff 516927.Apr 25 2023, 2:45 PM

Remove unnecessary cast

wenlei added subscribers: wlei, hoy.Apr 25 2023, 10:17 PM

Hoisting all the MD5 stuff out of compact binary format, and merge them with extended binary format makes sense. Initially compact binary format was the only one using MD5, but later on Wei added MD5 support in extended binary as well, which caused some of the duplication this patch is trying to address.

However, from design POV, it's a good practice to keep the specific implementation down to leaf type as much as possible. Here MD5 only comes into play for extended binary and compact binary, so exposing that to all binary format is less than ideal. I think the problem is that compact binary and extended binary started out to be very different, but involved into something similar, so the type structure no longer reflects that commonality. Now in order to deduplicate code, you have to expose MD5 stuff to their lowest common ancestor, which is binary format.

Actually in https://reviews.llvm.org/D76255, we talked about eventually remove compact binary. Can we simply remove compact binary now? Then we can keep MD5 stuff in extended binary still, and there will be no duplicates.

cc @hoy @wlei

llvm/include/llvm/ProfileData/SampleProfReader.h
571

typo: Conexts->Contexts, for->or

huangjd added a comment.EditedApr 26 2023, 1:38 PM

Hoisting all the MD5 stuff out of compact binary format, and merge them with extended binary format makes sense. Initially compact binary format was the only one using MD5, but later on Wei added MD5 support in extended binary as well, which caused some of the duplication this patch is trying to address.

However, from design POV, it's a good practice to keep the specific implementation down to leaf type as much as possible. Here MD5 only comes into play for extended binary and compact binary, so exposing that to all binary format is less than ideal. I think the problem is that compact binary and extended binary started out to be very different, but involved into something similar, so the type structure no longer reflects that commonality. Now in order to deduplicate code, you have to expose MD5 stuff to their lowest common ancestor, which is binary format.

Actually in https://reviews.llvm.org/D76255, we talked about eventually remove compact binary. Can we simply remove compact binary now? Then we can keep MD5 stuff in extended binary still, and there will be no duplicates.

cc @hoy @wlei

I would like to limit the scope of the refactoring not to include another major change, since other reviewers would like to go through small patches. The purpose of my series of refactoring is to change the data representation of profile map, using MD5 as the key, which could bring significant speedup to profile load time. Reducing overloaded functions with similar logic make the code less error prone when changing the data representation.

TL;DR Eventually, based on external settings (whether we are using ProfileData in compiler or tools), we want the ability to choose whether to store names as MD5, regardless if the actual name table from the file is using Strings, MD5, or fixed length MD5, so the logic should be unified.

huangjd updated this revision to Diff 517307.Apr 26 2023, 1:44 PM
huangjd marked an inline comment as done.

typo

llvm/include/llvm/ProfileData/SampleProfReader.h
571

fixed

Hoisting all the MD5 stuff out of compact binary format, and merge them with extended binary format makes sense. Initially compact binary format was the only one using MD5, but later on Wei added MD5 support in extended binary as well, which caused some of the duplication this patch is trying to address.

However, from design POV, it's a good practice to keep the specific implementation down to leaf type as much as possible. Here MD5 only comes into play for extended binary and compact binary, so exposing that to all binary format is less than ideal. I think the problem is that compact binary and extended binary started out to be very different, but involved into something similar, so the type structure no longer reflects that commonality. Now in order to deduplicate code, you have to expose MD5 stuff to their lowest common ancestor, which is binary format.

Actually in https://reviews.llvm.org/D76255, we talked about eventually remove compact binary. Can we simply remove compact binary now? Then we can keep MD5 stuff in extended binary still, and there will be no duplicates.

cc @hoy @wlei

I would like to limit the scope of the refactoring not to include another major change, since other reviewers would like to go through small patches. The purpose of my series of refactoring is to change the data representation of profile map, using MD5 as the key, which could bring significant speedup to profile load time. Reducing overloaded functions with similar logic make the code less error prone when changing the data representation.

Sure, limit changes to NFC only is good. But the problem is you are exposing MD5 stuff higher up in the type hierarchy to types that shouldn't need to be aware of MD5 details - this created a weird structure. The real problem is the out of sync type structure between compact binary and extended binary. I think that removing compact binary will help this refactoring and also avoid weird structure.

In terms of change size, you can split up a change just to remove compact binary support, then go back to the refactoring. If the goal is to clean things up, it's better to remove obsolete stuff first.

TL;DR Eventually, based on external settings (whether we are using ProfileData in compiler or tools), we want the ability to choose whether to store names as MD5, regardless if the actual name table from the file is using Strings, MD5, or fixed length MD5, so the logic should be unified.

Yeah, and compact binary doesn't have that flexibility. Compact binary become deprecated/obsolete the moment we introduce MD5 to extended binary. We wanted wait for some time before removing it. Now 3 years went by and it's time, and it sort of is getting in the way of this refactoring.

Hoisting all the MD5 stuff out of compact binary format, and merge them with extended binary format makes sense. Initially compact binary format was the only one using MD5, but later on Wei added MD5 support in extended binary as well, which caused some of the duplication this patch is trying to address.

However, from design POV, it's a good practice to keep the specific implementation down to leaf type as much as possible. Here MD5 only comes into play for extended binary and compact binary, so exposing that to all binary format is less than ideal. I think the problem is that compact binary and extended binary started out to be very different, but involved into something similar, so the type structure no longer reflects that commonality. Now in order to deduplicate code, you have to expose MD5 stuff to their lowest common ancestor, which is binary format.

Actually in https://reviews.llvm.org/D76255, we talked about eventually remove compact binary. Can we simply remove compact binary now? Then we can keep MD5 stuff in extended binary still, and there will be no duplicates.

cc @hoy @wlei

I would like to limit the scope of the refactoring not to include another major change, since other reviewers would like to go through small patches. The purpose of my series of refactoring is to change the data representation of profile map, using MD5 as the key, which could bring significant speedup to profile load time. Reducing overloaded functions with similar logic make the code less error prone when changing the data representation.

Sure, limit changes to NFC only is good. But the problem is you are exposing MD5 stuff higher up in the type hierarchy to types that shouldn't need to be aware of MD5 details - this created a weird structure. The real problem is the out of sync type structure between compact binary and extended binary. I think that removing compact binary will help this refactoring and also avoid weird structure.

In terms of change size, you can split up a change just to remove compact binary support, then go back to the refactoring. If the goal is to clean things up, it's better to remove obsolete stuff first.

TL;DR Eventually, based on external settings (whether we are using ProfileData in compiler or tools), we want the ability to choose whether to store names as MD5, regardless if the actual name table from the file is using Strings, MD5, or fixed length MD5, so the logic should be unified.

Yeah, and compact binary doesn't have that flexibility. Compact binary become deprecated/obsolete the moment we introduce MD5 to extended binary. We wanted wait for some time before removing it. Now 3 years went by and it's time, and it sort of is getting in the way of this refactoring.

Deprecating Compact format seems like a reasonable preparation step for these set of patches (if it does not add too much complexity). My understanding is that it will even simplify the task?

The deprecation itself also helps cleanup LLVM code base. Also users depending on compact format have a migration path (via llvm-profdata) I assume.

Pushing the MD5 name handling logic up to the base class trades a virtual function call (doesn't seem like being de-virtualized) with a branch, and does lead to a very slight speed improvement on our benchmarking suite.

huangjd updated this revision to Diff 518929.May 2 2023, 6:39 PM

Rebase/updated code after removing compact binary

huangjd edited the summary of this revision. (Show Details)May 2 2023, 6:59 PM
davidxl added inline comments.May 3 2023, 11:14 AM
llvm/lib/ProfileData/SampleProfReader.cpp
539

Add a comment here referencing readUnencodedNumber and mention bounds check is not needed.

717

Is this flag still used or can be asserted to be true?

huangjd updated this revision to Diff 519286.May 3 2023, 3:38 PM
huangjd edited the summary of this revision. (Show Details)

update comment on ReadStringFromTable

huangjd added inline comments.May 3 2023, 3:44 PM
llvm/lib/ProfileData/SampleProfReader.cpp
717

It cannot be assumed true. User can specify 3 modes normally:

  1. function name strings : ~SecFlagMD5Name && ~SecFlagFixedLengthMD5
  2. ULEB128 MD5 : SecFlagMD5Name && ~SecFlagFixedLengthMD5
  3. Fixed length MD5 : SecFlagMD5Name && SecFlagFixedLengthMD5

A malformed profile can specify ~SecFlagMD5Name && SecFlagFixedLengthMD5 and crash on the third added test case. I fixed the logic so that LLVM treats this case same as (3).

davidxl added inline comments.May 3 2023, 3:53 PM
llvm/lib/ProfileData/SampleProfReader.cpp
717

is 3 the common and the most efficient one? What is the default with extbinary format? If 3 is the default, we should consider deprecate settings.

huangjd added inline comments.May 3 2023, 5:36 PM
llvm/lib/ProfileData/SampleProfReader.cpp
717

It is not default. The default is Binary format, which I propose to switch to ExtBinary first.

huangjd updated this revision to Diff 519349.May 3 2023, 9:00 PM
  • Added const qualifier to useMD5()
davidxl added inline comments.May 3 2023, 10:27 PM
llvm/lib/ProfileData/SampleProfReader.cpp
717

Looking forward to the followup cleanup of binary format and related code.

1038

Add assert(IsMD5).

1080

add assert (!FixedLengthMD5);

huangjd updated this revision to Diff 519601.May 4 2023, 12:24 PM
  • Add assert
huangjd marked 3 inline comments as done.May 4 2023, 12:26 PM
huangjd added inline comments.
llvm/lib/ProfileData/SampleProfReader.cpp
1038

Added

1080

This is not needed because the previous if(FixedLengthMD5) always returns, so this assert is always true.

davidxl added inline comments.May 4 2023, 12:31 PM
llvm/lib/ProfileData/SampleProfReader.cpp
1080

This is not needed because the previous if(FixedLengthMD5) always returns, so this assert is always true.

right it is redundant with the current control flow. It is probably still worth adding one to prevent problems in the future if the code structure changes.

huangjd updated this revision to Diff 519640.May 4 2023, 1:50 PM
huangjd marked 2 inline comments as done.
  • Add assert
huangjd marked an inline comment as done.May 4 2023, 1:51 PM
davidxl accepted this revision.May 4 2023, 1:58 PM

lgtm. Perhaps also add hoy@ or wenlei@ or wlei@ for a quick look.

This revision is now accepted and ready to land.May 4 2023, 1:58 PM
huangjd updated this revision to Diff 519688.May 4 2023, 4:08 PM
  • Relaxed assert into a warning instead. FixedLengthMD5 & ~UseMD5 should
This revision was landed with ongoing or failed builds.May 5 2023, 5:21 PM
This revision was automatically updated to reflect the committed changes.