This is an archive of the discontinued LLVM Phabricator instance.

[SampleFDO] Port MD5 name table support to extbinary format.
ClosedPublic

Authored by wmi on Mar 16 2020, 2:52 PM.

Details

Summary

Compbinary format uses MD5 to represent strings in name table. That gives smaller profile. The patch adds the support in extbinary format. It is off by default but user can choose to enable it.

Note the feature of using MD5 in name table can bring very small chance of name conflict leading to profile mismatch. Besides, profile using the feature won't have the profile remapping support.

Diff Detail

Event Timeline

wmi created this revision.Mar 16 2020, 2:52 PM
davidxl added inline comments.Mar 17 2020, 9:11 AM
llvm/include/llvm/ProfileData/SampleProfReader.h
637

What is the use of MD5Names? It seems only used in useMD5() method.

Also why does it need to be unique_ptr?

wmi marked an inline comment as done.Mar 17 2020, 9:56 AM
wmi added inline comments.
llvm/include/llvm/ProfileData/SampleProfReader.h
637

Profile only contains md5 numbers (type of uint64_t). NameTable contains StringRef to the names (All formats of profile use the same NameTable). When reading profile using md5, to populate the NameTable structure, I need to convert uint64_t number to std::string and then convert it to StringRef, so I need a buffer to save those std::string because StringRef doesn't own any string data. To be a buffer of NameTable, that is the use of MD5Names.

I make it a unique_ptr because for formats not using md5, MD5Names is useless, so I think an empty unique_ptr may have less cost than an empty std::vector in that case.

davidxl added inline comments.Mar 17 2020, 10:34 AM
llvm/include/llvm/ProfileData/SampleProfReader.h
637

The type of NameTable is vector<string> not vector<StringRef> right?

668

NameTable Decl here

llvm/lib/ProfileData/SampleProfReader.cpp
733

The string is pushed into MD5Names, and immediately copied to NameTable. Can MD5Names be skipped here?

wmi marked 3 inline comments as done.Mar 17 2020, 3:45 PM
wmi added inline comments.
llvm/include/llvm/ProfileData/SampleProfReader.h
637

There is another one defined as vector<StringRef> in the base class SampleProfileReaderBinary.

668

Yes, Compbinary format uses std::vector<std::string> type of NameTable. When we use std::vector<std::string> for NameTable, we can avoid the buffer, but we are going to add more override functions.

Currently I used the buffer in the patch, but it is confusing that we used two different ways in Compbinary format and in ExtBinary format for md5 support. So maybe using std::vector<std::string> type for NameTable is better. I will change it.

llvm/lib/ProfileData/SampleProfReader.cpp
733

Ok, I will change it.

wmi marked an inline comment as done.Mar 17 2020, 5:24 PM
wmi added inline comments.
llvm/include/llvm/ProfileData/SampleProfReader.h
668

I just tried then I realized a problem I forgot to mention: When ExtBinary format doesn't use md5, it wants NameTable to be std::vector<StringRef> type. When ExtBinary format uses md5, it wants NameTable to be std::vector<std::string>. I can create a new nametable called MD5NameTable, but it makes things more complicate.

We also don't want to use std::vector<std::string> type NameTable for both cases. That is because for ExtBinary format without using md5 the string has already existed in file buffer. If NameTable can use StringRef, it can refer to the string in file buffer directly without having to copy the string to NameTable. That will save some cost.

So looks like keeping the MD5Names buffer is a simpler way to me.

davidxl added inline comments.Mar 18 2020, 1:50 PM
llvm/include/llvm/ProfileData/SampleProfReader.h
637

Make the comment clearer that MD5Names serves as string buffer that is referenced by NameTable (vector of StringRef).

Is the lifetime of this buffer guarenteed to be longer than NameTable?

llvm/lib/ProfileData/SampleProfReader.cpp
733

Add a comment here NameTable is a vector of StringRef, and the second push_back is not a copy but a refernece. Otherwise the code looks confusing.

Great to MD5 support added to ext-bin, thanks @wmi. After this change it seems compact-bin is completely superseded by ext-bin, do you plan to deprecate/remove it?

llvm/include/llvm/ProfileData/SampleProf.h
157–191

I thought SecFlags is a set of general purpose flags that can be applied to all section, at least SecFlagCompress falls under that category. For Name vs MD5Name, to some extended it's like section with different content. So would it make sense to have SecMD5NameTable in addition to SecNameTable for SecType, instead of relying on SecFlags? All reader/writer code can still be shared between the two.

llvm/tools/llvm-profdata/llvm-profdata.cpp
477

The incompatibility between -use-md5 and format is detected and handled at runtime. Can we do the same for setUseMD5? Specifically, make setUseMD5 part of SampleProfileWriter, and error/assert in the base version of it - that would avoid the casting and simplifies a bit in couple places. (Could probably do the same for compression).

wmi marked 3 inline comments as done.Mar 20 2020, 2:33 PM

Great to MD5 support added to ext-bin, thanks @wmi. After this change it seems compact-bin is completely superseded by ext-bin, do you plan to deprecate/remove it?

Sorry for the late reply.

I plan to deprecate it, but only after people migrate to the new format. The problem is I don't know who is using the old format. I think I can leave it for longer time before deprecate it.

llvm/include/llvm/ProfileData/SampleProf.h
157–191

SecFlags was not designed to be general purpose only. I want to make it more flexible so that every section can define their own flags. I want to set aside some bits for general purpose only flags and some bits for section specific flags. For section specific flag, every section can reuse the same bit for different purpose. We may need a verification in place to prevent misuse. If it sounds good to you, I will add some comments describing it and put the verification in place.

llvm/include/llvm/ProfileData/SampleProfReader.h
637

Yes, they are both in SampleProfileReader object so they have the same lifetime.

llvm/tools/llvm-profdata/llvm-profdata.cpp
477

I was trying to find some balance here and not to put too much special things in the base class. For SampleProfileReader, if we don't add the interfaces in it, we will have many places that we need the casting, so it has a bunch of special interfaces there. For SampleProfileWriter, we only have a few places needing the special interfaces so adding the casting is still ok. I can change the code and only add casting once here.

I am just not quite sure which way is better. If you feel adding the interfaces in SampleProfileWriter is better, I will be happy to do it.

wenlei added inline comments.Mar 20 2020, 3:08 PM
llvm/include/llvm/ProfileData/SampleProf.h
157–191

Thanks for clarification - that sounds good to me, and yes some comments and verification would be useful.

llvm/tools/llvm-profdata/llvm-profdata.cpp
477

Conceptually, MD5 name doesn't have to be tied to extbinary, e.g. we could potentially have MD5 as name even for text profile. So maybe it's indeed general enough that warrant it being in base class, it's just we only support that for extbinary today. That said, I don't have a strong opinion either..

davidxl added inline comments.Mar 25 2020, 10:32 AM
llvm/include/llvm/ProfileData/SampleProf.h
157–191

Perhaps reserve a range of common flags?

wmi marked 3 inline comments as done.Mar 27 2020, 8:52 PM
wmi added inline comments.
llvm/include/llvm/ProfileData/SampleProf.h
157–191

I reserved the lower 32bits in SecHdrTableEntry::Flags for common flags and left the higher 32bits for section specific flags. Now each section can define 32 custom flags maximumly. I added the verification to make sure no custom flag will be misused on unrelevant section.

llvm/lib/ProfileData/SampleProfReader.cpp
733

Done.

llvm/tools/llvm-profdata/llvm-profdata.cpp
477

I added the interfaces in SampleProfileWriter.

wmi updated this revision to Diff 253288.Mar 27 2020, 9:05 PM

Address David and Wenlei's comments.

wenlei accepted this revision.Mar 29 2020, 10:45 PM

LGTM, thanks!

This revision is now accepted and ready to land.Mar 29 2020, 10:45 PM
davidxl accepted this revision.Mar 30 2020, 8:57 AM

lgtm

This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptMar 30 2020, 10:22 PM