This is an archive of the discontinued LLVM Phabricator instance.

[SampleFDO] Store fixed length MD5 in NameTable instead of using ULEB128 if MD5 is used.
ClosedPublic

Authored by wmi on Dec 3 2020, 5:17 PM.

Details

Summary

Currently during sample profile loading, the NameTable has to be loaded entirely up front before any name string can be retrieved. That is because the NameTable is stored using ULEB128 encoding and cannot be accessed directly like an array. However, if MD5 is used to represent names in the NameTable, each entry has a fixed length. If MD5 names are stored as uint64_t instead of ULEB128, the NameTable can be accessed like an array, and in many cases only part of the NameTable has to be read. This is helpful for reducing compile time, especially when a small source file is compiled.
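
As a rough illustration of the random-access point above (a standalone sketch, not the actual SampleProfReader code; the function names are made up): with ULEB128, reaching entry Idx requires decoding every entry before it, whereas fixed 8-byte MD5 entries can be read directly at a known offset.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// ULEB128-encoded table: entry Idx can only be reached by decoding all
// preceding entries, so the whole table is effectively read up front.
uint64_t readNameULEB128(const uint8_t *Table, size_t Idx) {
  const uint8_t *P = Table;
  uint64_t Value = 0;
  for (size_t I = 0; I <= Idx; ++I) {
    Value = 0;
    unsigned Shift = 0;
    uint8_t Byte;
    do {
      Byte = *P++;
      Value |= uint64_t(Byte & 0x7f) << Shift;
      Shift += 7;
    } while (Byte & 0x80);
  }
  return Value; // MD5 of the Idx-th name
}

// Fixed-length table: entry Idx lives at a known offset, so only the
// requested entry needs to be touched.
uint64_t readNameFixedMD5(const uint8_t *Table, size_t Idx) {
  uint64_t Value;
  std::memcpy(&Value, Table + Idx * sizeof(uint64_t), sizeof(Value));
  return Value; // assumes the on-disk byte order matches the host
}
```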

We find that after this change, the elapsed time to build a large application distributively is reduced by 5%, and the accumulated CPU time used for the build is also reduced by 5%. The size of the profile is slightly reduced by ~0.2%, which also indicates that encoding MD5 in ULEB128 doesn't save storage space.

Diff Detail

Repository
rL LLVM

Event Timeline

wmi created this revision.Dec 3 2020, 5:17 PM
wmi requested review of this revision.Dec 3 2020, 5:17 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 3 2020, 5:17 PM
wmi retitled this revision from [SampleFDO] [SampleFDO] Store fixed length MD5 in NameTable instead of using ULEB128 if MD5 is used. to [SampleFDO] Store fixed length MD5 in NameTable instead of using ULEB128 if MD5 is used..Dec 3 2020, 5:17 PM
davidxl added inline comments.Dec 3 2020, 7:38 PM
llvm/lib/ProfileData/SampleProfReader.cpp
794

Does FixLengthMD5 imply IsMD5, so IsMD5 is enough?

hoy added a subscriber: hoy.Dec 3 2020, 8:57 PM
hoy added inline comments.
llvm/include/llvm/ProfileData/SampleProf.h
171

Nit: SecFlagFixedLengthMD5?

llvm/lib/ProfileData/SampleProfWriter.cpp
177

I guess the writer does this unconditionally because we no longer want the variable-length encoding. The reader supports both for backward compatibility?

wmi added inline comments.Dec 3 2020, 10:12 PM
llvm/include/llvm/ProfileData/SampleProf.h
171

Will change it.

llvm/lib/ProfileData/SampleProfReader.cpp
794

That is right. Will fix it.

llvm/lib/ProfileData/SampleProfWriter.cpp
177

Yes, you are right.
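
A minimal, self-contained sketch of the compatibility scheme being discussed (the flag and function names are illustrative, not the exact ones in the patch): the writer always emits the fixed-length form, while the reader keys off a per-section flag so older ULEB128-encoded profiles still load.

```cpp
#include <cstdint>

enum class NameTableFlags : uint64_t {
  MD5Name = 1 << 0,        // names are MD5 hashes, ULEB128-encoded
  FixedLengthMD5 = 1 << 1, // names are MD5 hashes stored as raw uint64_t
};

inline bool hasFlag(uint64_t SectionFlags, NameTableFlags F) {
  return (SectionFlags & static_cast<uint64_t>(F)) != 0;
}

void readNameTable(uint64_t SectionFlags) {
  if (hasFlag(SectionFlags, NameTableFlags::FixedLengthMD5)) {
    // New profiles: remember the section start and decode entries lazily.
  } else if (hasFlag(SectionFlags, NameTableFlags::MD5Name)) {
    // Older MD5 profiles: decode every ULEB128 entry up front.
  } else {
    // Plain string name table.
  }
}
```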

wmi updated this revision to Diff 309633.Dec 4 2020, 1:16 PM

Address David and Hongtao's comments.

hoy added inline comments.Dec 4 2020, 3:00 PM
llvm/lib/ProfileData/SampleProfReader.cpp
372

Can this be just readStringIndex(NameTable) since NameTable is preallocated?

wmi added inline comments.Dec 4 2020, 4:28 PM
llvm/lib/ProfileData/SampleProfReader.cpp
372

Good point. That makes the code simpler.

wmi updated this revision to Diff 309718.Dec 4 2020, 9:55 PM

Address Hongtao's comment.

hoy accepted this revision.Dec 4 2020, 11:47 PM
hoy added inline comments.
llvm/lib/ProfileData/SampleProfReader.cpp
729

Nit: this might still be helpful to the non-fixed MD5 path.

778

Nit: should NameTable always be empty right before here? Would an assert be useful?

This revision is now accepted and ready to land.Dec 4 2020, 11:47 PM
hoy removed a reviewer: hoyFB.Dec 4 2020, 11:48 PM
wenlei added a comment.Dec 6 2020, 6:58 AM

> We find that after this change, the elapsed time to build a large application distributively is reduced by 5%, and the accumulated CPU time used for the build is also reduced by 5%.

This is great, thanks for the patch! Just to clarify, did you mean 5% of the end-to-end ThinLTO+AutoFDO build? I'm curious what the total profile reading time you saw was in terms of %, and how different that % is between the md5 profile and the non-md5 profile. Currently we mostly use non-md5 profiles, assuming the md5 profile is more about saving profile size.

llvm/lib/ProfileData/SampleProfReader.cpp
372

Actually it looks like both the fixed path and the ULEB128 path need to access NameTable[*Idx] in addition to readStringIndex(NameTable)? What about just calling SampleProfileReaderBinary::readStringFromTable for both, and diverging only after that - return for the ULEB128 path, check for error or an empty StringRef for the fixed path?

380

If we return an error code before reaching this point, the Data pointer will not be restored - is that ok? What about using RAII to make sure the restoration always happens?
I also noticed this is not the only place where we save/restore the Data pointer, but the other instances are not part of this change.
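
A minimal sketch of the RAII save/restore idea being suggested here (the guard class name and the usage comments are illustrative, not part of the patch; it is similar in spirit to llvm::SaveAndRestore):

```cpp
#include <cstdint>

// Saves the current position of a data pointer and restores it on scope
// exit, including early returns on error.
class DataPointerSaver {
  const uint8_t *&DataRef;
  const uint8_t *Saved;

public:
  DataPointerSaver(const uint8_t *&Data) : DataRef(Data), Saved(Data) {}
  ~DataPointerSaver() { DataRef = Saved; }
};

// Usage inside a reader method that temporarily jumps elsewhere:
//   DataPointerSaver Guard(Data);
//   Data = SomeOtherOffset;
//   if (std::error_code EC = readSomething())
//     return EC; // Data is still restored by ~DataPointerSaver
```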

wmi added a comment.Dec 7 2020, 11:12 AM

>> We find that after this change, the elapsed time to build a large application distributively is reduced by 5%, and the accumulated CPU time used for the build is also reduced by 5%.
>
> This is great, thanks for the patch! Just to clarify, did you mean 5% of the end-to-end ThinLTO+AutoFDO build? I'm curious what the total profile reading time you saw was in terms of %, and how different that % is between the md5 profile and the non-md5 profile. Currently we mostly use non-md5 profiles, assuming the md5 profile is more about saving profile size.

Yes, it is a 5% saving in end-to-end ThinLTO + AutoFDO build time.

Each module's compile spends a different % of its time on profile reading. The percentage is negligible for a large source module but can be significant for a small source module. I tried it on a very small file.

The build time using md5 profile was 0.55 second. 
The build time using non-md5 profile was 0.75 second. 
The build time using fixlenmd5 profile was 0.15 second. 
The build time not using any profile was 0.1 second.

Because for a large project many source files are small or medium sized, the build time for small files matters, especially for build CPU resources. It also matters for end-to-end build time, but usually not as significantly. For the experiment I did (with a 20M binary profile), the end-to-end build time saving was about the same as the aggregate build CPU saving (both were 5%), but from our experience with larger profiles (like a 300M profile), it affects aggregate build CPU resources more than end-to-end build time.

llvm/lib/ProfileData/SampleProfReader.cpp
372

For the fixed-length MD5 path, we want to get a reference to NameTable[*Idx] and update it after we read the real name from memory. This is different from the ULEB128 path, where the NameTable has been populated up front and readStringIndex only returns a copy of the StringRef.
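
A rough sketch of the lazy-fill behavior described above (the types and member names are simplified stand-ins, not the actual SampleProfReader members):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

struct LazyMD5NameTable {
  const uint8_t *MD5NameMemStart = nullptr; // start of fixed-length MD5 area
  std::vector<std::string> NameTable;       // preallocated, filled on demand

  // Returns a reference so the decoded name is cached for later lookups.
  const std::string &getName(size_t Idx) {
    std::string &Entry = NameTable[Idx];
    if (Entry.empty()) {
      uint64_t MD5;
      std::memcpy(&MD5, MD5NameMemStart + Idx * sizeof(uint64_t), sizeof(MD5));
      Entry = std::to_string(MD5); // MD5 names are kept as decimal strings
    }
    return Entry;
  }
};
```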

380

When an error happens, the reader/writer won't proceed and will return to the caller of the read/write API, so the inconsistent Data pointer doesn't matter. The reason we don't want to issue a fatal error is that the caller may want to skip processing the problematic profile and proceed to the next one.

729

Good catch.

778

Currently NameTable is always empty right before here, but it is possible to have multiple NameTable sections. I have a follow-up NFC to support that.

wenlei accepted this revision.Dec 7 2020, 8:54 PM
In D92621#2437658, @wmi wrote:

>>> We find that after this change, the elapsed time to build a large application distributively is reduced by 5%, and the accumulated CPU time used for the build is also reduced by 5%.
>>
>> This is great, thanks for the patch! Just to clarify, did you mean 5% of the end-to-end ThinLTO+AutoFDO build? I'm curious what the total profile reading time you saw was in terms of %, and how different that % is between the md5 profile and the non-md5 profile. Currently we mostly use non-md5 profiles, assuming the md5 profile is more about saving profile size.
>
> Yes, it is a 5% saving in end-to-end ThinLTO + AutoFDO build time.
>
> Each module's compile spends a different % of its time on profile reading. The percentage is negligible for a large source module but can be significant for a small source module. I tried it on a very small file.
>
> The build time using md5 profile was 0.55 second.
> The build time using non-md5 profile was 0.75 second.
> The build time using fixlenmd5 profile was 0.15 second.
> The build time not using any profile was 0.1 second.
>
> Because for a large project many source files are small or medium sized, the build time for small files matters, especially for build CPU resources. It also matters for end-to-end build time, but usually not as significantly. For the experiment I did (with a 20M binary profile), the end-to-end build time saving was about the same as the aggregate build CPU saving (both were 5%), but from our experience with larger profiles (like a 300M profile), it affects aggregate build CPU resources more than end-to-end build time.

Thanks for sharing those numbers - it's good evidence that name strings in the profile can have a non-trivial build-time cost. I think CSSPGO may need to mitigate the overhead introduced by long names - we will get to that.