This is an archive of the discontinued LLVM Phabricator instance.

[PGO] Add profile instrumentation sampling support
Needs Review · Public

Authored by xur on Jun 28 2019, 11:47 AM.

Details

Reviewers
davidxl
Summary

PGO instrumentation can incur a huge slowdown, especially for highly threaded
programs (we have seen 100x). This patch adds profile instrumentation
sampling support.

It transforms:

Increment_Instruction;
Instructions_after;

to:

CountVar = CountVar + 1;
if (CountVar <= SampleDuration)
  Increment_Instruction;
else if (CountVar >= WholeDuration)
  CountVar = 0;
Instructions_after;

Here CountVar is a thread-local global shared by all PGO instrumentation
variables (value-instrumentation and edge instrumentation).
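The transformation above can be modeled in plain C++ as a rough sketch. This is not the code the patch emits: the function names `sampledIncrement` and `runSampled` are hypothetical, the constants simply mirror the 100:100019 rate quoted in the summary, and a single `uint64_t` stands in for a real profile counter.

```cpp
#include <cstdint>

// Assumed values, mirroring the default 100:100019 sample rate.
constexpr uint32_t SampleDuration = 100;
constexpr uint32_t WholeDuration = 100019;

// One sampling counter per thread, shared by every instrumentation site.
thread_local uint32_t CountVar = 0;

// Hypothetical model of one instrumented site: the profile counter is
// updated only during the first SampleDuration calls of each period.
inline void sampledIncrement(uint64_t &ProfCounter) {
  CountVar = CountVar + 1;
  if (CountVar <= SampleDuration)
    ++ProfCounter;            // inside the burst: Increment_Instruction
  else if (CountVar >= WholeDuration)
    CountVar = 0;             // end of the period: start a new burst
}

// Hypothetical driver: executes the site `Calls` times on this thread and
// returns how many times the profile counter was actually incremented.
uint64_t runSampled(uint64_t Calls) {
  CountVar = 0;
  uint64_t ProfCounter = 0;
  for (uint64_t I = 0; I < Calls; ++I)
    sampledIncrement(ProfCounter);
  return ProfCounter;
}
```

Under these assumptions, one full period of 100019 executions updates the profile counter 100 times, i.e. roughly 0.1% of the counter updates are performed. Because `CountVar` is thread-local, the sampling check itself does not touch memory shared between threads.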

Some statistics we collected using one of our large, highly threaded programs,
with the default sample rate of 100:100019:

  • Sampling speeds up the instrumented binary by 3.3x.
  • The overlap tool shows that the resulting profiles are close:
      Edge profile overlap: 92.771%
      IndirectCall profile overlap: 80.493%
      MemOP profile overlap: 95.114%

The FDO-optimized binary built with the sampled profile is performance neutral for the above application.

Compile time can increase due to the added control flow.

Event Timeline

xur created this revision. Jun 28 2019, 11:47 AM
Herald added a project: Restricted Project. Jun 28 2019, 11:47 AM
vsk added a subscriber: vsk. Jun 28 2019, 4:00 PM

Does the bursty style sampling provide additional benefit? A non-bursty style sampling can be cheaper:

sample_count++;
if (sample_count == SamplePeriod) {
  increment_prof_counter();
  sample_count = 0;
}
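For comparison with the bursty scheme, the non-bursty variant suggested here can be sketched in C++ as follows. The names `periodicIncrement` and `runPeriodic` and the period of 1000 are assumptions made for illustration; nothing in the thread specifies a value for `SamplePeriod`.

```cpp
#include <cstdint>

// Assumed period, chosen only for illustration.
constexpr uint32_t SamplePeriod = 1000;

// Per-thread sampling counter, analogous to CountVar in the patch summary.
thread_local uint32_t SampleCount = 0;

// Hypothetical model of one instrumented site: exactly one profile
// counter update is performed every SamplePeriod executions.
inline void periodicIncrement(uint64_t &ProfCounter) {
  ++SampleCount;
  if (SampleCount == SamplePeriod) {
    ++ProfCounter;   // the single sampled update for this period
    SampleCount = 0; // restart the period
  }
}

// Hypothetical driver: executes the site `Calls` times on this thread and
// returns how many times the profile counter was actually incremented.
uint64_t runPeriodic(uint64_t Calls) {
  SampleCount = 0;
  uint64_t ProfCounter = 0;
  for (uint64_t I = 0; I < Calls; ++I)
    periodicIncrement(ProfCounter);
  return ProfCounter;
}
```

This form needs a single comparison per execution, whereas the bursty form uses two comparisons once the burst is over, which is the sense in which it can be cheaper.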

xur added a comment. Jul 3 2019, 2:39 PM

Does the bursty style sampling provide additional benefit? A non-bursty style sampling can be cheaper:

sample_count++;
if (sample_count == SamplePeriod) {
  increment_prof_counter();
  sample_count = 0;
}

This is the only implementation I have done so far. I can implement a pure (non-bursty) sampling method and compare the performance.