Page MenuHomePhabricator

[SampleFDO] Flow Sensitive Sample FDO (FSAFDO)
Needs ReviewPublic

Authored by xur on Mar 22 2021, 3:36 PM.

Details

Summary

This patch implements Flow Sensitive Sample FDO (FSAFDO). It has the following
changes:
(1) disable current discriminator coding scheme.
(2) new hierarchical discriminator for FSAFDO
(3) FSAFDO profile loader.

For this patch, "-enable-fs-discriminator=true" turns on the new functionality.
"-enable-fs-discriminator=false" (the default) keeps current Sample FDO behavior.

This patch is not intended for check-in. I post it mainly to get the advises on to break
into smaller patches. Also because of this, I did not include the test cases.

Diff Detail

Event Timeline

xur created this revision.Mar 22 2021, 3:36 PM
xur requested review of this revision.Mar 22 2021, 3:36 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 22 2021, 3:36 PM
davidxl added inline comments.Mar 22 2021, 8:57 PM
llvm/lib/CodeGen/TargetPassConfig.cpp
1178

Is it necessary for this pass? BranchFolding does not create new clones, but merge them, so discriminator subsections can be reused (even though after the branch folding, some of the discriminator in that section gets removed)?

1478

I can see the importance of adding pass to add discriminator after MBP due to tail dup in MBP, but how important (performance wise) it is to load sample profile again for branch folding pass?

xur added inline comments.Mar 23 2021, 10:12 AM
llvm/lib/CodeGen/TargetPassConfig.cpp
1178

I had this is mainly for the tail duplication in block placement pass. Conceptually it for all the new clones in the pipeline.
I have the statistics of the counter for different rounds of discriminators. We do have plenty of this.

That said, more of the performance are from the round before block placement. If I disable this, the performance change is within the noise range (for the benchmark I used).

1478

This is from my experiments. The default is off anyway. I will probably remove this.

Regarding the staging of the patch, here is the suggestion:

  1. Replace DF with FSDescriminator.

In this patch, FSdescriminator needs to be added/inserted in the last phase (with ranges of discriminator sections reserved but not used), and there will be no new sample loading passes added. The functionality of current DF should be mostly preserved.

  1. Add other AddDiscriminator passes
  1. Add new sampleProfile loader passes one by one.

For 1), the profile reading offline tool should be able to handle two versions of formats to allow co-existence of profiles.

llvm/lib/CodeGen/TargetPassConfig.cpp
1178

Can probably make use this discriminator range for other purposes (target optimizations)

1478

If it is not important for performance, suggest making it (the sample profile loading part) off by default to avoid unnecessary compile time increase.

snehasish added inline comments.Mar 29 2021, 9:45 AM
llvm/include/llvm/Support/FSAFDODiscriminator.h
19

A few questions about the discriminator bits:

  • Depending on the transformation in the target pass the requirement of bits may be different, i.e. 5 bits for each may be too many or too few. Do you have any data to share about how many bits are used by each?
  • How do we alert authors of new target optimizations (or code refactoring) additional discriminator bits are needed to disambiguate? Would a late stage analysis only pass which enumerates different instructions with the same debug+discriminator info be useful to commit?
  • If I understand correctly, we bump the bit for each level of cloning. This seems to be a less efficient coding scheme, max 5 bits where by enumeration you could identify 31 clones? Have you considered other coding schemes?
davidxl added inline comments.Mar 29 2021, 10:38 AM
llvm/include/llvm/Support/FSAFDODiscriminator.h
19

A few questions about the discriminator bits:

  • Depending on the transformation in the target pass the requirement of bits may be different, i.e. 5 bits for each may be too many or too few. Do you have any data to share about how many bits are used by each?

I assume most of the transformations produce few clones except for unrolling (which depends on unroll factor).

  • How do we alert authors of new target optimizations (or code refactoring) additional discriminator bits are needed to disambiguate? Would a late stage analysis only pass which enumerates different instructions with the same debug+discriminator info be useful to commit?

The problem with this is that the authors won't have any means to change anything. Rong and I discussed about this. Longer term when this becomes and issue, increasing the size of the discriminator container type will be the way to go.

  • If I understand correctly, we bump the bit for each level of cloning. This seems to be a less efficient coding scheme, max 5 bits where by enumeration you could identify 31 clones? Have you considered other coding schemes?

The biggest advantage of fixed width is simplicity, I think.

snehasish added inline comments.Mar 29 2021, 11:10 AM
llvm/include/llvm/Support/FSAFDODiscriminator.h
19

The biggest advantage of fixed width is simplicity, I think.

Not sure if I made the point clearly, so repeating - I'm trying to distinguish between using 5 values (one per bit) in a pass vs 2^5-1 values which can be enumerated by 5 bits. For example, with a loop unroll factor of >5, this coding scheme will not be able to assign unique ids for all clones. Is this understanding correct?

I think redistributing the bits based on the pass transformations should be sufficient to avoid a more complex coding scheme. I agree using fixed width bits makes for a simple implementation and we should prefer this unless data shows otherwise.

davidxl added inline comments.Mar 29 2021, 11:39 AM
llvm/include/llvm/Support/FSAFDODiscriminator.h
19

Rong can answer this question more thoroughly. By just looking at line 373 of FlowSensitiveSampleProfile.cpp file, it seems it is actually sequentially increasing FSD, instead of bumping the bit?

snehasish added inline comments.Mar 30 2021, 3:15 PM
llvm/include/llvm/Support/FSAFDODiscriminator.h
19

Thanks for the pointer and Rong clarified the usage. My understanding was incorrect and the encoding space is not as limited as I thought. Since there is no measurable performance difference the current approach seems simplest.