This is an archive of the discontinued LLVM Phabricator instance.

[FSAFDO] Improve FS discriminator encoding
ClosedPublic

Authored by xur on Mar 2 2023, 10:11 AM.

Details

Summary

This change improves FS discriminators in the following ways:
(1) use call-stack debug information in the discriminator hash:
the same (src/line) DILs can now have different hash if they
come from different call-stacks. This reduces the hash conflicts.
(2) don't generate the FS discriminator for meta instructions
(i.e. instructions not emitted). This reduces the number
discriminators conflicts (for the case we run out of discriminator
bits for that pass).
(3) use less expensive hashing of xxHash64.

These improvements should bring better performance for FSAFDO
and they should be used by default. But this change creates
incompatible FS discriminators. For the iterative profile users,
they might see an performance drop in the first release with
this change (due to the fact that the profiles have the old
discriminators and the compiler uses the new discriminator).
We have measured that this is not more than 1.5% on several
benchmarks. Note the degradation should be gone in the second
release and one should expect a performance gain over the binary
without this change.

One possible solution to the iterative profile issue would be
separating discriminators for profile-use and the ones emitted to
the binary. This would require a mechanism to allow two sets of
discriminators to be maintained and then phasing out the first
approach. This is too much churn in the compiler and the
performance implications do not seem to be worth the effort.

Instead, we put the changes under an option so iterative profile
users can do a gradual rollout of this change. I will make the
option default value to true in a later patch and eventually
purge this option from the code base.

Diff Detail

Event Timeline

xur created this revision.Mar 2 2023, 10:11 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 2 2023, 10:11 AM
xur requested review of this revision.Mar 2 2023, 10:11 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 2 2023, 10:11 AM
hoy added a comment.Mar 2 2023, 11:49 AM

Thanks for the change! I'm going to start the integration with pseudo probe from here.

llvm/lib/CodeGen/MIRFSDiscriminator.cpp
65

BB could have different names between the release and the debug compiler, or using or not using -fdiscard-value-names. This may lead to non-determinism in computing FS discriminators. Maybe consider using an integer id for BB? It doesn't exist today though.

136

Should zero line number also get a discriminator? This may result in invalid line offsets with which all such instructions will end up sharing the same sample.

142

Wondering if DIL->getFilename() is still needed since CallStackHashVal includes the caller linkage names which should help avoid hash conflicts.

153

Wondering if BBSizeHash is stable enough run-to-run. If consecutive builds have slight change in block size, their discriminators may not match?

Including callsite hash LocationDiscriminator sounds a great improvement. I'm wondering how helpful it is to include BBSizeHash in the discriminator encoding. Have you evaluated the two changes separately?

xur added inline comments.Mar 2 2023, 4:12 PM
llvm/lib/CodeGen/MIRFSDiscriminator.cpp
65

You are exactly right! Note this is old version of Hash that will be deprecated. The new version will not use the name anymore.

136

Zero line numbers do not get a discriminator. The number of zero line number instructions is actually pretty big. They can easily overflow the discriminators. We have another changes to deal with zero line number and instructions with empty DIL. They are still ongoing.

142

CallStackHashValue is for DIL with getInlinedAt(). If this is not inlined instrution. CallStackHashValue returns 0.

153

We have tried a few other ways for the Hash. It seems that it's better to be more strict on the match -- if it's mistach, we will not use the pass specific counters, but we still use the average count (branch portability) from previous rounds. This seems to be better than applying wrong BB weights.

I don't have the data for with and without BBSizeHash. I can a quick run to see if it has performance impact.

hoy added inline comments.Mar 2 2023, 4:46 PM
llvm/include/llvm/Transforms/Utils/SampleProfileLoaderBaseImpl.h
310 ↗(On Diff #501911)

How about making this function virtual and override it in MIR sample loader? Something like

ErrorOr<uint64_t> getInstWeightImpl(const MachineInstr &Inst) {
  if (ImprovedFSDiscriminator && Inst.isMetaInstruction())
    return std::error_code();
  return SampleProfileLoaderBaseImpl<MachineBasicBlock>::getInstWeightImpl(
      Inst);
}
llvm/lib/CodeGen/MIRFSDiscriminator.cpp
65

I see. It's only used in the old version.

83

getLinkageName may return an empty string C functions. We often use a trick to work it around:

// Use linkage name for C++ if possible.
auto Name = SP->getLinkageName();
if (Name.empty())
  Name = SP->getName();

Actually there is a similar function getCallStackHash in SampleProfileProbe.cpp. Do you think it's possible to unify them and place it in a command file like DebubInfo.h?

113

Is this fixing a bug of the old encoding?

BTW, should we assert the current pass is always not the base pass or LowBit is never zero?

136

Good to know you are working on a fix.

142

Thanks for pointing it out.

hoy added inline comments.Mar 2 2023, 5:14 PM
llvm/include/llvm/Transforms/Utils/SampleProfileLoaderBaseImpl.h
310 ↗(On Diff #501911)

Or maybe just override the existing virtual function getInstWeight, just like SampleProfileLoader::getInstWeight

xur added a comment.Mar 6 2023, 10:44 AM

Thanks Hongtao for the reviews and comments. I'll update the patch shortly.

llvm/include/llvm/Transforms/Utils/SampleProfileLoaderBaseImpl.h
310 ↗(On Diff #501911)

I can do that. This way we don't need shouldIgnoreInst() function.

I have this shouldIgnoreInst() because there are types of instructions in SamplePorfiles try to ignore too. I want to have a unify interface for that. But this seems to be an overkill.

llvm/lib/CodeGen/MIRFSDiscriminator.cpp
83

Good to know this. Have see seen cases that empty string leads to hash conflicts?
Will getName() also return empty string?

We can add another interface in DISubprogram? in DebuginfoMetadata.h?

113

This is a bug fix. I notice that issue for a while but I did fix it because it changes of some discriminators.
I will add an assert here.

hoy added inline comments.Mar 6 2023, 11:28 AM
llvm/include/llvm/Transforms/Utils/SampleProfileLoaderBaseImpl.h
310 ↗(On Diff #501911)

Thanks.

llvm/lib/CodeGen/MIRFSDiscriminator.cpp
83

I haven't taken a deep look into whether empty strings can lead to hash conflicts for this case actually.

getName() will always return the original demangled function name, and for C, it is just the linkage name.

A new interface in DebuginfoMetadata.h sounds good. Perhaps a new member function for DILocation?

153

This seems to be better than applying wrong BB weights.

This makes sense.

I don't have the data for with and without BBSizeHash. I can a quick run to see if it has performance impact.

Yeah, curious to see how impactful it is. Thanks.

llvm/test/CodeGen/X86/fsafdo_test1.ll
4–5

I have a question about keeping the original discriminator, i.e, 2 here. IIUC, the MIR sample loader will skip loading samples for the instruction. Do you think it should get a new discriminator so that it can use pass specific counters too? Let me know if I miss anything.

xur added inline comments.Mar 6 2023, 1:55 PM
llvm/test/CodeGen/X86/fsafdo_test1.ll
4–5

No. It will be loaded in MIR samples profile. 2 will be bit masked and the counter will be contributed to version 0 (i.e. discriminator value of 0).

xur updated this revision to Diff 502867.Mar 6 2023, 5:14 PM

Integrated Hongtao's review suggestion.

This reduces the hash conflicts.

Curious how did you check/detect conflicts/collisions?

one should expect a performance gain over the binary without this change.

How big is the gain you saw?

llvm/lib/CodeGen/MIRFSDiscriminator.cpp
68

same here, converting to string first seem unnecessary.

78

MD5 as cryptographic hash is expensive, xxHash64 should be good enough in terms of distribution and collision avoidance. Now that we're changing discriminator algorithm, wondering if we should take the opportunity to move to xxhash for the new version. (For compilation with compact-binary profile where MD5 is used a lot, MD5 shows up quite hot in perf profile)

82

If collision is a concern and we're working to reduce collision, ^= is a weak blend/combine function (symmetric among other problems.)

Something like this below is a safer combine function. We have a number of similar hashCombine in LLVM code base.

inline int64_t hashCombine(const int64_t Seed, const int64_t Val) {
  std::hash<int64_t> Hasher;
  return Seed ^ (Hasher(Val) + 0x9e3779b9 + (Seed << 6) + (Seed >> 2));
}
129

converting int to string just to get an hash seems like an overkill. xxHash64(ArrayRef<uint8_t> Data) should be fast and good enough.

hoy added inline comments.Mar 7 2023, 9:10 AM
llvm/include/llvm/IR/DebugInfoMetadata.h
1924

nit: name it getSubprogramLinkageName?

llvm/lib/CodeGen/AsmPrinter/PseudoProbePrinter.cpp
35

Thanks for fixing this!

xur added a comment.Mar 8 2023, 1:44 PM

This reduces the hash conflicts.

Curious how did you check/detect conflicts/collisions?

Most from eyeballing the change -- I manually look at some functions.
The other indicator was the number of discriminators created in some lines in templates header. For some line, we overflow the bits.

one should expect a performance gain over the binary without this change.

How big is the gain you saw?

We measured this for a variety of programs -- we are seeing improvement range from 0.7% to 2.0% on top of current FSAFDO.
The gain usually is higher for iterative profiles. (This is with the BBSize in the hash).

llvm/include/llvm/IR/DebugInfoMetadata.h
1924

This is a better name. I will change to this.

llvm/lib/CodeGen/MIRFSDiscriminator.cpp
68

Will switch to xxHash64.

78

This is a good idea. I think this is a good opportunity to switch to less expensive hash algorithm.

129

Got it. I will probably remove BBSize from the hash.

xur added a comment.Mar 8 2023, 1:47 PM

The performance for removing BBSize from discriminator hash is out: it shows a slightly gain on performance, like ~0.2% vs with BBsize in the hash. I will remove BBSize from the hash.

xur updated this revision to Diff 503527.Mar 8 2023, 2:43 PM

Integrated review comments/suggestions from Wenlei and Hongtao:
(1) remove BBSize from hash
(2) use less expensive hash function
(3) better names for getSubprogramLinkageName()

hoy accepted this revision.Mar 8 2023, 3:58 PM

The performance for removing BBSize from discriminator hash is out: it shows a slightly gain on performance, like ~0.2% vs with BBsize in the hash. I will remove BBSize from the hash.

Good to know removing it helps perf. Can you please update the summary as well? LGTM otherwise.

llvm/lib/CodeGen/MIRFSDiscriminator.cpp
153

The callsite hash is being removed from the discriminator for V1. I guess the discriminator conflicts will be way less than V0?

This revision is now accepted and ready to land.Mar 8 2023, 3:58 PM
wenlei accepted this revision.Mar 8 2023, 6:05 PM

This reduces the hash conflicts.

Curious how did you check/detect conflicts/collisions?

Most from eyeballing the change -- I manually look at some functions.
The other indicator was the number of discriminators created in some lines in templates header. For some line, we overflow the bits.

I'm wondering if we can have a way to systemically detect and report such cases -- that should make us aware when this is happening which can throttle perf.

one should expect a performance gain over the binary without this change.

How big is the gain you saw?

We measured this for a variety of programs -- we are seeing improvement range from 0.7% to 2.0% on top of current FSAFDO.
The gain usually is higher for iterative profiles. (This is with the BBSize in the hash).

That's quite promising. Just to double check, 0.7%-2.0% was the additional improvement from this patch alone, right? What's the total perf improvement from FSAFDO after this change that you saw?
We will measure this internally too.

The change looks good, thanks!

xur added a comment.Mar 9 2023, 10:06 AM

The performance for removing BBSize from discriminator hash is out: it shows a slightly gain on performance, like ~0.2% vs with BBsize in the hash. I will remove BBSize from the hash.

Good to know removing it helps perf. Can you please update the summary as well? LGTM otherwise.

Definitely. I will update the summary and the check-in message.

This reduces the hash conflicts.

Curious how did you check/detect conflicts/collisions?

Most from eyeballing the change -- I manually look at some functions.
The other indicator was the number of discriminators created in some lines in templates header. For some line, we overflow the bits.

I'm wondering if we can have a way to systemically detect and report such cases -- that should make us aware when this is happening which can throttle perf.

I have another follow-up patch that will track and report the match. It does not track the conflict. But let me think about it to add this support.

one should expect a performance gain over the binary without this change.

How big is the gain you saw?

We measured this for a variety of programs -- we are seeing improvement range from 0.7% to 2.0% on top of current FSAFDO.
The gain usually is higher for iterative profiles. (This is with the BBSize in the hash).

That's quite promising. Just to double check, 0.7%-2.0% was the additional improvement from this patch alone, right? What's the total perf improvement from FSAFDO after this change that you saw?
We will measure this internally too.

The total improvement really depends on programs. On average, I would say 1.5% to 2.0%. For some programs, like clang itself, FSAFDO improve ~3% over AFDO.
There is also a small change on the create_llvm_prof tool side. We will update the tool soon.

I'm also looking forward to hearing the performance number on your tests.

The change looks good, thanks!

llvm/lib/CodeGen/MIRFSDiscriminator.cpp
153

Yes, the callsite hash is not in the discriminator hash for V1 -- the callsite hash is now the part of the key for the map. We now can have discriminators with the same value but they will not conflict with each other as they belong to different maps. This indirectly increases the range of the discriminators and thus results in less conflicts.

xur edited the summary of this revision. (Show Details)Mar 9 2023, 10:08 AM
hoy added a comment.Mar 9 2023, 10:21 AM

The total improvement really depends on programs. On average, I would say 1.5% to 2.0%. For some programs, like clang itself, FSAFDO improve ~3% over AFDO.
There is also a small change on the create_llvm_prof tool side. We will update the tool soon.

The numbers are exciting! I'll let you know our numbers once ready.

BTW, I'll started the integration with CSSPGO base off here. Pseudo probes have a potential to allow more bits for the FS discriminator and additional FS passes. Implementation-wise, the potential comes from placing the FS discriminator field as a separate int64 operand of the pseudo probe intrinsic, instead of using the existing dbg metadata. I would like your thoughts about whether it's worth the diversion. Thanks.

This revision was automatically updated to reflect the committed changes.

There is also a small change on the create_llvm_prof tool side. We will update the tool soon.

What kind of change is that about? Appreciate if you could drop us a note when the change is up since we don't closely follow create_llvm_prof updates.

BTW, I'll started the integration with CSSPGO base off here. Pseudo probes have a potential to allow more bits for the FS discriminator and additional FS passes. Implementation-wise, the potential comes from placing the FS discriminator field as a separate int64 operand of the pseudo probe intrinsic, instead of using the existing dbg metadata. I would like your thoughts about whether it's worth the diversion. Thanks.

All thing being equal, I'd favor consistency. But there's potential benefit by allowing more FS profile loading, though whether that is material depends on whether we will actually leverage the ability to load more than 4 profiles. My intuition is that loading base profile + pre-RA profile + pre-layout profile as the way it is today is probably good enough.