This is an archive of the discontinued LLVM Phabricator instance.

[llvm-profdata] Use llvm::DenseMap in SampleProfileMap
Needs ReviewPublic

Authored by huangjd on Aug 23 2023, 6:22 PM.

Details

Summary

Changing the container type for SampleProfileMap from std::unordered_map to llvm::DenseMap. This brings up to an 8% speedup (31.4s vs 29.0s) when reading a large test profile, and a 5% speedup (0.82s vs 0.78s) when reading the function offset table alone.

Since DenseMap does not guarantee reference validity after insertion (when a rehash happens), code that uses a reference to SampleProfiles while performing insertion into the map must be modified to make sure it does not hold a dangling reference.

Diff Detail

Event Timeline

huangjd created this revision.Aug 23 2023, 6:22 PM
Herald added a project: Restricted Project. · View Herald TranscriptAug 23 2023, 6:22 PM
Herald added a subscriber: wenlei. · View Herald Transcript
huangjd requested review of this revision.Aug 23 2023, 6:22 PM
huangjd updated this revision to Diff 552948.Aug 23 2023, 6:48 PM
This comment was removed by huangjd.

Fix a dangling reference bug in ProfileConverter::flattenNestedProfile. This is an existing bug only revealed after D147740 changed the profile map container type.

Actually this is not true. Before D147740, SampleProfileMap is an unordered_map, which guarantees reference stability. See the following statement from https://cplusplus.com/reference/unordered_map/unordered_map/rehash/

If a rehash happens, all iterators are invalidated, but references and pointers to individual elements remain valid. If no actual rehash happens, no changes.

Implementations are depending on the reference stability of SampleProfileMap to simplify things, and your change in D147740 broke that. As evidenced by the various breakages and reverts, such a change is dangerous and can be impactful. I suggest you restore reference stability for SampleProfileMap; otherwise there may be more breakages from your change yet to be uncovered.

Would HashKeyMap<unordered_map, SampleContext, FunctionSamples> be enough to restore reference stability?

wenlei requested changes to this revision.Aug 23 2023, 9:57 PM
This revision now requires changes to proceed.Aug 23 2023, 9:57 PM

This does seem to fix memory corruption.
That being said, I think I am with @wenlei on this. The implementation seems fragile, as DenseMap apparently doesn't offer reference stability, which leads to these kinds of workarounds, and more than likely more issues down the road.

I think https://reviews.llvm.org/D147740 should be reverted and concerns addressed in that patch next re-land.

DenseMap is significantly faster than std::unordered_map as previously used.

Benchmarking with an internal profile for production: when reading the function offset table alone, the original implementation (before D147740), DenseMap, and unordered_map take 1.2, 0.78, and 0.82 s respectively, and reading the full profiles takes 34.6, 29.0, and 31.4 s respectively. That means using std::unordered_map is 5-8% slower. (Meanwhile, the refactoring itself has been very significant regardless of which container is used.)

DenseMap is significantly faster than std::unordered_map as previously used.

Benchmarking with an internal profile for production: when reading the function offset table alone, the original implementation (before D147740), DenseMap, and unordered_map take 1.2, 0.78, and 0.82 s respectively, and reading the full profiles takes 34.6, 29.0, and 31.4 s respectively. That means using std::unordered_map is 5-8% slower. (Meanwhile, the refactoring itself has been very significant regardless of which container is used.)

Regardless of how fast DenseMap is, correctness and robustness are higher priority. Evidently the change breaks reference stability relied on by the implementation and has caused a lot of breakages. I'm not confident this is the last fix needed.

I'm fine with HashKeyMap<unordered_map, SampleContext, FunctionSamples>, but I see not providing reference stability as fragile. Spraying the code with pointer/reference updates here and there is not only fragile but also hacky IMO.

DenseMap is significantly faster than std::unordered_map as previously used.

Benchmarking with an internal profile for production: when reading the function offset table alone, the original implementation (before D147740), DenseMap, and unordered_map take 1.2, 0.78, and 0.82 s respectively, and reading the full profiles takes 34.6, 29.0, and 31.4 s respectively. That means using std::unordered_map is 5-8% slower. (Meanwhile, the refactoring itself has been very significant regardless of which container is used.)

IMHO a more robust solution wins over marginal performance gains. The unordered_map provides fundamental guarantees and doesn't require future code owners to deal with how/when to insert into SampleProfileMap.

What is the root cause of unordered_map being slower?

What is the root cause of unordered_map being slower?

I think it is just the implementation in general; their algorithms are quite different.

What is the root cause of unordered_map being slower?

I think it is just the implementation in general; their algorithms are quite different.

The perf difference is likely directly related to reference stability. unordered_map stores pointers to values in the buckets, whereas DenseMap stores the values directly in the hash buckets. DenseMap has one less indirection and better locality, at the cost of reference stability: a rehash reallocates and moves the values.

Sometimes it's okay to give up reference stability for performance, but as the container actually owning all FunctionSamples, not providing reference stability means anyone holding a reference or pointer to any FunctionSamples is prone to dangling reference issues.

I have a slightly different opinion on this -- I am in the camp for better performance over a small loss of convenience.

This is no different from the (lack of) iterator stability for the vector type when a loop adds new elements. We don't disallow use of vector because of this. This should be dealt with via the right coding idiom (for DenseMap) and proper documentation (on all loops that insert new samples). Once we have a good baseline, the maintenance overhead in the future will be small.

I have a slightly different opinion on this -- I am in the camp for better performance over a small loss of convenience.

This is no different from the (lack of) iterator stability for the vector type when a loop adds new elements.

It is different: iterators tend to be local and short-lived, so they are much more manageable for a reason.

In fact, the standard only requires reference stability, not iterator stability, for unordered_map.

Quote from standardese: "If a rehash happens, all iterators are invalidated, but references and pointers to individual elements remain valid."

We don't disallow use of vector because of this. This should be dealt with via the right coding idiom (for DenseMap) and proper documentation (on all loops that insert new samples). Once we have a good baseline, the maintenance overhead in the future will be small.

It is less about the container itself and more about how it is used. Here the container is used to own low-level data with a long lifetime, which is why reference stability is useful.

That said, even if we still want to attempt the change to use DenseMap, I suggest we try that in a separate patch, so the majority of the original changes can get in and be stabilized first, and others can be unblocked now without needing to revert the whole patch. Evidently, attempting to remove reference stability is non-trivial and also needs more testing, which warrants a change of its own.

I'd also add that, robustness aside, we'd at least need to adapt the current implementation cleanly. I don't think we can go with DenseMap if we have to make changes like https://reviews.llvm.org/D157061?id=547068 that patch pointers. OTOH, something like the final version there is more acceptable: https://reviews.llvm.org/D157061?id=550866

I have a slightly different opinion on this -- I am in the camp for better performance over a small loss of convenience.

This is no different from the (lack of) iterator stability for the vector type when a loop adds new elements.

It is different: iterators tend to be local and short-lived, so they are much more manageable for a reason.

In fact, the standard only requires reference stability, not iterator stability, for unordered_map.

Quote from standardese: "If a rehash happens, all iterators are invalidated, but references and pointers to individual elements remain valid."

True for unordered_map, but what I meant is that iterator/reference stability is not always essential to the algorithm -- and we should not be afraid of using containers without such guarantees.

We don't disallow use of vector because of this. This should be dealt with via the right coding idiom (for DenseMap) and proper documentation (on all loops that insert new samples). Once we have a good baseline, the maintenance overhead in the future will be small.

It is less about the container itself and more about how it is used. Here the container is used to own low-level data with a long lifetime, which is why reference stability is useful.

Owning the data is probably not the problem; the problem is exposing pointers/references to it and keeping them for a long lifetime (in particular, across changes that alter the size of the container). How pervasive are those use cases?

@davidxl While discussion continues can the patch be reverted, or at least changed to unordered_map for now (preferably the former). It has been impacting our internal testing of trunk for a week now.

@wenlei, may I propose getting this patch in first (to unblock @ayermolo) while we are discussing the longer-term strategy? Another benefit is that we can wait to see if other incidences (of using long-lived references) show up, so that we have more cases to consider for the redesign (if needed).

@wenlei, may I propose getting this patch in first (to unblock @ayermolo) while we are discussing the longer-term strategy? Another benefit is that we can wait to see if other incidences (of using long-lived references) show up, so that we have more cases to consider for the redesign (if needed).

Back-and-forth reverts/relands are costly and somewhat disruptive, I hope we can gain more confidence before reattempting. Here's my suggestion:

Step 1: use HashKeyMap<unordered_map, SampleContext, FunctionSamples> instead of HashKeyMap<DenseMap, SampleContext, FunctionSamples> for SampleProfileMap. It's a one line change to unblock.

Step 2: a separate patch to attempt DenseMap, coupled with all the fixes needed. We can also help test correctness and performance on that change. Actually, the performance picture may not be that clear: reallocation is expensive, and the fact that you didn't run into reallocation in your testing may indicate that the perf benchmark result is skewed. We also need to take into account the cost of updating references etc.

WDYT?

The proposal sounds good to me.

@wenlei, may I propose getting this patch in first (to unblock @ayermolo) while we are discussing the longer-term strategy? Another benefit is that we can wait to see if other incidences (of using long-lived references) show up, so that we have more cases to consider for the redesign (if needed).

Back-and-forth reverts/relands are costly and somewhat disruptive, I hope we can gain more confidence before reattempting. Here's my suggestion:

Step 1: use HashKeyMap<unordered_map, SampleContext, FunctionSamples> instead of HashKeyMap<DenseMap, SampleContext, FunctionSamples> for SampleProfileMap. It's a one line change to unblock.

Step 2: a separate patch to attempt DenseMap, coupled with all the fixes needed. We can also help test correctness and performance on that change. Actually, the performance picture may not be that clear: reallocation is expensive, and the fact that you didn't run into reallocation in your testing may indicate that the perf benchmark result is skewed. We also need to take into account the cost of updating references etc.

WDYT?

The proposal sounds good to me.

@wenlei, may I propose getting this patch in first (to unblock @ayermolo) while we are discussing the longer-term strategy? Another benefit is that we can wait to see if other incidences (of using long-lived references) show up, so that we have more cases to consider for the redesign (if needed).

Back-and-forth reverts/relands are costly and somewhat disruptive, I hope we can gain more confidence before reattempting. Here's my suggestion:

Step 1: use HashKeyMap<unordered_map, SampleContext, FunctionSamples> instead of HashKeyMap<DenseMap, SampleContext, FunctionSamples> for SampleProfileMap. It's a one line change to unblock.

Step 2: a separate patch to attempt DenseMap, coupled with all the fixes needed. We can also help test correctness and performance on that change. Actually, the performance picture may not be that clear: reallocation is expensive, and the fact that you didn't run into reallocation in your testing may indicate that the perf benchmark result is skewed. We also need to take into account the cost of updating references etc.

WDYT?

Great. I would be happy to test it in our environment with bigger profiles.

@huangjd can you follow up and change to unordered_map, so whole diff doesn't have to be reverted?

@huangjd Do you think you will have time to change the implementation to unordered_map as discussed by EOD? Otherwise I am reverting the diff to unblock our validation testing.

@huangjd Do you think you will have time to change the implementation to unordered_map as discussed by EOD? Otherwise I am reverting the diff to unblock our validation testing.

D159014

huangjd updated this revision to Diff 554123.EditedAug 28 2023, 5:37 PM
huangjd edited the summary of this revision. (Show Details)

Update patch to use llvm::DenseMap.
Using postorder traversal when flattening the profile. In this case there is no need to look up the inserted profile again after its callee profiles are flattened. The computational complexity is the same (no additional operations added).

huangjd retitled this revision from [llvm-profdata] Fix dangling reference after D147740 to [llvm-profdata] Use llvm::DenseMap in SampleProfileMap.Aug 28 2023, 5:43 PM
huangjd edited the summary of this revision. (Show Details)
huangjd updated this revision to Diff 554125.Aug 28 2023, 5:48 PM
huangjd edited the summary of this revision. (Show Details)

nit: remove extra new line

huangjd added inline comments.Aug 28 2023, 5:51 PM
llvm/include/llvm/ProfileData/SampleProf.h
1575

Since flatten does not modify the original sample profile (it takes a const reference) and applies the same logic to each profile, it should be traversal-order agnostic: processing the profiles in any order yields the same results.

We did some measurement internally on DenseMap vs. std::unordered_map.

The test was setup on a dedicated devserver as follows:

  • Pick the top 100 slow compilations from a major internal build target.
  • The profile (csspgo) size is 774MB including 170K profiled functions and 80M samples.
  • Run every compilation using either version for 5 runs and calculate the average.
  • Report both instruction count (very low variance) and elapsed wall clock time (relatively high variance).

The result

  • 0.02% less instructions from the unordered_map version.
  • 1.2% less wall clock time from the unordered_map version.

We didn't measure the time spent within the sample profile loader, though; the above numbers are all e2e compilations.

  • 0.02% less instructions from the unordered_map version.
  • 1.2% less wall clock time from the unordered_map version.

We didn't measure the time spent within the sample profile loader, though; the above numbers are all e2e compilations.

Thanks for the measurement. This result doesn't seem to justify all the trouble of removing reference stability. The benefit of DenseMap could be offset by the cost of rehashing.

Pick the top 100 slow compilations from a major internal build target.

Among the 100 compilations measured, is there anything that shows a meaningful improvement from DenseMap?

Pick the top 100 slow compilations from a major internal build target.

Among the 100 compilations measured, is there anything that shows a meaningful improvement from DenseMap?

Not really. For instruction count, the biggest win from DenseMap is 0.14%, and unordered_map also wins by 0.17% in one case. For wall clock time, there is a bigger swing due to bigger variance. DenseMap wins as much as 4.3%, but unordered_map also shows comparable wins in other cases.

Interesting. @williamjhuang can you double-check your measurement? If the improvement can be reproduced, doing some profiling to understand where the performance comes from would be useful.

@weiwang Have you tried to generate a flame graph showing how much time each function takes?

huangjd added a comment.EditedAug 30 2023, 1:05 PM

Interesting. @williamjhuang can you double-check your measurement? If the improvement can be reproduced, doing some profiling to understand where the performance comes from would be useful.

I measured the time spent reading the function offset table section and reading the entire profile, not the compilation time. The other steps in the compilation pipeline are expected to take much more time than reading the profile (actually, only the function table is always read; the samples are read only if they match some functions in the source code), so we need a common ground on what's being measured.

The objective of this series of patches is to improve the profile reader (while of course making sure the overall compilation time improves as well).

Interesting. @williamjhuang can you double-check your measurement? If the improvement can be reproduced, doing some profiling to understand where the performance comes from would be useful.

I measured the time spent reading the function offset table section and reading the entire profile, not the compilation time. The other steps in the compilation pipeline are expected to take much more time than reading the profile (actually, only the function table is always read; the samples are read only if they match some functions in the source code), so we need a common ground on what's being measured.

The objective of this series of patches is to improve the profile reader (while of course making sure the overall compilation time improves as well).

The e2e compilation should still see reduced instructions and cycles. Can you collect e2e data for more compilations (and also report profile size)? While the slowest compilations determine the build latency, the goal here is to save the build pool's utilization, so you don't have to pick the slowest compilations for measurement.

@weiwang Have you tried to generate a flame graph showing how much time each function takes?

I didn't collect this for the measurement. I think this change is pretty isolated to the loader pass, so if there is a win, it would show up in the e2e compilation.

I'll take a few cases and report back.

@weiwang Have you tried to generate a flame graph showing how much time each function takes?

I didn't collect this for the measurement. I think this change is pretty isolated to the loader pass, so if there is a win, it would show up in the e2e compilation.

I'll take a few cases and report back.

and at the end of the day e2e is what we care about, no?

This brings up to 8% speed up (31.4s vs 29.0s) when reading a large test profile, and 5% speedup (0.82s vs 0.78s) when reading the function offset table alone.

@huangjd what were you measuring exactly? Is that the total wall clock / cycles for SampleProfileLoader across all modules, or were you just measuring SampleProfileReader time? From the line above, it sounds like you were measuring the latter. However, the impact of changing a container is not limited to the creation of that container.

So measuring SampleProfileLoader would give us a more accurate picture of overall impact, including container construction (which is SampleProfileReader), updating and consumption (which are other parts of SampleProfileLoader).

OTOH, if the impact on SampleProfileLoader is significant, we should be able to see a difference in e2e compilation. I agree that in the end e2e compilation is what matters.

This brings up to 8% speed up (31.4s vs 29.0s) when reading a large test profile, and 5% speedup (0.82s vs 0.78s) when reading the function offset table alone.

@huangjd what were you measuring exactly? Is that the total wall clock / cycles for SampleProfileLoader across all modules, or were you just measuring SampleProfileReader time? From the line above, it sounds like you were measuring the latter. However, the impact of changing a container is not limited to the creation of that container.

So measuring SampleProfileLoader would give us a more accurate picture of overall impact, including container construction (which is SampleProfileReader), updating and consumption (which are other parts of SampleProfileLoader).

OTOH, if the impact on SampleProfileLoader is significant, we should be able to see a difference in e2e compilation. I agree that in the end e2e compilation is what matters.

true.

However, when picking the compilations to examine, the slowest ones may not be good candidates -- they are likely triggering pathological compile-time issues elsewhere, making profile loading less significant.

William's experiment may also have FSAFDO on, thus have more rounds of loading.