This is an archive of the discontinued LLVM Phabricator instance.

[PGO] Add Value Profiling for Loop Trip Count (WIP)
Needs Review · Public

Authored by w2yehia on Nov 25 2019, 11:01 AM.
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

This is Work In Progress.
I'm posting here to solicit feedback.
One motivation for doing this is to improve the loop versioning currently done by LoopVectorizer (LoopVectorizePass::processLoop).

Description:
For certain loops (which kinds is TBD), we would like to profile the exact trip count and the
frequency of each trip count. In other words, we want to value profile an expression that
represents the trip count of a loop, whenever it is computable.
The instrumentation point is the loop pre-header, and the value profile (VP) metadata (MD)
is appended to the llvm.loop MD which sits on the branch instruction of the latch block.
In order to find or create a loop pre-header, we run LoopSimplifyPass in the
pipeline for the new pass manager.
The logic in LoopInfoPlugin in ValueProfilePlugins.inc determines the instrumentation
point (needed during the -fprofile-generate step) and what to associate the MD with (needed
during the -fprofile-use step).
Instead of having the plugin decide the exact instruction to attach the MD to, we allow
the plugin to select a Loop to associate it with, and then in PGOUseFunc::annotateValueSites
in PGOInstrumentation.cpp we call setLoopTripCount(MD) on the Loop object.
This way, the Loop object maintains control (set/get) over what MD is associated with the
loop id MD node (a.k.a. the !llvm.loop MD).
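
To make the idea concrete, here is a minimal standalone sketch of what "value profiling the trip count" means. It is not the code in this patch and does not use LLVM's profiling runtime; TripCountProfile, TripCountEntry, and sumFirstN are hypothetical names used only for illustration. The trip count is recorded once per loop entry (conceptually in the pre-header), and the most frequent values are what would later be attached to the loop.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// (trip count, number of loop entries that had this trip count)
using TripCountEntry = std::pair<std::uint64_t, std::uint64_t>;

// Hypothetical stand-in for a per-loop value-profile table.
class TripCountProfile {
public:
  // Called once per loop entry with the trip count of that execution.
  void record(std::uint64_t tripCount) { ++counts_[tripCount]; }

  // Returns up to maxEntries entries, most frequent first. Ties are broken
  // by the smaller trip count so the result is deterministic.
  std::vector<TripCountEntry> topValues(std::size_t maxEntries) const {
    std::vector<TripCountEntry> out(counts_.begin(), counts_.end());
    std::sort(out.begin(), out.end(),
              [](const TripCountEntry &a, const TripCountEntry &b) {
                if (a.second != b.second)
                  return a.second > b.second;
                return a.first < b.first;
              });
    if (out.size() > maxEntries)
      out.resize(maxEntries);
    return out;
  }

private:
  std::map<std::uint64_t, std::uint64_t> counts_;
};

// An "instrumented" loop: the trip count n is recorded before the loop runs,
// which mirrors placing the instrumentation in the loop pre-header.
inline long sumFirstN(const int *data, std::uint64_t n,
                      TripCountProfile &profile) {
  assert(data != nullptr || n == 0);
  profile.record(n);
  long sum = 0;
  for (std::uint64_t i = 0; i < n; ++i)
    sum += data[i];
  return sum;
}
```

Calling sumFirstN repeatedly with different n and then asking for topValues(2) yields the two most common trip counts with their frequencies, which is the kind of information the summary proposes to store next to the !llvm.loop MD.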

Diff Detail

Repository
rL LLVM

Event Timeline

w2yehia created this revision. Nov 25 2019, 11:01 AM
Herald added a project: Restricted Project. Nov 25 2019, 11:01 AM
w2yehia edited the summary of this revision. Nov 25 2019, 11:20 AM
w2yehia edited the summary of this revision. Nov 26 2019, 7:22 AM
fhahn added a subscriber: fhahn. Nov 26 2019, 7:42 AM
w2yehia edited the summary of this revision. Nov 26 2019, 8:52 AM
hoy added a subscriber: hoy. Mar 20 2022, 8:09 PM

The work is interesting. I'm wondering if it is still ongoing. I haven't looked into the implementation details here yet, but here are a couple basic questions:

  1. Does the generated loop MD reflect the average trip count per execution of the loop, or the accumulated total trip count across all executions of the loop?
  2. How does the generated loop MD differ from the loop trip count synthesized from the branch_weights metadata?
  3. Have you seen a particular loop optimization pass benefit from this work?

Thanks.

Herald added a project: Restricted Project. Mar 20 2022, 8:09 PM
In D70688#3395357, @hoy wrote:

The work is interesting. I'm wondering if it is still ongoing. I haven't looked into the implementation details here yet, but here are a couple basic questions:

  1. Does the generated loop MD reflect the average trip count per execution of the loop, or the accumulated total trip count across all executions of the loop?
  2. How does the generated loop MD differ from the loop trip count synthesized from the branch_weights metadata?
  3. Have you seen a particular loop optimization pass benefit from this work?

Thanks.

Hi, we currently have more or less this implementation downstream. Feel free to contribute or even take over; I just haven't found the time to push this further.

  1. No, not the average. The loop MD will contain a set of the most frequently observed trip counts, with the frequency of each.
  2. Implied by (1): the branch_weights MD gives the average trip count, while the loop MD gives exact trip counts.
  3. You first version the loop on a hot trip count; then, with the trip count known at compile time in that version, you can vectorize it or apply any other loop optimization that benefits from a constant trip count.
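
As a rough illustration of the three answers above, the sketch below contrasts an average trip count with a per-value histogram and shows source-level versioning on a hot value. It is a standalone example, not code from the patch; TripHistogram, averageTripCount, hottestTripCount, scaleVersioned, and the hot value 8 are all assumptions chosen for the example, and the average is computed directly from the histogram rather than with LLVM's branch-weight estimate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical profile: trip count -> number of loop entries with that count.
using TripHistogram = std::map<std::uint64_t, std::uint64_t>;

// Average trip count: total iterations divided by the number of loop entries.
// This is the kind of single number an average-based estimate provides.
inline double averageTripCount(const TripHistogram &h) {
  std::uint64_t entries = 0, iterations = 0;
  for (const auto &e : h) {
    entries += e.second;
    iterations += e.first * e.second;
  }
  return entries == 0 ? 0.0
                      : static_cast<double>(iterations) /
                            static_cast<double>(entries);
}

// Most frequent exact trip count (0 for an empty histogram). On a tie the
// smaller trip count wins because the map is iterated in ascending order.
inline std::uint64_t hottestTripCount(const TripHistogram &h) {
  std::uint64_t best = 0, bestFreq = 0;
  for (const auto &e : h) {
    if (e.second > bestFreq) {
      best = e.first;
      bestFreq = e.second;
    }
  }
  return best;
}

// Versioned loop: a guarded copy with a compile-time-constant trip count,
// plus the generic fallback. The constant 8 is an assumed hot value.
inline void scaleVersioned(float *a, std::size_t n, float k) {
  constexpr std::size_t HotTripCount = 8;
  if (n == HotTripCount) {
    // Constant trip count: easy to fully unroll or vectorize.
    for (std::size_t i = 0; i < HotTripCount; ++i)
      a[i] *= k;
    return;
  }
  for (std::size_t i = 0; i < n; ++i)
    a[i] *= k;
}
```

For a histogram with 90 entries of trip count 8 and 10 entries of trip count 1000, the average is 107.2, which describes no real execution of the loop, while the histogram makes it clear that versioning on 8 covers 90% of the entries.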

Indeed this looks interesting.

How do you maintain the new profile metadata? Trip count can change as you unroll/reroll/vectorize a loop.
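
To illustrate why the metadata would go stale, here is a small standalone sketch of how a trip count histogram would have to be remapped if a loop were vectorized with factor VF and a scalar remainder loop. This is only an assumption about one possible update rule, not something the patch implements, and it ignores details such as a vectorizer requiring a scalar epilogue iteration; TripHistogram, RemappedProfile, and remapForVectorization are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical profile: trip count -> number of loop entries with that count.
using TripHistogram = std::map<std::uint64_t, std::uint64_t>;

struct RemappedProfile {
  TripHistogram vectorLoop;    // trip counts of the vectorized main loop
  TripHistogram remainderLoop; // trip counts of the scalar remainder loop
};

// A scalar trip count n becomes n / vf iterations of the vector loop and
// n % vf iterations of the remainder loop. Frequencies of scalar trip counts
// that map to the same new value are merged.
inline RemappedProfile remapForVectorization(const TripHistogram &scalar,
                                             std::uint64_t vf) {
  assert(vf > 0);
  RemappedProfile out;
  for (const auto &e : scalar) {
    out.vectorLoop[e.first / vf] += e.second;
    out.remainderLoop[e.first % vf] += e.second;
  }
  return out;
}
```

With scalar trip counts {8: 90, 10: 10} and VF = 4, the vector loop always runs 2 iterations, while the remainder loop runs 0 iterations 90 times and 2 iterations 10 times, so neither new loop can reuse the original histogram unchanged.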

The new profile metadata seems disconnected from the existing branch weights, from which a total trip count can also be derived. However, branch weights are adjusted by optimizations (e.g. scaled during inlining), so chances are that the total trip count derived from branch weights is more accurate in some cases. Have you considered using the branch-weight-derived total trip count to scale the trip count value profile (based on the histogram's total)?
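
One way to read the scaling suggestion is sketched below: rescale the histogram frequencies so that the total number of iterations implied by the histogram matches a total derived from branch weights. This is my interpretation for illustration only, not an existing LLVM utility; TripHistogram, scaleToTotalIterations, and the branchWeightIterations parameter are hypothetical, and the arithmetic assumes the intermediate products fit in 64 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical profile: trip count -> number of loop entries with that count.
using TripHistogram = std::map<std::uint64_t, std::uint64_t>;

// Scales every frequency by branchWeightIterations / histogramIterations,
// rounding to nearest, where histogramIterations = sum(tripCount * frequency).
// The trip count values themselves are left unchanged.
inline TripHistogram scaleToTotalIterations(
    const TripHistogram &h, std::uint64_t branchWeightIterations) {
  std::uint64_t histogramIterations = 0;
  for (const auto &e : h)
    histogramIterations += e.first * e.second;

  TripHistogram out;
  if (histogramIterations == 0)
    return out; // nothing to scale against

  for (const auto &e : h) {
    // Assumes e.second * branchWeightIterations does not overflow 64 bits.
    out[e.first] = (e.second * branchWeightIterations +
                    histogramIterations / 2) /
                   histogramIterations;
  }
  return out;
}
```

For example, if the histogram {8: 90, 10: 10} implies 820 iterations but inlining scaled the branch weights so that they imply 410, the scaled histogram becomes {8: 45, 10: 5}: the shape of the distribution is preserved while its total agrees with the branch weights.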

I'm also curious whether you have done any perf evaluation based on this work, i.e. on what target and what workload/benchmark did you see a perf improvement through versioning, and by how much?