This is an archive of the discontinued LLVM Phabricator instance.

Scale frequencies of a set of blocks
ClosedPublic

Authored by eraman on Jan 10 2017, 2:19 PM.

Details

Summary

Scaling is done with respect to a reference block whose new frequency is specified.
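In other words (as the discussion below makes explicit), each block B in the supplied set gets the updated frequency

    NewFreq(B) = Freq(B) * NewRefFreq / OldRefFreq

where OldRefFreq and NewRefFreq are the reference block's frequencies before and after the update (the symbol names here are illustrative, not taken from the patch).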

Diff Detail

Repository
rL LLVM

Event Timeline

eraman updated this revision to Diff 83871.Jan 10 2017, 2:19 PM
eraman retitled this revision from to Scale frequencies of a set of blocks.
eraman updated this object.
eraman added reviewers: davidxl, chandlerc.
eraman added a subscriber: llvm-commits.
davidxl accepted this revision.Jan 10 2017, 2:57 PM
davidxl edited edge metadata.

lgtm

This revision is now accepted and ready to land.Jan 10 2017, 2:57 PM
chandlerc added inline comments.Jan 10 2017, 3:03 PM
lib/Analysis/BlockFrequencyInfo.cpp
182 (On Diff #83871)

Extraneous blank line.

183 (On Diff #83871)

This does 2+N heap allocations where N is the number of basic blocks passed in.... Can we do something more efficient?

Can we use some of the other scaling infrastructure that is already used in BFI such as ScaledNumber?

At the very least, please don't declare a new APInt inside the loop.
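For illustration, the pattern being flagged looks roughly like the sketch below (a hypothetical helper with assumed names, not the code on the diff). An APInt wider than 64 bits keeps its words on the heap, so the two reference-frequency values account for the "2" and the per-iteration temporary for the "N":

    #include "llvm/ADT/APInt.h"
    #include "llvm/ADT/SmallPtrSet.h"
    #include "llvm/Analysis/BlockFrequencyInfo.h"
    #include "llvm/IR/BasicBlock.h"
    using namespace llvm;

    // Sketch only: every 128-bit APInt constructed here heap-allocates its words.
    static void scaleBlockFreqs(BlockFrequencyInfo &BFI,
                                const SmallPtrSetImpl<BasicBlock *> &Blocks,
                                uint64_t OldRefFreq, uint64_t NewRefFreq) {
      APInt NewFreq(128, NewRefFreq); // one allocation
      APInt OldFreq(128, OldRefFreq); // one allocation
      for (BasicBlock *BB : Blocks) {
        // A fresh 128-bit APInt per block: N more allocations.
        APInt BBFreq(128, BFI.getBlockFreq(BB).getFrequency());
        BBFreq *= NewFreq;
        BFI.setBlockFreq(BB, BBFreq.udiv(OldFreq).getLimitedValue());
      }
    }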

eraman updated this revision to Diff 83894.Jan 10 2017, 4:49 PM
eraman edited edge metadata.

Address Chandler's comments.

eraman marked an inline comment as done.Jan 10 2017, 4:53 PM
eraman added inline comments.
lib/Analysis/BlockFrequencyInfo.cpp
183 (On Diff #83871)

I can't think of a good way to efficiently do this with ScaledNumber.

I have moved the BBFreq declaration out of the loop. Now there is a call to the = operator inside the loop, which calls a memset. If the right set of optimizations kicks in, this memset should reduce to a store. Does this look reasonable?

Chandler, if you strongly prefer the use of ScaledNumber to APInt, I have a patch ready and will happily upload that. Even after this change, the multiply and divide operations allocate memory. Multiply seems similar in complexity (since we always multiply two 64-bit words). I don't know how Knuth's algorithm used in APInt compares with ScaledNumber's long division in the case where the dividend is more than 64 bits. The tradeoff here is the precision loss with ScaledNumber, whose effect in practice is unknown. I personally prefer to keep the APInt, but as I said above, I am ready to switch to ScaledNumber to get this patch in.
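Concretely, the hoisted form described above looks roughly like this (reusing the assumed names from the earlier sketch; not the actual Diff 83894):

    // Sketch only: BBFreq is constructed once; the operator=(uint64_t) inside
    // the loop reuses its heap storage and memsets the upper words. The
    // multiply and udiv may still allocate internally, as noted above.
    APInt NewFreq(128, NewRefFreq);
    APInt OldFreq(128, OldRefFreq);
    APInt BBFreq(128, 0);
    for (BasicBlock *BB : Blocks) {
      BBFreq = BFI.getBlockFreq(BB).getFrequency();
      BBFreq *= NewFreq;
      BFI.setBlockFreq(BB, BBFreq.udiv(OldFreq).getLimitedValue());
    }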

chandlerc accepted this revision.Jan 18 2017, 5:56 PM

> Chandler, if you strongly prefer the use of ScaledNumber to APInt, I have a patch ready and will happily upload that. Even after this change, the multiply and divide operations allocate memory. Multiply seems similar in complexity (since we always multiply two 64-bit words). I don't know how Knuth's algorithm used in APInt compares with ScaledNumber's long division in the case where the dividend is more than 64 bits. The tradeoff here is the precision loss with ScaledNumber, whose effect in practice is unknown. I personally prefer to keep the APInt, but as I said above, I am ready to switch to ScaledNumber to get this patch in.

Sorry I missed your earlier update.

This isn't about using ScaledNumber to *directly* implement the algorithm you have; it is about using a different algorithm. Specifically, if you look at how BFI computes things, it goes to great lengths to accumulate and normalize the frequency for each block without doing division. Instead, it uses shifts and other techniques to scale things. But this isn't really compatible with the style of update you are using here.

Now that the memory allocation is gone, the part I am concerned with is the 'udiv'. Knuth's algorithm is still frighteningly expensive. But we can wait and revisit this once it shows up in a profile. For the future maintainer of the code, please add a clear comment that this udiv is slow and that there are ways to remove it if it ever shows up.
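Such a comment might read roughly as follows (the wording is only an illustration, not text from the committed patch):

    // Note: this udiv goes through APInt's Knuth long division and is a known
    // potential hot spot. If it shows up in a profile, it can be removed, e.g.
    // by hoisting a precomputed scale out of the loop, reducing the operands
    // to 32 bits, or reusing BFI's shift-based normalization (see the review
    // discussion of this revision).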

I see three fundamental options that would eliminate this. Two are simpler but lose precision more rapidly; the other is more complex but retains precision:

  1. Factor the division out of the loop into a pre-computed scale, and shift that scale and the input frequencies to maximize the precision that remains.
  2. Reduce the precision a priori to 32 bits so that a 64-bit multiply and divide suffice, and then re-scale to the 64-bit values.
  3. Recast this using the core BFI algorithm for normalizing block mass after scaling due to backedges (or inlining in this case). There might be a way to do this locally, I'm not sure. This would essentially mean adding live updates to the core BFI algorithm used.

#3 seems like it will have the best results but will be *very* hard to implement. Both #1 and #2 seem possible, but I think it's fine to wait and do that in a follow-up patch. I particularly like #1 because it isn't just hiding the cost in the hardware divide instruction, which, for example, won't exist on 32-bit ARM hosts.
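As a rough illustration of option #1 (a sketch only, not the committed code), the udiv can be hoisted out of the loop into a fixed-point scale. For brevity the sketch assumes the frequencies fit in 32 bits and that NewRefFreq <= OldRefFreq, so the 32.32 scale and the per-block product fit in 64 bits; real code would derive the shift amounts from the actual operand widths, which is the "shift to maximize the precision" part:

    #include <cstdint>

    // Option 1, sketched: the udiv is done once, outside the loop, as a 32.32
    // fixed-point scale; each block then needs only a multiply and a shift.
    // Assumes Freqs[i] and NewRefFreq fit in 32 bits and that
    // NewRefFreq <= OldRefFreq, so nothing overflows 64 bits.
    void scaleFreqsApprox(uint64_t *Freqs, unsigned NumBlocks,
                          uint32_t NewRefFreq, uint32_t OldRefFreq) {
      uint64_t Scale = ((uint64_t)NewRefFreq << 32) / OldRefFreq; // one udiv
      for (unsigned I = 0; I != NumBlocks; ++I)
        Freqs[I] = (Freqs[I] * Scale) >> 32; // multiply + shift per block
    }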

Anyways, feel free to submit now that the memory allocation is handled separately and with a comment that the udiv is a known likely hot spot.

This revision was automatically updated to reflect the committed changes.