This is an archive of the discontinued LLVM Phabricator instance.

Encode duplication factor from loop vectorization and loop unrolling to discriminator.
ClosedPublic

Authored by danielcdh on Nov 8 2016, 1:52 PM.

Details

Summary

This patch starts the implementation as discussed in the following RFC: http://lists.llvm.org/pipermail/llvm-dev/2016-October/106532.html

When an optimization duplicates code in a way that scales down the execution count of a basic block, we record the duplication factor as part of the discriminator so that the offline profile-processing tool can find the duplication factor and compute the accurate execution frequency of the corresponding source code. Two important optimizations that fall into this category are loop vectorization and loop unrolling. This patch records the duplication factor for these two optimizations.

The recording is guarded by the flag encode-duplication-in-discriminators, which is off by default.

Event Timeline

danielcdh updated this revision to Diff 77252.Nov 8 2016, 1:52 PM
danielcdh retitled this revision to Encode duplication factor from loop vectorization and loop unrolling to discriminator.
danielcdh updated this object.
danielcdh added a subscriber: llvm-commits.
aprantl added inline comments.Nov 15 2016, 3:13 PM
include/llvm/IR/DebugInfoMetadata.h
1772–1773
  1. This magic formula needs to be documented somewhere.
  2. Is there anything preventing / detecting potential collisions here?
hfinkel added inline comments.Nov 15 2016, 3:37 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
876

I apologize for rehashing this, but now I'm confused again. We have two situations:

  1. Loop is unrolled (or code is otherwise duplicated). In this case, the discriminator value must be different (so that the relevant counts are summed)
  2. Instruction is vectorized. In this case we need a discriminator value with a duplication factor (so that the counts are multiplied by the duplication factor because each vector instruction represents DF scalar instructions).

And both situations can obviously be combined. It seems like, in general, we're trying to take a shortcut here: instead of giving each instruction from an unrolled loop a different discriminator (so that all of the relevant counts will be summed), we're giving them all the same discriminator with a duplication factor. This will work in some cases, but not if some of the loop iterations (or instructions therein) are executed conditionally. I don't see why this is worthwhile. Can you please explain?

danielcdh added inline comments.Nov 15 2016, 4:23 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
876

You are right. For an unrolled loop, the most accurate way would be to assign a distinct discriminator to each cloned body and use "sum" to compute the source count. We are taking a shortcut here: assign all copies one discriminator (with a duplication factor) and use max * dup_factor to compute the source count. This works fine when all cloned copies have uniform behavior and thus similar execution counts; otherwise it will overcount the source.
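
To make the trade-off concrete, here is a minimal sketch (hypothetical helpers, not code from this patch) of how an offline profile tool could recover the source count under the two schemes:

  #include <algorithm>
  #include <cstdint>
  #include <vector>

  // Distinct discriminators: every clone keeps its own count, so the source
  // count is the sum over all copies. Exact even for non-uniform copies.
  uint64_t recoverBySum(const std::vector<uint64_t> &CopySamples) {
    uint64_t Sum = 0;
    for (uint64_t S : CopySamples)
      Sum += S;
    return Sum;
  }

  // Shared discriminator with a duplication factor: the profile records only
  // one count for all copies, so we scale the (max) count by the factor.
  uint64_t recoverByDupFactor(const std::vector<uint64_t> &CopySamples,
                              unsigned DupFactor) {
    uint64_t Max = *std::max_element(CopySamples.begin(), CopySamples.end());
    return Max * DupFactor;
  }

  // Example: a loop unrolled by 4 with an early exit, per-copy counts
  // {100, 100, 50, 0}:
  //   recoverBySum       -> 250  (accurate)
  //   recoverByDupFactor -> 400  (overcounts the source line)
  // With uniform copies {100, 100, 100, 100} both return 400.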

So the con of using a duplication factor for an unrolled loop is less accuracy. The pro is that it saves line-table size.

Maybe a better solution would be: if there is control flow inside the unrolled loop, use a distinct discriminator for each clone; otherwise use a duplication factor.

In this patch, I only added the logic for the duplication factor. I will add the logic to "assign distinct discriminators to clones" in another patch. How about I add a FIXME here and address loop-unroll-with-control-flow in that patch?

hfinkel added inline comments.Nov 21 2016, 2:40 PM
lib/Transforms/Vectorize/LoopVectorize.cpp
876

So the con of using a duplication factor for an unrolled loop is less accuracy. The pro is that it saves line-table size.

Can you explain again why this saves on line-table size?

In this patch, I only added the logic for the duplication factor. I will add the logic to "assign distinct discriminators to clones" in another patch. How about I add a FIXME here and address loop-unroll-with-control-flow in that patch?

That's fine by me.

Thanks for the review.

Any further comments about the proposed encoding before it's ready to land?

Thanks,
Dehao

lib/Transforms/Vectorize/LoopVectorize.cpp
876

For the following case:

#1 for (i = 0; i < 4; i++)
#2 a[i] = b[i];

If we use a duplication factor, the line table for the expanded code will be:

a[0] = b[0];  line 2, discriminator 0x400
a[1] = b[1];  line 2, discriminator 0x400
a[2] = b[2];  line 2, discriminator 0x400
a[3] = b[3];  line 2, discriminator 0x400

They all share the same location, so there is no new line-table entry for the cloned code.

But with distinct discriminators, the expanded code will be:

a[0] = b[0];  line 2, discriminator 0x10000
a[1] = b[1];  line 2, discriminator 0x20000
a[2] = b[2];  line 2, discriminator 0x30000
a[3] = b[3];  line 2, discriminator 0x40000

There are 2 issues:

  1. Each copy has a distinct location, so we need 4 entries in the debug line table.
  2. The encoding for a distinct discriminator is longer than for a duplication factor (see the ULEB128 size sketch below).
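
To illustrate point 2: DWARF emits the discriminator as a ULEB128 operand, so its size grows with the value. A small sketch (similar to the helper in llvm/Support/LEB128.h) of the byte counts for the values above:

  #include <cstdio>

  // Number of bytes needed to encode V as ULEB128 (7 payload bits per byte).
  unsigned getULEB128Size(unsigned V) {
    unsigned Bytes = 0;
    do {
      V >>= 7;
      ++Bytes;
    } while (V != 0);
    return Bytes;
  }

  int main() {
    printf("0x400   -> %u bytes\n", getULEB128Size(0x400));   // 2 (shared dup-factor discriminator)
    printf("0x10000 -> %u bytes\n", getULEB128Size(0x10000)); // 3 (distinct discriminator, copy 1)
    printf("0x40000 -> %u bytes\n", getULEB128Size(0x40000)); // 3 (distinct discriminator, copy 4)
    printf("127     -> %u bytes\n", getULEB128Size(127));     // 1 (largest single-byte value)
    return 0;
  }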
hfinkel edited edge metadata.Nov 21 2016, 3:34 PM

Thanks for the review.

Any further comments about the proposed encoding before it's ready to land?

Could we be smarter about the encoding so that the distinct-copies case did not take up so much space? It seems like the design right now gives a fixed number of low bits to the duplication factor. Could we use a variable number of bits? For example, because of the underlying LEB128 encoding, we could use some of the lowest-order bits as indicators, or better, use a variable-length code:

low: < duplication factor > < copy id > : high

where each factor is a variable-length code (e.g. a prefix code: https://en.wikipedia.org/wiki/Prefix_code). For example, the Fibonacci code (https://en.wikipedia.org/wiki/Fibonacci_coding) is a well-known code which is good when the encoded numbers are likely to be small. In short, we might encode:

bitreverse( Fibonacci(copy-id + 1) | Fibonacci(dup-factor + 1) )

In this case, if either field is empty, then we waste only 2 bits encoding that field. Small numbers take only a small number of bits. What do you think?
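
For reference, a minimal sketch (illustration only, not part of this patch) of Fibonacci coding, showing why an "empty" field, encoded as Fibonacci(0 + 1), costs only the 2-bit codeword "11":

  #include <cassert>
  #include <string>
  #include <vector>

  // Fibonacci code of N (N >= 1): Zeckendorf representation written with the
  // least-significant Fibonacci term first, plus a terminating '1' so that
  // every codeword ends in "11".
  std::string fibonacciCode(unsigned N) {
    assert(N >= 1 && "Fibonacci coding is defined for positive integers");
    std::vector<unsigned> Fibs; // 1, 2, 3, 5, 8, ...
    for (unsigned A = 1, B = 2; A <= N;) {
      Fibs.push_back(A);
      unsigned Next = A + B;
      A = B;
      B = Next;
    }
    std::string Bits(Fibs.size(), '0');
    for (unsigned I = Fibs.size(); I-- > 0;) // greedy, largest term first
      if (Fibs[I] <= N) {
        Bits[I] = '1';
        N -= Fibs[I];
      }
    return Bits + '1';
  }

  // fibonacciCode(1) == "11"     (an empty field costs just 2 bits)
  // fibonacciCode(2) == "011"
  // fibonacciCode(5) == "00011"
  // fibonacciCode(9) == "100011" (9 = 1 + 8)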


Thanks for the review.

Any further comments about the proposed encoding before it's ready to land?

Could we be smarter about the encoding so that the distinct-copies case did not take up so much space? It seems like the design right now gives a fixed number of low bits to the duplication factor. Could we use a variable number of bits? For example, because of the underlying LEB128 encoding, we could use some of the lowest-order bits as indicators, or better, use a variable-length code:

low: < duplication factor > < copy id > : high

where each factor is a variable-length code (e.g. a prefix code: https://en.wikipedia.org/wiki/Prefix_code). For example, the Fibonacci code (https://en.wikipedia.org/wiki/Fibonacci_coding) is a well-known code which is good when the encoded numbers are likely to be small. In short, we might encode:

bitreverse( Fibonacci(copy-id + 1) | Fibonacci(dup-factor + 1) )

In this case, if either field is empty, then we waste only 2 bits encoding that field. Small numbers take only a small number of bits. What do you think?

Thanks for the suggestion. I spent some time studying prefix-based encoding, which is a great idea. But when applied here, with Fibonacci encoding, 3 bits can only represent numbers up to 3, 4 bits up to 5, and 7 bits up to 21. This does not seem enough, as the duplication factor usually needs 5+ bits, making it hard to fit multiple pieces into a 1-byte ULEB128.

Another approach, as you mentioned, is to use the lower bits to encode which kind of info is present, so that when only one kind is available, we can encode it more efficiently into one byte. This is a great idea. I did some experiments to explore its potential: when assigning a duplication factor, I always remove the original discriminator and put the duplication factor in the lower 7 bits so that it always fits into 1 byte. The debug_line size increase, compared with the current patch, is shown below:

447.dealII 8.01% 6.16%
453.povray 6.50% 5.04%
482.sphinx3 7.54% 5.74%
470.lbm 0.00% 0.00%
444.namd 6.19% 4.89%
433.milc 23.12% 17.63%
450.soplex 2.66% 1.99%
445.gobmk 7.51% 5.02%
471.omnetpp 0.52% 0.36%
458.sjeng 10.31% 7.37%
473.astar 5.44% 4.26%
456.hmmer 9.74% 7.50%
401.bzip2 9.01% 6.33%
462.libquantum 10.79% 8.36%
403.gcc 2.74% 1.77%
464.h264ref 29.62% 21.14%
483.xalancbmk 1.42% 1.12%
429.mcf 9.55% 7.45%
400.perlbench 1.96% 1.21%
mean 7.81% 5.84%

The first column of data represents the current implementation. The second column is the experiment (fitting the whole duplication factor into one byte), which represents the upper bound of any encoding.

From the data, it looks like the improvement is marginal, so I'm wondering whether it justifies the added complexity compared with fixed-width encoding.

Thanks,
Dehao

anemet added a subscriber: anemet.Dec 1 2016, 10:56 AM

Thanks for the review.

Any further comments about the proposed encoding before it's ready to land?

Could we be smarter about the encoding so that the distinct-copies case did not take up so much space? It seems like the design right now gives a fixed number of low bits to the duplication factor. Could we use a variable number of bits? For example, because of the underlying LEB128 encoding, we could use some of the lowest-order bits as indicators, or better, use a variable-length code:

low: < duplication factor > < copy id > : high

where each factor is a variable-length code (e.g. a prefix code: https://en.wikipedia.org/wiki/Prefix_code). For example, the Fibonacci code (https://en.wikipedia.org/wiki/Fibonacci_coding) is a well-known code which is good when the encoded numbers are likely to be small. In short, we might encode:

bitreverse( Fibonacci(copy-id + 1) | Fibonacci(dup-factor + 1) )

In this case, if either field is empty, then we waste only 2 bits encoding that field. Small numbers take only a small number of bits. What do you think?

Thanks for the suggestion. I spent some time studying prefix-based encoding, which is a great idea. But when applied here, with Fibonacci encoding, 3 bits can only represent numbers up to 3, 4 bits up to 5, and 7 bits up to 21. This does not seem enough, as the duplication factor usually needs 5+ bits, making it hard to fit multiple pieces into a 1-byte ULEB128.

Another approach, as you mentioned, is to use the lower bits to encode which kind of info is present, so that when only one kind is available, we can encode it more efficiently into one byte. This is a great idea. I did some experiments to explore its potential: when assigning a duplication factor, I always remove the original discriminator and put the duplication factor in the lower 7 bits so that it always fits into 1 byte. The debug_line size increase, compared with the current patch, is shown below:

447.dealII 8.01% 6.16%
453.povray 6.50% 5.04%
482.sphinx3 7.54% 5.74%
470.lbm 0.00% 0.00%
444.namd 6.19% 4.89%
433.milc 23.12% 17.63%
450.soplex 2.66% 1.99%
445.gobmk 7.51% 5.02%
471.omnetpp 0.52% 0.36%
458.sjeng 10.31% 7.37%
473.astar 5.44% 4.26%
456.hmmer 9.74% 7.50%
401.bzip2 9.01% 6.33%
462.libquantum 10.79% 8.36%
403.gcc 2.74% 1.77%
464.h264ref 29.62% 21.14%
483.xalancbmk 1.42% 1.12%
429.mcf 9.55% 7.45%
400.perlbench 1.96% 1.21%
mean 7.81% 5.84%

The first column of data represents the current implementation. The second column is the experiment (fitting the whole duplication factor into one byte), which represents the upper bound of any encoding.

From the data, it looks like the improvement is marginal, so I'm wondering whether it justifies the added complexity compared with fixed-width encoding.

Thanks for doing those experiments! Regarding the second form, many of those changes are on the order of a few percent -- that seems worthwhile. Can you post the patch?


danielcdh updated this revision to Diff 80015.Dec 1 2016, 6:31 PM
danielcdh edited edge metadata.

Updated the encoding algorithm to use the lowest bit to represent whether BaseDiscriminator/DuplicationFactor is available.
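
For illustration, a minimal sketch of what such a presence-bit layout could look like (hypothetical field widths and bit assignments; the actual encoding in this patch may differ):

  #include <cassert>
  #include <cstdint>

  // Hypothetical layout:
  //   bit 0       - set if a base discriminator is present
  //   bit 1       - set if a duplication factor is present
  //   higher bits - the present fields, base discriminator first (6 bits here)
  uint32_t encodeDiscriminator(uint32_t BaseDiscriminator, uint32_t DupFactor) {
    assert(BaseDiscriminator < 64 && "this sketch reserves 6 bits for the base");
    uint32_t Encoded = 0;
    unsigned Shift = 2; // the low two bits hold the presence flags
    if (BaseDiscriminator != 0) {
      Encoded |= 1u;
      Encoded |= BaseDiscriminator << Shift;
      Shift += 6;
    }
    if (DupFactor > 1) {
      Encoded |= 2u;
      Encoded |= DupFactor << Shift;
    }
    return Encoded;
  }

  // When only one small field is present, the whole value stays below 128, so
  // the DWARF ULEB128 operand still fits in a single byte.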

With this change, the debug info size impact is shown in the last column below:

447.dealII 8.01% 6.16% 6.92%
453.povray 6.50% 5.04% 5.38%
482.sphinx3 7.54% 5.74% 6.22%
470.lbm 0.00% 0.00% 0.00%
444.namd 6.19% 4.89% 5.14%
433.milc 23.12% 17.63% 19.96%
450.soplex 2.66% 1.99% 2.25%
445.gobmk 7.51% 5.02% 6.22%
471.omnetpp 0.52% 0.36% 0.45%
458.sjeng 10.31% 7.37% 8.38%
473.astar 5.44% 4.26% 4.46%
456.hmmer 9.74% 7.50% 8.36%
401.bzip2 9.01% 6.33% 7.72%
462.libquantum 10.79% 8.36% 8.75%
403.gcc 2.74% 1.77% 2.28%
464.h264ref 29.62% 21.14% 27.37%
483.xalancbmk 1.42% 1.12% 1.16%
429.mcf 9.55% 7.45% 7.69%
400.perlbench 1.96% 1.21% 1.73%
mean 7.81% 5.84% 6.68%

Updated the encoding algorithm to use the lowest bit to represent whether BaseDiscriminator/DuplicationFactor is available.

With this change, the debug info size impact is shown in the last column below:

What are the other columns?

Updated the encoding algorithm to use the lowest bit to represent whether BaseDiscriminator/DuplicationFactor is available.

With this change, the debug info size impact is shown in the last column below:

What are the other columns?

They are copied from the previous experiment:

The first column of data represents the current implementation. The second column is the experiment (fitting the whole duplication factor into one byte), which represents the upper bound of any encoding.

hfinkel added inline comments.Dec 3 2016, 6:53 PM
include/llvm/IR/DebugInfoMetadata.h
1803

Please move the comments that explain the encoding to above DILocation::getBaseDiscriminator and write some text to explain what is going on.

lib/Transforms/Utils/AddDiscriminators.cpp
192–193

Please add a utility function for the encoding.

lib/Transforms/Vectorize/LoopVectorize.cpp
218

Based on one of the other threads, I suppose we're going to add some command-line flag to enable/disable this? That being the case, we'll read this setting from some function attribute string instead of using a cl::opt.

danielcdh added inline comments.Dec 5 2016, 9:48 AM
lib/Transforms/Utils/AddDiscriminators.cpp
192–193

Thanks for the comment!

I agree with all the comments, but before I address them and move forward with this encoding, let's discuss further whether we want to use the lower bits to indicate the discriminator type. Compared with fixed-position encoding (i.e., the duplication factor always consumes 2 ULEB128 bytes):

Pros:

  • saves ~1% debug line table size (on average)

Cons:

  • added complexity to clang code and also offline profile creation code
  • limited representation scope: with the new algorithm we need 1 extra bit to represent the profile type, which means we have to limit the useful bits to 6 in order to fit into a 1-byte ULEB128. As a result, the maximum raw discriminator will be 63 instead of 127.

Comments?

lib/Transforms/Vectorize/LoopVectorize.cpp
218

Yes, we will guard all the logic by checking the profile-debug flag once https://reviews.llvm.org/D25435 is submitted. I'm not sure whether it will be a function attribute string or simply a flag passed down from the frontend.

Meanwhile, I think it is worth having a flag to control whether we encode duplication in the discriminator, so that we can fall back to the old discriminator assignment algorithm. We can remove the flag later if we find it useless.

hfinkel added inline comments.Dec 7 2016, 12:02 PM
lib/Transforms/Utils/AddDiscriminators.cpp
192–193

saves ~1% debug line table size (on average)

Yes, but this 1% is 20% or more of the increase. More importantly for me, it reduces the relative expense (in terms of binary-size increase) of using distinct location tags vs. using duplication factors.

added complexity to clang code and also offline profile creation code

If this patch is any indication, then the extra complexity seems minor.

192–193

limited representation scope: with the new algorithm we need 1 extra bit to represent the profile type, which means we have to limit the useful bits to 6 in order to fit into a 1-byte ULEB128. As a result, the maximum raw discriminator will be 63 instead of 127.

I don't like the fixed bit limit here. How about using the high bit to indicate that there are more bits in the discriminator (that's how ULEB128 itself works, right?)?
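
For illustration, a small sketch of a per-chunk continuation-bit packing inside the 32-bit discriminator, analogous to how ULEB128 works (hypothetical helpers; the final encoding may differ):

  #include <cstdint>

  // Pack V using 3 payload bits plus 1 continuation bit per 4-bit chunk, so
  // small values use one chunk and larger values grow a chunk at a time.
  uint32_t packWithContinuation(uint32_t V) {
    uint32_t Result = 0;
    unsigned Shift = 0;
    do {
      uint32_t Chunk = V & 0x7;   // 3 payload bits
      V >>= 3;
      if (V != 0)
        Chunk |= 0x8;             // continuation bit: another chunk follows
      Result |= Chunk << Shift;
      Shift += 4;
    } while (V != 0);
    return Result;
  }

  uint32_t unpackWithContinuation(uint32_t Encoded) {
    uint32_t V = 0;
    unsigned Shift = 0;
    uint32_t Chunk;
    do {
      Chunk = Encoded & 0xF;
      Encoded >>= 4;
      V |= (Chunk & 0x7) << Shift;
      Shift += 3;
    } while (Chunk & 0x8);
    return V;
  }

  // packWithContinuation(5)  == 0x5   (one chunk)
  // packWithContinuation(10) == 0x1A  (two chunks: 0xA then 0x1)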

lib/Transforms/Vectorize/LoopVectorize.cpp
218

Yes, we will guard all the logic by checking the profile-debug flag once https://reviews.llvm.org/D25435 is submitted. I'm not sure whether it will be a function attribute string or simply a flag passed down from the frontend.

I think it needs to be an attribute. Otherwise, it won't work correctly with LTO.

danielcdh updated this revision to Diff 81115.Dec 12 2016, 11:32 AM
danielcdh marked 2 inline comments as done.

update the encoding, add comments, update test.

lib/Transforms/Vectorize/LoopVectorize.cpp
218

I did not realize the LTO issue. But let's address that problem in https://reviews.llvm.org/D25435

ping...

Thanks,
Dehao

ping...

Thanks,
Dehao

ping...

Thanks,
Dehao

hfinkel accepted this revision.Jan 24 2017, 3:14 PM

I apologize for my delay in getting back to this. Thanks for updating this with a more flexible encoding scheme. It looks like in the future, should we desire, we can extend the number of fields (and/or their size). This LGTM. We can fix up replacing/supplementing the cl::opt with a function attribute in a follow-up.

include/llvm/IR/DebugInfoMetadata.h
1321

We should say that, for example, the AddDiscriminators pass will do this.

lib/Transforms/Utils/AddDiscriminators.cpp
100

I'm assuming that this is for testing, correct? We'll need to add a function attribute to really control this.

This revision is now accepted and ready to land.Jan 24 2017, 3:14 PM
danielcdh added inline comments.Jan 24 2017, 4:08 PM
lib/Transforms/Utils/AddDiscriminators.cpp
100

This is for testing purposes, and will be "true" by default when -fdebug-info-profiling is set.

I'm not sure why we would need this as a function attribute. I would suppose all binaries are built with this either on or off. Or am I missing something?

hfinkel added inline comments.Jan 24 2017, 5:22 PM
lib/Transforms/Utils/AddDiscriminators.cpp
100

Because we shouldn't use command-line options to communicate between the frontend and the backend. Some legacy aside, this is the strongly-preferred mechanism. It is also necessary for this to work correctly with LTO.

danielcdh added inline comments.Jan 24 2017, 5:37 PM
lib/Transforms/Utils/AddDiscriminators.cpp
100

How about using TargetOption.DebugInfoForProfiling to decide, as we check it in lib/CodeGen/AsmPrinter/DwarfDebug.cpp?

hfinkel added inline comments.
lib/Transforms/Utils/AddDiscriminators.cpp
100

I believe that's deprecated too (because it also does not work with LTO). @echristo , can you comment on this (i.e. do we need to use a function-attribute here)?

danielcdh marked an inline comment as done.Feb 1 2017, 4:53 PM
danielcdh added inline comments.
lib/Transforms/Utils/AddDiscriminators.cpp
100

Removed the flag; now the code checks the CompileUnit flag (debugInfoForProfiling) to decide whether to encode or not.

danielcdh updated this revision to Diff 86750.Feb 1 2017, 4:53 PM

rebase and remove the flag and use CompileUnit flag to decide if we want to encode duplication factor.

PTAL

rebase and remove the flag and use CompileUnit flag to decide if we want to encode duplication factor.

PTAL

Makes sense to me, please go ahead.

Could you please add a note about debugInfoForProfiling to the LangRef section on DICompileUnit?

danielcdh updated this revision to Diff 88024.Feb 10 2017, 11:17 AM

update LangRef

hfinkel added inline comments.Feb 10 2017, 11:58 AM
docs/LangRef.rst
4007

debugInfoForProfiling is not a 'tuple containing debug info to be emitted'. Instead of adding it to that list, how about adding another sentence like:

The ``debugInfoForProfiling:`` field is a boolean indicating whether or not line-table discriminators are updated to provide more-accurate profiling results.
danielcdh closed this revision.Feb 10 2017, 1:20 PM