This is an archive of the discontinued LLVM Phabricator instance.

[SimplifyLibCalls] Optimization for pow(x, n) where n is some constant
ClosedPublic

Authored by mgrang on Oct 22 2015, 2:04 PM.

Details

Summary
In order to avoid calling pow function we generate repeated fmul when n is a
positive or negative whole number.

For each exponent we pre-compute Addition Chains in order to minimize the no.
of fmuls.
Refer: http://wwwhomes.uni-bielefeld.de/achim/addition_chain.html

We pre-compute addition chains for exponents upto 32 (which results in a max of
7 fmuls).

For eg:
4 = 2+2
5 = 2+3
6 = 3+3 and so on

Hence,
pow(x, 4.0) ==> y = fmul x, x
                x = fmul y, y
                ret x

For negative exponents, we simply compute the reciprocal of the final result.

Note: This transformation is only enabled under fast-math.

Patch by Mandeep Singh Grang <mgrang@codeaurora.org>

Diff Detail

Event Timeline

mgrang updated this revision to Diff 38171.Oct 22 2015, 2:04 PM
mgrang retitled this revision from to Optimization for pow(x, n) where n is some constant.
mgrang updated this object.
mgrang added reviewers: weimingz, apazos, zinob, beanz.
mgrang set the repository for this revision to rL LLVM.
mgrang added a subscriber: llvm-commits.
mcrosier edited reviewers, added: majnemer; removed: beanz.Oct 22 2015, 3:30 PM
majnemer edited edge metadata.Oct 22 2015, 5:41 PM

Please upload a diff with more context.

lib/Transforms/Utils/SimplifyLibCalls.cpp
1065

Pointers go on the right side. Why is Exp signed?

1069

Sentences start with an upper case letter.

1077

Please don't strength reduce manually.

mgrang added inline comments.Oct 22 2015, 5:58 PM
lib/Transforms/Utils/SimplifyLibCalls.cpp
1077

Can you please clarify what do you mean by "don't strength reduce manually".

majnemer added inline comments.Oct 22 2015, 6:14 PM
lib/Transforms/Utils/SimplifyLibCalls.cpp
1077

Unless I am mistaken, you can rewrite that expression as Exp / 2.

mgrang added inline comments.Oct 22 2015, 6:41 PM
lib/Transforms/Utils/SimplifyLibCalls.cpp
1077

Yes, I can rewrite that as Exp / 2 but wouldn't that be slower than Exp >> 1 since the former case would involve generating a div instruction?

majnemer added inline comments.Oct 22 2015, 6:45 PM
lib/Transforms/Utils/SimplifyLibCalls.cpp
1077

I would recommend against premature optimization.

mgrang updated this revision to Diff 38200.Oct 22 2015, 6:57 PM
mgrang updated this object.
mgrang edited edge metadata.
mgrang removed rL LLVM as the repository for this revision.
majnemer added inline comments.Oct 22 2015, 7:42 PM
lib/Transforms/Utils/SimplifyLibCalls.cpp
1197

If Exp is 72057594037927937, it will be converted to 72057594037927936.0 when passing it to isExactlyValue. Is that OK?

joerg edited edge metadata.Oct 22 2015, 10:15 PM

Somewhat unrelated side note, for fast math -0.5 (generally) and +/- 1/3 (for targets with cbrt) are likely also useful to consider.

Trivial nits.

lib/Transforms/Utils/SimplifyLibCalls.cpp
1190

Maximize 80-column here.

test/Transforms/InstCombine/pow-4.ll
10

I don't see the attributes associated with #1 anywhere, so it's probably safe to drop the #1 here.

31

I believe you can safely drop the tail marker (i.e., tail call -> call). There are many instances of this.

mgrang updated this revision to Diff 38248.Oct 23 2015, 12:11 PM
mgrang edited edge metadata.
mgrang marked 7 inline comments as done.Oct 23 2015, 12:16 PM
mgrang added inline comments.
lib/Transforms/Utils/SimplifyLibCalls.cpp
1197

I limit the PowerOptThreshold so that we don't run into this issue.

Folks, could you please review this patch and comment on whether we are good or if any more changes are needed?

Thanks escha.
It seems Addition Chain Exponentiation is NP-complete.

My solution calculates pow(x, n) by computing powers of 2.
See: https://en.wikipedia.org/wiki/Exponentiation_by_squaring#Computation_by_powers_of_2

escha edited edge metadata.Oct 25 2015, 8:16 PM

It is NP-complete, but that's only for the general case; in practice you really only need to list a few common cases (it's not like we need to calculate optimal chains for pow(x, 1723)).

Hi Mandeep,
The general rule of thumb for pinging a patch is about a week. Also, David and others are likely busy getting prepared for the Dev Meeting later this week, so response may be delayed.

Also, I don't think you've properly addressed David's comment about the use of the isExactlyValue() function. I could be mistaken, but I believe David is pointing out that there may be conversion error with some inputs. The default value of 32 works, but will this work for all user specified values?

Personally, my suggestion would be to drop the command-line option all together.

Chad

Wouldn't the breakeven point be target dependent? A hard-coded default feels wrong.

Thanks Chad and apologies to everyone for pinging too early!

I guess doing this transformation for exponents greater than 32 would not really be helpful since the no. of fmul(s) would increase.
So I would like to limit this only to exponents <= 32 (+ve and -ve).

I would then change the command line option to bool (default value true), so that the user can turn off the transformation if he wants to.

Thoughts?

Hi Paul,

I think for smaller values of exponents (<=32), the break-even point does not really depend on the target. Although I am open to suggestions to consider otherwise :)

Thanks,
Mandeep

mgrang updated this revision to Diff 38572.Oct 27 2015, 11:08 AM
mgrang updated this object.
mgrang edited edge metadata.
mgrang set the repository for this revision to rL LLVM.
mgrang retitled this revision from Optimization for pow(x, n) where n is some constant to [SimplifyLibCalls] Optimization for pow(x, n) where n is some constant.Oct 28 2015, 11:18 AM

I wonder if an argument could be made that this expansion should happen in CodeGenPrep instead of SimplifyLibCalls...

David,
I see that all the optimizations for various math related functions is done in SimplifyLibCalls. So I guess SimplifyLibCalls should be the place to do it.
Did you have a particular reason why this expansion should be done in CodeGenPrep?

scanon added inline comments.Oct 30 2015, 6:53 AM
lib/Transforms/Utils/SimplifyLibCalls.cpp
1197

Can we really not find a more elegant way to handle this than by checking for equality to every integer in range? How was 32 chosen? It would make more sense to bound the number of multiplies in the resulting lowering than bounding the magnitude of the exponent. I would also expect this to not be symmetric, because division is expensive.

1199

s/divide by 1.0/compute the reciprocal/

Added Davide Italiano since I see that he is working on similar/related stuff.

mgrang added inline comments.Nov 4 2015, 4:29 PM
lib/Transforms/Utils/SimplifyLibCalls.cpp
1197

Thanks for the review Steve. I am working on fixing this.
GCC generates fmul for all exponent values under fast-math. But I don't know if it would really be efficient since the no. of fmuls increases with the exponent.

joerg added a comment.Nov 5 2015, 6:22 AM

Instead of a hard-coded limit of 32, would about building a table of multiplication chains with at most 8 multiplications or something like that? Precomputing them is easy and the number of intermediate registers should also be low enough to be. Might be worth checking much temporaries are generated and tweak the limit a bit.

mgrang updated this revision to Diff 39752.Nov 9 2015, 2:07 PM
mgrang updated this object.
mgrang removed rL LLVM as the repository for this revision.
mgrang marked an inline comment as done.Nov 9 2015, 2:12 PM

I have limited the transformation to a max of 8 fmuls (thus max exponent=32).

Using addition-chains can reduce the no. of fmuls in few cases but storing the addition-chains for each exponent will incur space overhead.
The current implementation of exponentiation by binary expansion seems much cleaner IMO.

mgrang updated this revision to Diff 40166.Nov 13 2015, 12:01 PM
mgrang updated this object.
mgrang updated this object.

I have now implemented this transformation by pre-computing multiplications (using Addition-Chains).
Can the reviewers please review this patch?

scanon added inline comments.Nov 16 2015, 12:27 PM
lib/Transforms/Utils/SimplifyLibCalls.cpp
1194

I think either this does the wrong thing for x86 80-bit long doubles (or does it not apply to powl( )?) and a hypothetical future quad type, or it could simply be Op2C->getValueAPF().convertToDouble().

1202

Super minor nit, ceil(V) != floor(V) is an overly-complex way to check if V is an integer. trunc(V) != V is simpler and seems more idiomatic to me. Longer-term, we should just add isInteger( ) to APFloat, but that's outside the scope of this change.

scanon edited edge metadata.Nov 16 2015, 1:59 PM

I went ahead and added isInteger( ) to APFloat (r253254). You can use that to simplify some of the logic and make it somewhat more general.

mcrosier edited edge metadata.Nov 16 2015, 2:25 PM
mcrosier removed a subscriber: mcrosier.

Thanks Steve.

I see that x86_FP80 and FP128 types are not handled by this transformation. Also we will lose info if we convert them to float or double. So I am in favor of limiting this transformation only to strict float and double types (ie. powl will not be handled).

Would like to know what you think.

scanon added inline comments.Nov 16 2015, 5:57 PM
lib/Transforms/Utils/SimplifyLibCalls.cpp
1194

With r253254, I think you should be able to make this general pretty easily. Something like:

APFloat V = abs(Op2C->getValueAPF( ));
if ((V.compare(APFloat(32.0)) == APFloat::cmpGreaterThan) || !V.isInteger())
    return nullptr;
// V is known to be an integer in [0, 32], so it can safely be converted to double now
// int would be even nicer, but APFloat doesn't have convertToInt (yet?). 
Value *FMul = getPow(Op1, convertToDouble(V), B);

I think we cannot simply call convertToDouble(V) on something which is not a double.

For example;
powf (x, 8) --> results in assertion: "semantics == (const llvm::fltSemantics*)&IEEEdouble && "Float semantics are not IEEEdouble"' failed."

That's the reason I checked the Type of V and accordingly converted to float or double.

But the problem is that long doubles are neither float nor doubles (they are type FP128) and hence cannot be converted readily.

Hence the "something like". We should really just be able to pull and integer value out of APFloat (but again, outside the scope of this change). Short-term, I think you can use V.convert(APFloat::IEEEdouble, ...) to get something that *can* be converted to double.

mgrang updated this revision to Diff 40421.Nov 17 2015, 12:14 PM
mgrang updated this revision to Diff 40443.Nov 17 2015, 3:31 PM

Made the test case target independent.

This looks good to me now. One of the owners should probably sign off on it too.

majnemer added inline comments.Nov 17 2015, 8:14 PM
lib/Transforms/Utils/SimplifyLibCalls.cpp
1068–1069

Perhaps we should assert that Exp != 0 ?

1088–1101

You will end up creating unnecessary instructions if Exp < 32, please do not do this.

test/Transforms/InstCombine/pow-4.ll
4

You don't need --check-prefix=CHECK

Thanks a lot everyone for your valuable comments and suggestions!

@majnemer @escha
Can you please review and sign off this patch?

mgrang added inline comments.Nov 17 2015, 8:31 PM
lib/Transforms/Utils/SimplifyLibCalls.cpp
1068–1069

I will fix this.

1088–1101

Thanks David.

My previous implementation used Binary Exponentiation but that results in more fmuls getting generated.

Addition-Chain Exponentiation generates the optimal (least) no. of fmuls.

Other reviewers were of the opinion that we should minimize fmuls (by precomputing the multiplication chains).

I think it will always be a trade-off between runtime and code-size; and here we opted for optimal runtime.

test/Transforms/InstCombine/pow-4.ll
4

I will fix this.

mcrosier added inline comments.Nov 18 2015, 6:16 AM
lib/Transforms/Utils/SimplifyLibCalls.cpp
1101

I believe the point David is trying to make is that you don't need to compute all 32 chains (at compile-time) when exp is less than 32.

How about something like this:

static const unsigned ExpTable[32][2] = {
  { 0, 0 }, // Unused
  { 0, 0 }, // pow1
  { 0, 0 }, // pow1, pow1
  { 1, 2 }, // 
  { 2, 2 },
  { 2, 3 },
  { 3, 3 },
  { 2, 5 },
  { 4, 4 },
  { 1, 8 },
     ...
  { 16, 16 } }

if (exp == 1)

return pow1;

if (exp == 2)

return B.CreateFMul(pow1, pow1);

AddChain[1] = pow1;
AddChain[2] = B.CreateFMul(pow1, pow1);
for (unsigned i = 3; i < exp; ++i)

AddChain[i] = B.CreateFMul(AddChain[ExpTable[i][0]], AddChain[ExpTable[i][1]]);

return AddChain[exp];

joerg added inline comments.Nov 18 2015, 7:28 AM
lib/Transforms/Utils/SimplifyLibCalls.cpp
1088–1101

Almost. There are two different ways to store the pre-computed table. The first is to store the summands, e.g. 15 = 12 + 3. Advantage is that it needs exactly one entry per exponent, but downside is that you need to deduplicate, since 12 and 3 will use the same sub-expressions. A map and recursion is easiest approach for that.

The other approach is to just compile the list explicitly like (0,0), (1,0), (2,2), (3,3), (4,2). Entry 0 is the input, every other entry is created in the loop. This version requires slightly more space in the binary, but allows using a simple for loop.

As long as we're really going to dig into the addition-chain issue, why are we duplicating Reassociate::buildMinimalMultiplyDAG here?

mgrang updated this revision to Diff 40539.Nov 18 2015, 12:42 PM

Thanks for the suggestions everyone.

I have now implemented Addition-Chains using recursion with memoization. The result being that we do not need to generate extraneous instructions.

Here is an example:

7 --> (2, 5) --> (2, (2, 3)) --> (2, (2, (1, 2)))

We now generate instructions for only the intermediate products needed by our exponent. Storing the results at each step saves us repetitive recursive calls.

I have also addressed David's comments in this patch (added an assertion for Exp!=0 and removed check-prefix from the test case).

Please let me know your thoughts on this.

mgrang marked 3 inline comments as done.Nov 18 2015, 12:43 PM
mgrang updated this revision to Diff 40543.Nov 18 2015, 12:47 PM

Hi @majnemer,

I guess I have addressed all your concerns in my latest patch.
Please let me know if you have any other concerns.

Thanks,
Mandeep

Please run clang-format on the patch. At minimum it should reduce the amount of unnecessary vertical space in the getPow function.

mgrang updated this revision to Diff 40714.Nov 19 2015, 4:08 PM

Ran clang-format on the patch as per Chad's suggestion.

mgrang updated this revision to Diff 40724.Nov 19 2015, 5:19 PM

D14466 provides a variable (bool unsafeFPMath) to check for -fast-math. Made use of this in the check for fast-math.
Also updated comment for fast-math check.

Changed if (InnerChain[Exp] != nullptr) ==> if (InnerChain[Exp])

Can we please get this patch reviewed?

A gentle reminder for reviews please.

joerg accepted this revision.Nov 28 2015, 1:15 PM
joerg edited edge metadata.

My concerns have been addressed.

This revision is now accepted and ready to land.Nov 28 2015, 1:15 PM
weimingz closed this revision.Dec 4 2015, 2:03 PM