To avoid calling the pow function, we generate repeated fmul instructions when n is a positive or negative whole number. For each exponent we pre-compute an addition chain to minimize the number of fmuls. Refer: http://wwwhomes.uni-bielefeld.de/achim/addition_chain.html

We pre-compute addition chains for exponents up to 32 (which results in a maximum of 7 fmuls). For example:
4 = 2 + 2
5 = 2 + 3
6 = 3 + 3
and so on.

Hence, pow(x, 4.0) ==>
  y = fmul x, x
  x = fmul y, y
  ret x

For negative exponents, we simply compute the reciprocal of the final result.

Note: This transformation is only enabled under fast-math.

Patch by Mandeep Singh Grang <mgrang@codeaurora.org>
lib/Transforms/Utils/SimplifyLibCalls.cpp
1077 | Can you please clarify what you mean by "don't strength reduce manually".
lib/Transforms/Utils/SimplifyLibCalls.cpp
1077 | Unless I am mistaken, you can rewrite that expression as Exp / 2.
lib/Transforms/Utils/SimplifyLibCalls.cpp
1077 | Yes, I can rewrite that as Exp / 2, but wouldn't that be slower than Exp >> 1, since the former case would involve generating a div instruction?
lib/Transforms/Utils/SimplifyLibCalls.cpp
1077 | I would recommend against premature optimization.
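(A quick illustration of the point being made in this exchange, as a hedged example and not code from the patch: for an unsigned operand, any optimizing compiler already strength-reduces a division by a power of two into a shift, so writing the shift by hand buys nothing. The helper names are hypothetical.)

  // Hypothetical helpers, illustration only: at -O1 and above both compile
  // to the same single shift instruction on typical targets.
  unsigned halfByDiv(unsigned Exp) { return Exp / 2; }
  unsigned halfByShift(unsigned Exp) { return Exp >> 1; }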
lib/Transforms/Utils/SimplifyLibCalls.cpp
1205 | If Exp is 72057594037927937, it will be converted to 72057594037927936.0 when passing it to isExactlyValue. Is that OK?
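(To make the rounding David describes concrete: 72057594037927937 is 2^56 + 1, which cannot be represented in a double's 53-bit significand, so it rounds to 2^56. A minimal standalone illustration, not patch code:)

  #include <cstdio>
  int main() {
    long long N = 72057594037927937LL;   // 2^56 + 1
    double D = static_cast<double>(N);   // rounds to 72057594037927936.0, i.e. 2^56
    std::printf("%lld -> %.1f\n", N, D); // the +1 is silently lost
  }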
Somewhat unrelated side note: for fast math, -0.5 (generally) and +/- 1/3 (for targets with cbrt) are likely also useful to consider.
Trivial nits.
lib/Transforms/Utils/SimplifyLibCalls.cpp
1198 | Maximize 80-column here.
test/Transforms/InstCombine/pow-4.ll
10 | I don't see the attributes associated with #1 anywhere, so it's probably safe to drop the #1 here.
31 | I believe you can safely drop the tail marker (i.e., tail call -> call). There are many instances of this.
lib/Transforms/Utils/SimplifyLibCalls.cpp
1205 | I limit the PowerOptThreshold so that we don't run into this issue.
Folks, could you please review this patch and comment on whether we are good or if any more changes are needed?
This might be relevant here: https://en.wikipedia.org/wiki/Addition-chain_exponentiation
Thanks escha.
It seems Addition Chain Exponentiation is NP-complete.
My solution calculates pow(x, n) by computing powers of 2.
See: https://en.wikipedia.org/wiki/Exponentiation_by_squaring#Computation_by_powers_of_2
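(For readers unfamiliar with the technique, a minimal sketch of exponentiation by computing powers of 2; illustrative only, not the patch code:)

  // Square X repeatedly and multiply in the squares selected by the bits of
  // the exponent; this uses O(log n) multiplications instead of n - 1.
  double powBySquaring(double X, unsigned Exp) {
    double Result = 1.0;
    while (Exp) {
      if (Exp & 1)   // this bit of the exponent is set: fold in the current power
        Result *= X;
      X *= X;        // next power of two: x, x^2, x^4, ...
      Exp >>= 1;
    }
    return Result;
  }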
It is NP-complete, but that's only for the general case; in practice you really only need to list a few common cases (it's not like we need to calculate optimal chains for pow(x, 1723)).
Hi Mandeep,
The general rule of thumb for pinging a patch is about a week. Also, David and others are likely busy getting prepared for the Dev Meeting later this week, so response may be delayed.
Also, I don't think you've properly addressed David's comment about the use of the isExactlyValue() function. I could be mistaken, but I believe David is pointing out that there may be a conversion error with some inputs. The default value of 32 works, but will this work for all user-specified values?
Personally, my suggestion would be to drop the command-line option altogether.
Chad
Thanks Chad and apologies to everyone for pinging too early!
I guess doing this transformation for exponents greater than 32 would not really be helpful, since the number of fmuls would increase.
So I would like to limit this only to exponents <= 32 (positive and negative).
I would then change the command-line option to a bool (default value true), so that the user can turn off the transformation if desired.
Thoughts?
Hi Paul,
I think for smaller values of exponents (<=32), the break-even point does not really depend on the target. Although I am open to suggestions to consider otherwise :)
Thanks,
Mandeep
I wonder if an argument could be made that this expansion should happen in CodeGenPrep instead of SimplifyLibCalls...
David,
I see that all the optimizations for the various math-related functions are done in SimplifyLibCalls. So I guess SimplifyLibCalls should be the place to do it.
Did you have a particular reason why this expansion should be done in CodeGenPrep?
lib/Transforms/Utils/SimplifyLibCalls.cpp
1205 | Can we really not find a more elegant way to handle this than by checking for equality to every integer in range? How was 32 chosen? It would make more sense to bound the number of multiplies in the resulting lowering than bounding the magnitude of the exponent. I would also expect this to not be symmetric, because division is expensive.
1207 | s/divide by 1.0/compute the reciprocal/
lib/Transforms/Utils/SimplifyLibCalls.cpp
1205 | Thanks for the review, Steve. I am working on fixing this.
Instead of a hard-coded limit of 32, how about building a table of multiplication chains with at most 8 multiplications, or something like that? Precomputing them is easy, and the number of intermediate registers should also be low enough. It might be worth checking how many temporaries are generated and tweaking the limit a bit.
I have limited the transformation to a max of 8 fmuls (thus max exponent=32).
Using addition chains can reduce the number of fmuls in a few cases, but storing the addition chains for each exponent will incur a space overhead.
The current implementation of exponentiation by binary expansion seems much cleaner IMO.
I have now implemented this transformation by pre-computing multiplications (using Addition-Chains).
Can the reviewers please review this patch?
lib/Transforms/Utils/SimplifyLibCalls.cpp
1202 | I think either this does the wrong thing for x86 80-bit long doubles (or does it not apply to powl()?) and a hypothetical future quad type, or it could simply be Op2C->getValueAPF().convertToDouble().
1210 | Super minor nit: ceil(V) != floor(V) is an overly complex way to check if V is an integer. trunc(V) != V is simpler and seems more idiomatic to me. Longer-term, we should just add isInteger() to APFloat, but that's outside the scope of this change.
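(For what it's worth, a hedged sketch of what the trunc-style check could look like in APFloat terms; the names V and Truncated are illustrative, while roundToIntegral and compare are existing APFloat methods:)

  // "Is V an integer?" via trunc(V) == V: round a copy toward zero and compare.
  APFloat Truncated = V;
  Truncated.roundToIntegral(APFloat::rmTowardZero);
  bool IsInteger = Truncated.compare(V) == APFloat::cmpEqual;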
I went ahead and added isInteger( ) to APFloat (r253254). You can use that to simplify some of the logic and make it somewhat more general.
Thanks Steve.
I see that the x86_FP80 and FP128 types are not handled by this transformation. Also, we will lose information if we convert them to float or double. So I am in favor of limiting this transformation to strict float and double types only (i.e., powl will not be handled).
Would like to know what you think.
lib/Transforms/Utils/SimplifyLibCalls.cpp
1202 | With r253254, I think you should be able to make this general pretty easily. Something like:

  APFloat V = abs(Op2C->getValueAPF( ));
  if ((V.compare(APFloat(32.0)) == APFloat::cmpGreaterThan) || !V.isInteger())
    return nullptr;
  // V is known to be an integer in [0, 32], so it can safely be converted to double now.
  // int would be even nicer, but APFloat doesn't have convertToInt (yet?).
  Value *FMul = getPow(Op1, convertToDouble(V), B);
I think we cannot simply call convertToDouble(V) on something which is not a double.
For example:
powf(x, 8) --> results in the assertion: semantics == (const llvm::fltSemantics*)&IEEEdouble && "Float semantics are not IEEEdouble" failed.
That's the reason I checked the type of V and converted it to float or double accordingly.
But the problem is that long doubles are neither floats nor doubles (they are of type FP128) and hence cannot be converted readily.
Hence the "something like". We should really just be able to pull an integer value out of APFloat (but again, that's outside the scope of this change). Short-term, I think you can use V.convert(APFloat::IEEEdouble, ...) to get something that *can* be converted to double.
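(Presumably something along these lines; a sketch only, not the exact patch code. Depending on the LLVM revision the double semantics may be spelled APFloat::IEEEdouble or APFloat::IEEEdouble(), and LosesInfo is an illustrative name.)

  bool LosesInfo;
  APFloat V = abs(Op2C->getValueAPF());
  // Convert from whatever the source semantics are (float, x86_fp80, fp128, ...)
  // to IEEE double so that convertToDouble() is legal to call.
  V.convert(APFloat::IEEEdouble, APFloat::rmNearestTiesToEven, &LosesInfo);
  double Exp = V.convertToDouble();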
lib/Transforms/Utils/SimplifyLibCalls.cpp
1068–1069 | I will fix this.
1088–1101 | Thanks, David. My previous implementation used binary exponentiation, but that results in more fmuls being generated. Addition-chain exponentiation generates the optimal (smallest) number of fmuls. Other reviewers were of the opinion that we should minimize fmuls (by precomputing the multiplication chains). I think it will always be a trade-off between runtime and code size; here we opted for optimal runtime.
test/Transforms/InstCombine/pow-4.ll
3 | I will fix this.
lib/Transforms/Utils/SimplifyLibCalls.cpp
1101 | I believe the point David is trying to make is that you don't need to compute all 32 chains (at compile-time) when exp is less than 32. How about something like this:

  static const unsigned ExpTable[32][2] = {
    { 0, 0 },  // Unused
    { 0, 0 },  // pow1
    { 0, 0 },  // pow1, pow1
    { 1, 2 },  //
    { 2, 2 },
    { 2, 3 },
    { 3, 3 },
    { 2, 5 },
    { 4, 4 },
    { 1, 8 },
    ...
    { 16, 16 }
  };

  if (exp == 1)
    return pow1;
  if (exp == 2)
    return B.CreateFMul(pow1, pow1);

  AddChain[1] = pow1;
  AddChain[i] = B.CreateFMul(AddChain[ExpTable[i][0]], AddChain[ExpTable[i][1]]);
  return AddChain[exp];
lib/Transforms/Utils/SimplifyLibCalls.cpp
1088–1101 | Almost. There are two different ways to store the pre-computed table. The first is to store the summands, e.g. 15 = 12 + 3. The advantage is that it needs exactly one entry per exponent, but the downside is that you need to deduplicate, since 12 and 3 will use the same sub-expressions. A map and recursion is the easiest approach for that. The other approach is to just compile the list explicitly, like (0,0), (1,0), (2,2), (3,3), (4,2). Entry 0 is the input; every other entry is created in the loop. This version requires slightly more space in the binary, but allows using a simple for loop.
As long as we're really going to dig into the addition-chain issue, why are we duplicating Reassociate::buildMinimalMultiplyDAG here?
Thanks for the suggestions everyone.
I have now implemented the addition chains using recursion with memoization. As a result, we do not need to generate extraneous instructions.
Here is an example:
7 --> (2, 5) --> (2, (2, 3)) --> (2, (2, (1, 2)))
We now generate instructions for only the intermediate products needed by our exponent. Storing the results at each step saves us repetitive recursive calls.
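(To make the shape of that concrete, here is a hedged sketch of the recursion-with-memoization idea. The table AddChain, the helper getPow, and the cache InnerChain are illustrative names rather than the exact patch code; it assumes the usual LLVM headers and that InnerChain is zero-initialized and sized past the largest exponent.)

  // AddChain[i] names the two smaller exponents whose powers multiply to x^i;
  // InnerChain caches the Value* already emitted for each exponent, so a
  // sub-expression such as x^2 is generated only once.
  static const std::pair<unsigned, unsigned> AddChain[] = {
      {0, 0}, {0, 0}, {1, 1}, {1, 2}, {2, 2}, {2, 3}, {3, 3}, {2, 5}};

  static Value *getPow(Value *X, unsigned Exp, IRBuilder<> &B,
                       SmallVectorImpl<Value *> &InnerChain) {
    if (InnerChain[Exp])                    // already materialized
      return InnerChain[Exp];
    if (Exp == 1)
      InnerChain[Exp] = X;
    else
      InnerChain[Exp] =
          B.CreateFMul(getPow(X, AddChain[Exp].first, B, InnerChain),
                       getPow(X, AddChain[Exp].second, B, InnerChain));
    return InnerChain[Exp];
  }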
I have also addressed David's comments in this patch (added an assertion for Exp!=0 and removed check-prefix from the test case).
Please let me know your thoughts on this.
Hi @majnemer,
I guess I have addressed all your concerns in my latest patch.
Please let me know if you have any other concerns.
Thanks,
Mandeep
Please run clang-format on the patch. At minimum it should reduce the amount of unnecessary vertical space in the getPow function.
D14466 provides a variable (bool unsafeFPMath) to check for -fast-math. I made use of this in the fast-math check.
Also updated the comment for the fast-math check.
Changed if (InnerChain[Exp] != nullptr) to if (InnerChain[Exp]).
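(In other words, roughly the following; an illustrative snippet, with unsafeFPMath taken from D14466 as noted above and InnerChain from the patch:)

  if (!unsafeFPMath)     // only expand pow() into fmul chains under fast-math
    return nullptr;
  ...
  if (InnerChain[Exp])   // pointer truthiness instead of an explicit != nullptr
    return InnerChain[Exp];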
Pointers go on the right side. Why is Exp signed?