This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Use the reciprocal estimation machinery
ClosedPublic

Authored by evandro on Apr 22 2016, 11:52 AM.

Details

Summary

This patch adds support for estimating the square root, its reciprocal and division or reciprocal using the combiner generic reciprocal machinery.

I intend to follow this patch with support for using the respective step instructions. Later, with support to consider the context, specifically whether it's bound by latency or throughput and whether the additional multiplications of the series overwhelm execution units already under pressure.

Diff Detail

Event Timeline

evandro updated this revision to Diff 54692.Apr 22 2016, 11:52 AM
evandro retitled this revision from to [AArch64] Use the reciprocal estimation machiner.
evandro updated this object.
evandro set the repository for this revision to rL LLVM.
evandro retitled this revision from [AArch64] Use the reciprocal estimation machiner to [AArch64] Use the reciprocal estimation machinery.Apr 25 2016, 7:45 AM

Ideally we would enable this everywhere (and not need to add additional features). Do you have any idea what the impact would be on other cores?

Cheers,
Silviu

evandro added a comment.EditedApr 25 2016, 9:03 AM

Ideally we would enable this everywhere (and not need to add additional features). Do you have any idea what the impact would be on other cores?

I have an idea on A57, where only sqrt(), but neither sqrtf() or any division, would be beneficial to emit the series instead.

Then again, I'm testing the waters by opting to use additional features instead of a sequence of plethora of isCPU(). Perhaps it's time to use features, even of some other kind, so that all such nuances remain in the machine descriptions instead of peppered all over the rest of the source code. Perhaps not.

eastig added inline comments.Apr 26 2016, 5:28 AM
llvm/lib/Target/AArch64/AArch64.td
67 ↗(On Diff #54692)

I think it's better to use names something like:
FeatureReciprocalSqrtEstimate
FeatureReciprocalEstimate

They don't directly calculate DIV and SQRT. They calculate aproximate 1/DIV and 1/SQRT.

142 ↗(On Diff #54692)

Why is FeatureApproximateDiv not on the list of the features?

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4663

Most of the code of the functions is the same. Only differences are RecipOp string and Opcodes. I would suggest to create a template function or a function with common code. Also documenting parameters will be good. Am I correct they will be used for step versions in the future?

llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
202

I suggest to put this code into a separate function(s): initReciprocals. This will help not to make mess if in the future more initialization is added.

Thank you.

llvm/lib/Target/AArch64/AArch64.td
67 ↗(On Diff #54692)

Do you mean to replace HasApproximateSqrt with FeatureReciprocalSqrtEstimate and so on?

142 ↗(On Diff #54692)

Because Exynos M1 doesn't benefit from it. It might be beneficial in other cores, when their respective maintainers should be better equipped than me to decide whether to add such features or not.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4663

Will do.

I'm currently working on having the step instrs to be emitted instead of discrete code.

rengolin edited edge metadata.Apr 26 2016, 5:43 AM

Hi Evandro,

Wouldn't this require some form of -ffast-math flag? I assume precision will be affected somehow.

cheers,
--renato

Wouldn't this require some form of -ffast-math flag? I assume precision will be affected somehow.

Indeed it is, but DAGCombiner already makes sure of that.

eastig added inline comments.Apr 26 2016, 6:02 AM
llvm/lib/Target/AArch64/AArch64.td
67 ↗(On Diff #54692)

Yes.
When I first read the names of these definitions I had initial understanding based on their names. Later I saw instructions and from their documentation I found what they do. It didn't match to my initial understanding. I think when someone who is not familiar with these instructions starts looking at the code he or she will have the similar issue.

142 ↗(On Diff #54692)

Sorry for a naive question.
Do we need to have these feature on the list? I thought they are part ARMv8-A NEON.

Thank you.

llvm/lib/Target/AArch64/AArch64.td
67 ↗(On Diff #54692)

On the other hand, the reciprocals are later used to calculate the square root and division.

All features start with "Has", that's why I did the same. However, at the moment, the only features in use are ISA features, whereas this is an operation that, though it depends on an ISA feature, just makes use of it. So, in a way, if it started with "Does" it might be more appropriate. I'd appreciate to find out what others think.

But, if it makes too many waves dissipating in every direction, I'd rather fall back to the silly trains of "isCPUNAME" tests.

142 ↗(On Diff #54692)

They are, but whether it's beneficial to always use them or not, beyond explicit intrinsics, is another matter that would depend on specific sub-targets.

evandro updated this revision to Diff 55019.Apr 26 2016, 9:23 AM
evandro edited edge metadata.

Address the previous comments that were more clearly understood.

evandro marked 3 inline comments as done.Apr 26 2016, 9:29 AM

Thank you.

jmolloy requested changes to this revision.Apr 27 2016, 5:04 AM
jmolloy added a reviewer: jmolloy.
jmolloy added a subscriber: jmolloy.

Hi,

Then again, I'm testing the waters by opting to use additional features instead of a sequence of plethora of isCPU(). Perhaps it's time to use features, even of some other kind, so that all such nuances remain in the machine descriptions instead of peppered all over the rest of the source code. Perhaps not.

I don't like the use of features here. They are a very large, indiscriminate hammer for when to enable this optimization. I don't know about Exynos M1, but on many chips the decision of whether to use reciprocals or not is contextual.

Often, the iterative SQRT instruction is faster in latency than a reciprocal alternative. Not only because there is less instruction fetch/dispatch/issue overhead but also because the iterative version can exit early in hardware if the NR steps converge quickly. A reciprocal alternative has to have a fixed number of steps which must be enough for the worst case. The reciprocal has the advantage that it is fully pipelined whereas the iterative SQRT might not be.

In my experiments, reciprocals are a poor choice for any situations where there are

(a) few data items to process, or 
(b) the sqrt/div is on the critical path.

So this sequence would be pessimized by changing to reciprocals:

t = 0;
for (...) {
  t = t + a[i];
  t /= b[i];
}

Because the divide is on the critical path and is a loop dependence, the core can never overlap executions of the divide, so changing to reciprocals and extending the critical path would be a lose.

However here:

for (...) {
  a[i] = a[i] / b[i];
}

We can vectorize and unroll this. For this, reciprocals could be a *significant* win.

Your current implementation doesn't consider any of the situations where reciprocals might be beneficial or not, and it can't (because you've moved the heuristic out of TTI/TLI into Subtarget).

Cheers,

James

This revision now requires changes to proceed.Apr 27 2016, 5:04 AM

James,

It seems to me that your objection is not so much against this patch as against the machinery in the DAGCombiner.

I understand your points, even adding that the series takes the pressure from the unit(s) that perform division and square root and puts it unto the unit(s) that perform multiplication.

I do intend to investigate this issue further, but I think that it's incremental, if not tangential, to what this patch proposes.

Finally, I'm interested in understanding better what you think would be more appropriate than using features.

Thank you.

Hi Evandro,

It seems to me that your objection is not so much against this patch as against the machinery in the DAGCombiner.

Perhaps. The important thing is how we decide whether to perform this optimisation or not. A hook function, as we normally use, has the ability to be extended to add more logic. A subtarget feature does not. I wouldn’t like to see optimization decisions being switched on or off by subtarget features.

Cheers,

James

The important thing is how we decide whether to perform this optimisation or not. A hook function, as we normally use, has the ability to be extended to add more logic. A subtarget feature does not. I wouldn’t like to see optimization decisions being switched on or off by subtarget features.

+1

James,

Perhaps. The important thing is how we decide whether to perform this optimisation or not. A hook function, as we normally use, has the ability to be extended to add more logic. A subtarget feature does not. I wouldn’t like to see optimization decisions being switched on or off by subtarget features.

Whether it is to use the series or not in a loop bound by latency, as in the 1st example, and in a loop bound by throughput, as in the 2nd example, or in a sequence with many multiplications is still something that would still depend on the sub-target. For, say, in a sub-target division or square root could take much longer than the series or there are plenty of units capable of multiplications. Hopefully the context of this hook may allow querying the sub-target about such details. I'll start working on this soon.

Still, methinks that this patch is good for some sub-targets as is and it shouldn't be the enemy of "perfection". ;-)

Thank you.

Hi Evandro,

evandro updated this revision to Diff 55666.Apr 29 2016, 2:11 PM
evandro updated this object.
evandro edited edge metadata.

I removed the features in this version of the patch and added test cases.

evandro updated this revision to Diff 55667.Apr 29 2016, 2:13 PM
evandro edited edge metadata.

Ahem... and added test cases.

Hi Evandro,

This looks much nicer, thanks! Just a couple of readability issues left from my side.

Cheers,

James

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4636

For readability, please could you put the test here in brackets and have a space before the '?' of the ternary? Same below:

std::string RecipOp = (AArch64ISD::FRECPE) ? "div" : "sqrt";
llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
150

Please can you extract the heuristic computation from the call here, to make it explicit that's what it is?

Something like:

// FIXME: Only enable SQRT reciprocals for M1 for the moment.
bool UseRSQRTE = ST.isExynosM1();
TM.Options.Reciprocals.setDefaults("sqrtf", UseRSQRTE, ExtraStepsF);
evandro marked 2 inline comments as done.May 4 2016, 7:58 AM

In the meantime, I've been considering how to issue the iterations using FRSQRTS and FRECPS and it a possible way would be to have new SDNode for them, used directly by the DAGCombiner, which is by default lowered as a polynomial or, as for ARM, as the respective instructions. Thoughts?

Thank you.

evandro updated this revision to Diff 56153.May 4 2016, 7:59 AM
evandro removed rL LLVM as the repository for this revision.
jmolloy accepted this revision.May 4 2016, 8:08 AM
jmolloy edited edge metadata.

LGTM!

In the meantime, I've been considering how to issue the iterations using FRSQRTS and FRECPS and it a possible way would be to have new SDNode for them, used directly by the DAGCombiner, which is by default lowered as a polynomial or, as for ARM, as the respective instructions. Thoughts?

Thank you.

How does X86 do this?

This revision is now accepted and ready to land.May 4 2016, 8:08 AM

How does X86 do this?

It doesn't. AFAIK, ARM is the only target that has an instruction for the series steps. All the other targets that can estimate the initial value, X86, PPC, IA64, MIPS, compute the series steps with polynomials.

Committed initial implementation as r268539.

rengolin closed this revision.Jun 27 2016, 6:53 AM