
New unsafe-fp-math implementation for X86 target
Needs Review · Public

Authored by avt77 on Nov 18 2016, 7:24 AM.

Details

Summary

The current fast-math implementation is based on DAGCombiner. One disadvantage of that is that it ignores any cost model (throughput and/or length). Another disadvantage is that a target-specific optimization is implemented at an improper level of the transformation pipeline. The introduced version moves the implementation lower, into MachineCombiner. As a result we use the existing transformation mechanism, namely getMachineCombinerPatterns and genAlternativeCodeSequence. In addition, both the throughput and the length cost models are applied automatically.
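
Roughly, the MachineCombiner asks the target for candidate patterns rooted at an instruction and then for a replacement sequence; the sketch below only shows the shape of the X86 hook for the FDIV case (the pattern enum value is hypothetical and the opcode list is illustrative, not the exact contents of this patch):

    // Sketch only: recognize a scalar fast-math FDIV that may be rewritten as a
    // reciprocal estimate plus refinement. The MachineCombiner later costs the
    // alternative sequence against the original instruction.
    bool X86InstrInfo::getMachineCombinerPatterns(
        MachineInstr &Root,
        SmallVectorImpl<MachineCombinerPattern> &Patterns) const {
      switch (Root.getOpcode()) {
      case X86::DIVSSrr:
      case X86::VDIVSSrr:
        // Hypothetical pattern kind; the real patch defines its own entries.
        Patterns.push_back(MachineCombinerPattern::DIV_ESTIMATE);
        return true;
      default:
        return TargetInstrInfo::getMachineCombinerPatterns(Root, Patterns);
      }
    }

genAlternativeCodeSequence then emits the reciprocal-estimate instructions for the chosen pattern, and the combiner keeps them only if the scheduler model says the new critical path and resource length are no worse.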

This patch is only an initial step to demonstrate the intention. It implements only one type of transformation: reciprocal-estimate code instead of the vdivss instruction. I'm going to support other types of reciprocal optimizations as well, but I'd like to get initial comments on this work first. The code compiles cleanly and produces working output.
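
For reference, the reciprocal-estimate rewrite computes 1/a from a hardware estimate x0 ≈ 1/a refined by one Newton-Raphson step; a minimal scalar sketch of that step (not the patch's code) is:

    // One Newton-Raphson refinement of a reciprocal estimate x0 ~= 1/a:
    //   x1 = x0 * (2 - a * x0)
    // which is what the rcpss + mul/sub (or FMA) sequence computes instead of
    // issuing vdivss.
    float recip_one_step(float a, float x0) {
      return x0 * (2.0f - a * x0);
    }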

Diff Detail

Event Timeline

Gerolf added inline comments.Dec 21 2016, 2:36 PM
include/llvm/CodeGen/MachineInstr.h
1154 ↗(On Diff #81735)

MBB instead of P? Is this for debugging?

lib/CodeGen/MachineCombiner.cpp
160

Please add a comment that explains why default is used.

+ assert(DefIdx || UseIdx);

lib/CodeGen/MachineInstr.cpp
1712 ↗(On Diff #81735)

you can use the new parameter and then if (!MBB) MBB=getParent()

So what do you suggest? Can we get this in before refactoring the MC alternative code pattern system or will that cause too much additional complexity?

Either way, it sounds like we'd be better off doing this all after the 4.0 branch - do you agree?

avt77 added a comment.Dec 22 2016, 4:38 AM

What do you mean when you speak about "automation"? Do you mean the possibility of describing alternative sequences with tools like TableGen? If so, I'm afraid it will require some real time to implement. That's why, from my point of view, hand-made patterns similar to the given one could be really useful in the future. But of course it would be really interesting to launch such a project. Right?

avt77 updated this revision to Diff 82342.Dec 22 2016, 8:12 AM
avt77 edited edge metadata.
  1. The current trunk already has changes in Machine::print, etc. similar to those in my initial patch. Because of that I removed all the corresponding changes and did not answer the corresponding comments.
  2. It seems I fixed all other requirements from Gerolf.
  3. But the main question is the same: should we continue with the effort?
  1. But the main question is the same: should we continue with the effort?

Absolutely, but I'm starting to think it makes sense to wait until after the 4.0 branch, then get this in, as we need an x86 implementation for reference, and then begin an investigation into how we want the MachineCombiner to develop.

The "automatic" generation of pattern e.g. with TableGen is on my longer term wish list, not a requirement for this patch. Sorry if my wording was confusing.
Do you have performance numbers?

lib/Target/X86/X86InstrInfo.cpp
10109

-> dividend

10123

Why this special case? just C = C->getSplatValue() would be easier to read.

10215

wey -> why?

10289

Should there be a 2:?

avt77 added a comment.Dec 23 2016, 8:34 AM

Yes, I've just got the numbers. I created 2 versions of the clang compiler: one directly from trunk and one with my patch applied. Then, with the help of these compilers, I created 2 new compilers with the following configuration:

cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=<trun/patch compiler home>/build/bin/clang -DCMAKE_CXX_COMPILER=<trunk/patch compiler home>/build/bin/clang++ -DCMAKE_C_FLAGS="-O3 -ffast-math" -DCMAKE_CXX_FLAGS="-O3 -ffast-math" ../llvm

Below you can see the times (I did 2 experiments for every compiler):


Compiler with patch


real 32m10.783s
user 125m19.424s
sys 3m8.456s

real 31m20.432s
user 122m2.012s
sys 3m4.444s


Trunk based compiler


real 31m46.001s
user 123m39.192s
sys 3m10.180s

real 40m6.791s
user 156m5.472s
sys 3m36.476s

Of course these are very rough estimates because I used our server and there is a lot going on around it. But the general picture is clear from my point of view: my patch does not increase the compilation time.
Is that enough, or should I do other experiments?

davide added a subscriber: davide.Dec 23 2016, 8:50 AM

I'll let Simon decide, but these numbers are iffy. I can't necessarily conclude your patch increases compile time, but I can't conclude anything else either. In particular, the stock clang measurements have a variance of 20% between consecutive runs, so I have very little faith in the numbers collected.
Rafael recent(ish)ly published a set of suggestions/knobs to turn on to get relatively stable numbers on a Linux machine. I'm also pretty sure the topic of how to get {reliable numbers, numbers you can have faith in} has been discussed multiple times (look at the archives; Sean has generally pretty informative posts/insights on the topic).

avt77 added a comment.Dec 23 2016, 8:53 AM

Two more comments:

  1. I did not update the patch according to the latest Gerolf comments; I'll do it ASAP.
  2. Gerolf asked: "Perhaps I missed it but I expected the optimization to kick in only under fast math. I saw 'fast' in the test cases, but didn't see a check in the code."

    I check fast-math in "static bool getFDIVPatterns"

    switch (TLI->getRecipEstimateDivEnabled(VT, *MF)) {   // this line checks the per-function option
    case TLI->ReciprocalEstimate::Disabled:
      return false;
    case TLI->ReciprocalEstimate::Unspecified:
      if (Root.getParent()->getParent()->getTarget().Options.UnsafeFPMath)
        // and here I check the command-line option if there is no per-function code
avt77 added a comment.Dec 23 2016, 9:01 AM

I'll let Simon decide, but these numbers are iffy. I can't necessarily conclude your patch increases compile time, but I can't conclude anything else either. In particular, the stock clang measurements have a variance of 20% between consecutive runs, so I have very little faith in the numbers collected.
Rafael recent(ish)ly published a set of suggestions/knobs to turn on to get relatively stable numbers on a Linux machine. I'm also pretty sure the topic of how to get {reliable numbers, numbers you can have faith in} has been discussed multiple times (look at the archives; Sean has generally pretty informative posts/insights on the topic).

Could you give me a reference to the info? Of course I'll try to find it myself but with your help it could be faster.

I'll let Simon decide, but these numbers are iffy. I can't necessarily conclude your patch increases compile time, but I can't conclude anything else either. In particular, the stock clang measurements have a variance of 20% between consecutive runs, so I have very little faith in the numbers collected.
Rafael recent(ish)ly published a set of suggestions/knobs to turn on to get relatively stable numbers on a Linux machine. I'm also pretty sure the topic of how to get {reliable numbers, numbers you can have faith in} has been discussed multiple times (look at the archives; Sean has generally pretty informative posts/insights on the topic).

Could you give me a reference to the info? Of course I'll try to find it myself but with your help it could be faster.

http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20161017/398831.html

avt77 added a comment.Dec 24 2016, 6:35 AM

I made new experiments, but now I used a dedicated computer for them:

atischenko@ip-172-31-21-62:~/workspaces$ cat time.log
cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/home/atischenko/workspaces/step3/build/bin/clang -DCMAKE_CXX_COMPILER=/home/atischenko/workspaces/step3/build/bin/clang++ -DCMAKE_C_FLAGS="-O3 -ffast-math" -DCMAKE_CXX_FLAGS="-O3 -ffast-math" ../llvm
time ninja -j4


Compiler with patch


real 32m10.783s
user 125m19.424s
sys 3m8.456s

real 31m20.432s
user 122m2.012s
sys 3m4.444s


On dedicated computer

real 31m23.564s
user 122m13.120s
sys 3m5.340s

real 31m28.115s
user 122m30.596s
sys 3m6.588s

real 31m22.861s
user 122m8.920s
sys 3m7.236s


Trunk based compiler


cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/home/atischenko/workspaces/bootstrap-trunk/build/bin/clang -DCMAKE_CXX_COMPILER=/home/atischenko/workspaces/bootstrap-trunk/build/bin/clang++ -DCMAKE_C_FLAGS="-O3 -ffast-math" -DCMAKE_CXX_FLAGS="-O3 -ffast-math" ../llvm
time ninja -j4

real 31m46.001s
user 123m39.192s
sys 3m10.180s

real 40m6.791s
user 156m5.472s
sys 3m36.476s


On dedicated computer

real 31m24.634s
user 122m14.912s
sys 3m8.080s

real 31m22.833s
user 122m9.836s
sys 3m5.676s

real 31m22.795s
user 122m5.924s
sys 3m8.588s

I hope we can now trust my results.

avt77 updated this revision to Diff 82605.Dec 28 2016, 9:35 AM

I fixed the last issues raised by Gerolf, except one related to the special case of "if", because the suggested change breaks the current logic.

Gerolf added a comment.Jan 2 2017, 1:49 PM

From my perspective the implementation is close and only requires a few minor changes.

The compile-time numbers I've seen so far are meaningless (wide variation, unclear if/how many times your code actually fires, etc.), but I'm not too concerned about O3 fast-math compile time and would give it the benefit of the doubt.

I did ask the question about performance benefits twice, to no avail, and admit I'm still curious. I assume that to get these numbers you set your machines into perf mode rather than using servers running some random load.

Thanks
Gerolf

lib/Target/X86/X86InstrInfo.cpp
10105

Add a comment before the function explaining what it does.

10109

it's -> it is

10177

This function takes 7 parameters as input, but the comment only lists 6.

10214

I keep stumbling over this comment every time I read it: did you mean to say something like
//Execute at least one iteration.
Iterations = Max(1, Iterations);

10217

Where is that input sequence documented?

avt77 updated this revision to Diff 84535.Jan 16 2017, 3:18 AM

I fixed everything except one comment (see below). And I collected new perf numbers. Now I used the following command for bootstrap building:

time make -j 1

As a result the reproducibility is very good from my point of view. In addition I tried to get numbers according to the description in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20161017/398831.html; the reproducibility is almost the same but the required time is even longer (about 2 hours for every test). Because of that I kept my numbers. The test itself is very simple: I created 2 versions of the compiler: the first one was built directly from trunk and the second one was built after applying my patch. Then with the help of each compiler I created 4 bootstraps. The results are below:

The trunk version of the compiler builds the bootstrap like this (there were 4 runs with "time make -j 1"):

real  91m7.998s
real  90m23.861s
real  90m26.154s
real  90m31.533s

The version of the compiler with my patch builds the bootstrap like this (the same 4 runs with "time make -j 1"):

real  90m43.970s
real  90m7.257s
real  90m6.671s
real  90m11.733s

Obviously, the compilation time does not depend on my patch.

avt77 added inline comments.Jan 16 2017, 3:22 AM
lib/Target/X86/X86InstrInfo.cpp
10177

I did not understand this comment: what should I do here?

RKSimon added inline comments.Jan 16 2017, 1:41 PM
lib/Target/X86/X86InstrInfo.cpp
10177

I think it means that while ArrayRef<int> Instrs has 7 instructions listed, the codegen in the comment only shows 6 instructions.

avt77 added inline comments.Jan 18 2017, 1:03 AM
lib/Target/X86/X86InstrInfo.cpp
10177

But in fact all 7 instructions are shown, just indexed from 0 to 6 (maybe in a "strange" order: 0,2,1,3,4,5,6). If you'd like, I could change the order and/or start numbering from 1. Gerolf, should I do that, or have we fixed everything?

chandlerc added a subscriber: chandlerc.

So, I'm mostly lurking, but I want to point out a serious issue here: Clang and LLVM have as little floating point in them as we could manage. So I would expect them to be quite uninteresting for testing the compile-time impact of a patch that is only concerned with floating-point code....

And there still are no numbers around the improvement here...

I suspect you'll need to provide benchmark data from at least SPEC and/or the LLVM test suite that shows this is an improvement, and compile-time numbers to show that the improvement doesn't cost too much... At least, that would be my expectation. Those at least do include some floating-point code. You might also try running benchmarks from the Eigen project, which has a very large amount of floating-point code. However, they usually don't build with any unsafe math flags, so correctness issues may dominate.

It'd also be great to hear from others invested in LLVM's FP lowering like Hal, Steve, etc...

avt77 added a comment.Jan 18 2017, 9:26 AM

What is "Eigen project"? Could you point me to it?

I'm leaning towards an LGTM since you addressed basically all my issues, but more people have chimed in and are curious about your performance data. So I think you can't dodge that question anymore and need to share some good data for the benchmark(s) you are tuning for before you get the nod.

Cheers
Gerolf

lib/Target/X86/X86InstrInfo.cpp
10177

In the comments I only count 6 instructions from vmovss to vaddss, but the code handles 7. What am I missing?

Unless the change is trivially and obviously an improvement on inspection of the result, I think you need data before making it. =] I've not looked at this particular change in enough detail to say whether it satisfies that. Maybe Gerolf or Craig could.

If there are compile time concerns (which actually came up) and they deal with floating point and unsafe floating point in particular, then I don't think a bootstrap is a useful way to measure anything one way or the other (there just isn't enough floating point).

What is "Eigen project"? Could you point me to it?

http://eigen.tuxfamily.org/

Unless the change is trivially and obviously an improvement on inspection of the result, I think you need data before making it. =] I've not looked at this particular change in enough detail to say whether it satisfies that. Maybe Gerolf or Craig could.

Another way besides benchmark data to show runtime improvements is to use tools like IACA. This is how we curated most of the vector shuffle improvements over the last few years, for example.
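
For reference, IACA analyzes a compiled object between marker macros; the following is only a hedged sketch of how the candidate sequence could be examined (IACA_START/IACA_END come from the iacaMarks.h header shipped with IACA, the function itself is illustrative, and the command line is approximate):

    // Sketch: let IACA report throughput/latency for the refinement step
    // instead of timing a full benchmark.
    #include "iacaMarks.h"
    float recip_newton(float a, float x0) {
      IACA_START
      float x1 = x0 * (2.0f - a * x0);  // sequence under analysis
      IACA_END
      return x1;
    }
    // Then analyze the object file, roughly: iaca -arch HSW recip_newton.o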

avt77 added a comment.Jan 19 2017, 3:04 AM

Chandler, in fact this patch should not show any improvement in the generated code. If you look at the changes made in the tests you'll see that the newly generated code is almost identical to the previous one (only some names, the order of instructions, etc.). The idea of the patch is to move this kind of optimization from a rather high level (DAGCombiner) down to a really low level (MachineCombiner). Here we see real target machine instructions, and as a result we can use the real cost model to estimate the real cost of a possible transformation (in the given case the transformation is the replacement of one instruction (div) with a sequence of instructions). The transformation itself already exists inside Clang, but the patch suggests implementing it in another place, and that's it. If we agree with this new place of implementation then it will be the base for future similar optimizations like rsqrt, etc. In addition, this (and follow-up) patch(es) will allow us to remove 'fake' subtarget features like FeatureFastScalarFSQRT / FeatureFastVectorFSQRT, etc. The question from Gerolf was not about the quality of the generated code (it's the same as we have now) but about the compilation time only.
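
To make the cost-model point concrete, here is a toy (non-LLVM) sketch of the decision the MachineCombiner makes with the scheduler model; the latencies are made up for illustration, not taken from any real model:

    // Toy illustration only: replace the divide when the dependent
    // estimate + refinement chain is shorter than the division's latency
    // on the CPU being targeted.
    #include <cstdio>
    int main() {
      const int DivLatency = 19;            // e.g. a slow scalar divide
      const int RecipChain[] = {5, 5, 5};   // estimate + two refinement ops
      int ChainLatency = 0;
      for (int L : RecipChain)
        ChainLatency += L;                  // fully dependent sequence
      std::printf("replace fdiv: %s\n",
                  ChainLatency < DivLatency ? "yes" : "no");
      return 0;
    }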

Of course I'll try to collect the required perf numbers, but they should be the same as what we have now. Do we really need them?

Hi All,
I found a really "stress" test for div operations (see the attachment)

(thanks to Sanjay Patel). The test shows perhaps the worst case of the possible degradation caused by this patch. I used the following command with 2 different compilers:

cmake -E time ./llc spill_div.ll -o /dev/null -enable-unsafe-fp-math

For "pure" trunk compiler I got: Elapsed time 2 s.
For compiler with patch I got: Elapsed time 18 s.

(I launched the test several times with the same results.)
What now? Is it acceptable? Should I try to optimize the patch? Should I try other benchmarks?
(I tried both LNT and Eigen but unfortunately they don't work for me at the moment because of unpredictable runtime issues.)

Do you have any profiling info on where the time is going please? @Gerolf might then be able to advise whether we can improve the MCCombiner mechanism before/after this patch has gone in.

I'll jump in here because I supplied this (hopefully degenerate and worst) case based on my earlier reassociation transforms for MachineCombiner (see D10321 where I first mentioned the potential compile-time problem). When I looked into that, the time was all going into MachineTraceMetrics::Ensemble::computeInstrDepths() and MachineTraceMetrics::Ensemble::computeInstrHeights(). Those got very expensive for long sequences of instructions. Possible fixes would be to improve how those are computed, cache those results, and/or eliminate how often we ask for those values.

We were ok with some additional potential compile-time cost because reassociation opportunities do not appear to be that common and were limited to -ffast-math compiles . We can make similar arguments for the recip transforms in this patch?

But it is worth noting that since the time of D10321, the number of reassociation candidate opcodes that x86 matches has grown to ~200 (X86InstrInfo::isAssociativeAndCommutative()) and includes integer ops. We're probably overdue on measuring and improving the perf of MachineCombiner.

I just re-read some of the comments in this review, and back on Nov 30 I explained the possible compile-time hit. I requested a note to llvm-dev at that time. Based on the confusion in the subsequent comments (people think this patch will affect execution perf), I am making the suggestion to post to llvm-dev again. Moving from DAG heuristics to MI-scheduler-based transforms is a change in strategy for all targets and explaining that to a wider audience is an opportunity to get good feedback.

hfinkel edited edge metadata.Jan 26 2017, 8:29 AM

Certainly seems worthwhile exploring whether those can be cached (if I understand what they're doing, we do essentially cache very-similar values during MI scheduling). This worst-case hit is definitely undesirable, and we can certainly run into lots of machine-generated straight-line code, so hitting these kinds of cases in the wild is not unthinkable.

We were ok with some additional potential compile-time cost because reassociation opportunities do not appear to be that common and were limited to -ffast-math compiles . We can make similar arguments for the recip transforms in this patch?

But it is worth noting that since the time of D10321, the number of reassociation candidate opcodes that x86 matches has grown to ~200 (X86InstrInfo::isAssociativeAndCommutative()) and includes integer ops. We're probably overdue on measuring and improving the perf of MachineCombiner.

I think the only issue that needs to be addressed is (finally!) sharing perf data. This has been raised at least 3 times. The possible compile-time implication, the speciality of the application (fast-math) etc are well understood.

Gerolf

As I understand it, the idea is that by moving this to the MC, these alternative patterns will only be used if (1) the fast-math flags permit it and (2) the target CPU's scheduler model indicates that it's quicker? So what you are asking is that we time the two versions of the code on specific CPUs to check that in each case the correct decision is made?

This probably means that the tests should be updated to check against a couple of specific target CPUs as well - we're limited by what x86 scheduler models we have, but I know Jaguar (btver2) should use the rcpps version in all cases, and I expect Haswell should use divps.

A quick look at the SandyBridge scheduler model suggests its latency for FDIV is too low (especially for ymm, as it only has a 128-bit div ALU), so that will select divps when it probably shouldn't....

avt77 added a comment.Jan 27 2017, 1:07 AM

I think the only issue that needs to be addressed is (finally!) sharing perf data. This has been raised at least 3 times. The possible compile-time implication, the speciality of the application (fast-math) etc are well understood.

Gerolf

Sorry, but I don't understand what "sharing" means in this case. I put all the perf numbers here in the comments. Is that not enough for sharing? If not, where should I share them? Or maybe my perf numbers are not perf numbers from your point of view? Please clarify.

And the next question is about profiling data. Should I collect it? I've already started the process, but now I'm not sure if it's of interest to anybody.

avt77 added a comment.Jan 27 2017, 3:45 AM

I got the first profiling data. In fact it's the same as what Sanjay described:

Samples: 1M of event 'cycles:pp', Event count (approx.): 1180464
Overhead Command Source Shared Object Source Symbol

15,65%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::computeInstrDepths
15,18%  llc      llc                   [.] getDataDeps
 9,23%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::computeInstrHeights
 8,29%  llc      llc                   [.] pushDepHeight
 8,15%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::invalidate
 5,64%  llc      llc                   [.] llvm::TargetInstrInfo::defaultDefLatency
 4,89%  llc      llc                   [.] llvm::MachineTraceMetrics::getResources
 2,44%  llc      llc                   [.] llvm::X86InstrInfo::isHighLatencyDef

Should I try to cache the metrics, or is that a question for a separate patch?

avt77 updated this revision to Diff 86042.Jan 27 2017, 4:39 AM

I updated the recip-fastmath2.ll test according to Simon's recommendations. Now it includes special checks for different CPUs: SandyBridge, Haswell and btver2. These new checks demonstrate that the alternative sequence of instructions is selected only when it's really cheaper than the single fdiv instruction. (Obviously we should change the cost numbers for SandyBridge because they are too small.)

I think you should look at caching them, or limiting their depth (beyond a certain point, the exact answer might not matter), or both.
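
A minimal sketch of the caching idea (plain C++, not MachineTraceMetrics itself): keep the computed per-block depths and recompute only when the block has been invalidated, instead of rebuilding the trace for every candidate combine.

    // Illustrative only: cache depths keyed by block identity plus a version
    // counter that callers bump whenever they modify the block.
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Block {
      std::vector<int> InstrLatency;  // one entry per instruction
      unsigned Version = 0;           // bumped on any modification
    };

    struct DepthCache {
      std::unordered_map<const Block *, std::pair<unsigned, std::vector<int>>> Cached;

      const std::vector<int> &depths(const Block &B) {
        auto &Entry = Cached[&B];
        if (Entry.second.size() != B.InstrLatency.size() || Entry.first != B.Version) {
          Entry.first = B.Version;
          Entry.second.clear();
          int D = 0;
          for (int L : B.InstrLatency) {  // straight-line dependence chain
            Entry.second.push_back(D);
            D += L;
          }
        }
        return Entry.second;
      }
    };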

Sorry, but I don't understand what "sharing" means in this case. I put all the perf numbers here in the comments. Is that not enough for sharing? If not, where should I share them? Or maybe my perf numbers are not perf numbers from your point of view? Please clarify.

You shared both compile-time numbers and runtime numbers by building clang, which is "insensitive" to floating point optimization. So you were asked to better motivate your change with benchmarks that can show codegen improvements in practice.

Chandler, in fact this patch should not show any improvement in generating code. [...] The idea of the patch is moving of such kind of optimization from the rather high level (DAGCombiner) to the really low level (MachineCombiner), [....] if we agree with this new place of implementation then it will be the base for future possible similar optimizations like rsqrt, etc. And in addition this (and follow up) patch(es) will allow us to remove 'fake' subtarget features like FeatureFastScalarFSQRT / FeatureFastVectorFSQRT, etc.

At this point, without an example showing that you can outperform the DAG approach, it is quite hypothetical ("believe me, it is better!").
However, GlobalISel could be a motivation? It'll need to reproduce all the combines from the DAG there, won't it?

The question from Gerolf was not about the quality of the generated code (it's the same like we have now) but about the compilation time only.

I think he was asking both :)

guyblank added inline comments.
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15331 ↗(On Diff #86042)

sorry for joining the discussion so late.

IIUC this affects all X86 CPUs, but I didn't see handling for all possible types (as stated in a TODO below).

specifically, does this affect AVX-512 code?

avt77 added inline comments.Jan 31 2017, 1:46 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15331 ↗(On Diff #86042)

At the moment we support only a limited set of types (see X86InstrInfo::genAlternativeCodeSequence below), but we are ready to extend it if necessary.

guyblank added inline comments.Feb 1 2017, 1:29 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15331 ↗(On Diff #86042)

you support the machine combiner approach for a limited set of types/instructions, but you've disabled the DAG combine approach for ALL types/instructions.

I ran CodeGen/X86/recip-fastmath.ll on knl with your changes.
the output for @f32_one_step (for example) changed from

vrcp14ss %xmm0, %xmm0, %xmm1
vfnmadd213ss .LCPI1_0(%rip), %xmm1, %xmm0
vfmadd132ss %xmm1, %xmm1, %xmm0

to
vmovss .LCPI1_0(%rip), %xmm1
vdivss %xmm0, %xmm1, %xmm0
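
For context, the three-instruction sequence above is one Newton-Raphson step written with FMAs; a scalar sketch of what it computes (assuming the loaded constant is 1.0) is:

    // e  = 1 - a*x0      (vfnmadd213ss with the memory constant)
    // x1 = x0*e + x0     (vfmadd132ss)
    // so x1 = x0 * (2 - a*x0), the usual refinement of x0 ~= 1/a.
    #include <cmath>
    float recip_one_step_fma(float a, float x0) {
      float e = std::fma(-a, x0, 1.0f);
      return std::fma(x0, e, x0);
    }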

avt77 updated this revision to Diff 86947.Feb 3 2017, 3:02 AM

I fixed the issue with the compile-time increase - see the usage of MinInstr->getTrace(MBB). Now we get the trace only when we really need it. As a result the execution profile changed completely and the compile time is now even less than it was with the DAG Combiner - about 1.5 s on my laptop (I'm speaking about our worst-case test only).

avt77 added inline comments.Feb 3 2017, 3:03 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15331 ↗(On Diff #86042)

I did not do anything with knl, and as a result you are right: this code generation was lost. Do you know of other examples like that? If the problem is with knl only then I could easily support it. But of course I could add a special option like "-use-machine-combiner" or "-use-dag-combiner". Which is better?

Please separate out the compile time improvements into a new patch for review - now that we have it the "compile time question" shouldn't hold up this patch any longer so we should be able to get this in as soon as the avx512 (knl/skx) issue is answered.

As for AVX512, I am ok with excluding it specifically in DAGCombiner.

But I'm also concerned about other cpus for which we don't have an accurate scheduler model (broadwell, skylake), should these be excluded as well?

lib/Target/X86/X86InstrInfo.cpp
10354

currently in Haswell and newer cpus, the generated sequence is using fma instructions.
this should be taken into account here in the patterns, right?

test/CodeGen/X86/recip-fastmath.ll
2–10

please add RUN commands for specific cpus (as in the other test + AVX512 target)
also commit the test changes and rebase your patch on them so we can see the output changes in these new runs.

I've added the target cpu specific tests to trunk (rL294128) to help us track AVX512 (SKX/KNL) perf

But I'm also concerned about other cpus for which we don't have an accurate scheduler model (broadwell, skylake), should these be excluded as well?

If the cpu is using an older scheduler model then it's already not necessarily being optimally scheduled - I don't see this patch being any different.

lib/Target/X86/X86InstrInfo.cpp
10354

Yes this should be taken into account, but it does mean that the number of codegen patterns is going to start to balloon if we're not careful.

But I'm also concerned about other cpus for which we don't have an accurate scheduler model (broadwell, skylake), should these be excluded as well?

If the cpu is using an older scheduler model then it's already not necessarily being optimally scheduled - I don't see this patch being any different.

Currently the older scheduler model "only" affects the schedule, but with this patch (and future ones using the machine combiner framework) it will affect which instructions we emit. This could possibly lead to generating worse code than we do at the moment.

mkuper added a subscriber: mkuper.Feb 6 2017, 12:58 AM

Recap:

This (and hopefully an equivalent future sqrt/rsqrt patch) is the stand-out example of a case where the actual result depends on the MC and the scheduler model - minor code changes could cause it to switch between full-precision divps and rcpps+nr, but this only happens with the necessary fast/unsafe flags enabled, which means the user knows what to expect.

For most other possible MC cases (e.g. constant rematerialization, shuffles, slow-vs-fast path instructions) there will be differences in the instructions used but not in the final result.

avt77 updated this revision to Diff 88003.Feb 10 2017, 8:31 AM

I fixed all known issues:

  • AVX512 is now again supported by DAGCombiner
  • FMA instructions are being used when FMA is enabled

This version clearly shows the advantage of using the scheduler model: it selects the reciprocal code only when it's profitable (e.g. compare v8f32_one_step and v8f32_one_step_2_divs, etc.)

In addition I removed the speedup-related code from this patch because it was included in the separate review D29627.

ABataev resigned from this revision.Feb 13 2017, 11:42 AM
avt77 added a comment.Feb 14 2017, 4:20 AM

Hi All,
Do we expect anything more here?
It seems I fixed all the requirements. Maybe it's time for an LGTM?

avt77 updated this revision to Diff 89241.Feb 21 2017, 10:29 AM

Guy Blank found a problem with the PIC relocation model on the SLM architecture. I fixed it and added the corresponding test.

avt77 updated this revision to Diff 90329.Mar 2 2017, 6:53 AM

I committed the PIC-related test to trunk and updated this patch so it can be compared with the new code generation.

RKSimon added inline comments.Mar 9 2017, 11:27 AM
lib/CodeGen/MachineTraceMetrics.cpp
521

Is this still true after D29627?

avt77 added inline comments.Mar 10 2017, 12:09 AM
lib/CodeGen/MachineTraceMetrics.cpp
521

Yes, the profile shows that now (after D29627) this feature eats more time than any other

58,19%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::invalidate
 3,44%  llc      llc                   [.] (anonymous namespace)::TwoAddressInstructionPass::scanUses
 3,19%  llc      llc                   [.] llvm::ScheduleDAGSDNodes::ClusterNeighboringLoads
 1,59%  llc      llc                   [.] llvm::SparseMultiSet<llvm::VReg2SUnit, llvm::VirtReg2IndexFunctor, unsigned char>::find
 1,08%  llc      llc                   [.] llvm::X86InstrInfo::areLoadsFromSameBasePtr
RKSimon edited edge metadata.Nov 18 2017, 9:57 AM

@avt77 As we discussed offline, please can you strip out the debug changes and put them into a new patch?

RKSimon resigned from this revision.Feb 19 2019, 10:30 AM