This is an archive of the discontinued LLVM Phabricator instance.

New unsafe-fp-math implementation for X86 target
Needs ReviewPublic

Authored by avt77 on Nov 18 2016, 7:24 AM.

Download Raw Diff

Details

Reviewers

spatel
dsanders
scanon
Gerolf
craig.topper
hfinkel
RKSimon

Summary

The current fast-math implementation is based on DAGCombiner. One of disadvantages of that is ignoring of any cost model (throughput and/or length). Another disadvantage is implementation of target specific optimization at improper level of transformations. The introduced version moves the implementation lower into MachineCombiner. As result we're using the existing mechanism of transformations like getMachineCombinerPatterns and getAlternativeCodeSequence. In addition both throughput and length cost models are being used automatically.

This patch is only initial step to demonstrate the intention. The patch implements only one type of transformation: the reciprocal estimated code instead of vdivss instruction. I'm going to support other types of reciprocal optimizations as well. But I'd like to get first comments on this my job. The code is safely compiled and produces the working output.

Diff Detail

Event Timeline

avt77 updated this revision to Diff 78520.Nov 18 2016, 7:24 AM

avt77 retitled this revision from to New unsafe-fp-math implementation for X86 target.

avt77 updated this object.

avt77 added reviewers: RKSimon, spatel, ABataev.

RKSimon added a subscriber: llvm-commits.Nov 18 2016, 7:43 AM

RKSimon added inline comments.Nov 21 2016, 7:20 AM

lib/Target/X86/X86InstrInfo.cpp
8726	This needs refactoring to support scalar and packed versions of SSE and AVX opcodes if possible.
8774	We don't do reciprocal estimates for double types - these all need removing.
8816	Replace this with a basic TODO comment for future FSQRT pattern support - don't comment out code (especially when it doesn't exist).

The new reciprocal implementation is done on more than 97%: we still don't have public tests but the code changes are almost completed. Please review and send me your comments.

In D26855#606635, @avt77 wrote:

The new reciprocal implementation is done on more than 97%: we still don't have public tests but the code changes are almost completed. Please review and send me your comments.

I applied the patch to trunk, but every example that I tried after that crashed:
Assertion failed: (i < getNumOperands() && "getOperand() out of range!"), function getOperand, file llvm/include/llvm/CodeGen/MachineInstr.h, line 280.

Is it correct that this patch will allow us to remove 'fake' subtarget features like FeatureFastScalarFSQRT / FeatureFastVectorFSQRT ?

In D26855#608325, @spatel wrote:

In D26855#606635, @avt77 wrote:

The new reciprocal implementation is done on more than 97%: we still don't have public tests but the code changes are almost completed. Please review and send me your comments.

I applied the patch to trunk, but every example that I tried after that crashed:
Assertion failed: (i < getNumOperands() && "getOperand() out of range!"), function getOperand, file llvm/include/llvm/CodeGen/MachineInstr.h, line 280.

Is it correct that this patch will allow us to remove 'fake' subtarget features like FeatureFastScalarFSQRT / FeatureFastVectorFSQRT ?

Sorry for crashes - it's a result of my last minute changes and missing of corresponding tests. Hope to fix it today-tomorrow.
About FeatureFastScalarFSQRT / FeatureFastVectorFSQRT: I did not look at these features yet but it seems YES, we'll be able to remove them because the corresponding functionality should appear automatically. The idea of the patch: the throughput(latency) should be calculated automatically for every specific alternative code sequence (e.g. recip instead of div) but I did not check yet if it works at the moment. And we should check other similar features if they exist (FeatureFastLZCNT, etc.) Does it make sense?

BTW, I have a question: this is the first patch in the possible series of patches. This patch introduces the the suggested approach on the example of FDiv - Recip implementation. Should I complete this patch with FDIV only or should I extend it with rsqrt? The patch is rather big that's why I suggest to make it really working with fdiv only and commit. After that I'll be able to make other similar patches faster and safely. Is it OK?

In fact if you try something like

llc < .../llvm/test/CodeGen/X86/recip-fastmath.ll -mtriple=x86_64-unknown-unknown -mattr=+avx

it should work just now: up to now I was working with this test only .

In D26855#608851, @avt77 wrote:

BTW, I have a question: this is the first patch in the possible series of patches. This patch introduces the the suggested approach on the example of FDiv - Recip implementation. Should I complete this patch with FDIV only or should I extend it with rsqrt? The patch is rather big that's why I suggest to make it really working with fdiv only and commit. After that I'll be able to make other similar patches faster and safely. Is it OK?

I think this is the right direction because we want to make use of the actual CPU scheduler model instead of estimates, but implementing this will almost certainly lead to unintended consequences that need to be worked around. For example, one thing I noticed when adding the reassociation patterns to machine combiner: we could significantly increase compile time because calculating critical path/resource height with MachineTraceMetrics can be expensive.

You should send a note to llvm-dev about this plan, the motivation, the benefits, etc. and reference this patch as the starting point. The changes you are proposing could be adapted by all targets, so there should be feedback from various backend people to avoid duplicated effort.

Is this patch an NFC patch? If no, the test must be added

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14950 ↗	(On Diff #79408)	Why is this line commented? If it is not required (but I rather doubt) you should just remove it.
lib/Target/X86/X86InstrInfo.cpp
8651–8653	I believe this can be replaced by a range-based loop
8658	auto &?
8674–8678	Should these loops be range-based loops?
8681–8683	Range-based loop?
8687–8689	No braces required
8762–8763	reformat this

Still looking through this, but here are some initial comments.

lib/Target/X86/X86InstrInfo.cpp
8651	Use a range iterator if you can, same for the other cases below
8661	This should be C->isExactlyValue(1.0)
8669	This should be isDividendExactlyOne. We should be calling it Dividend not Divident as well.
8688	Embed hasAllOnesOperand here.
8722	Replace SmallVector<int, 7> with ArrayRef<int>
8836	This isn't necessary if DividentIsAllOnes == true

The new version for next discussions

I fixed one assertion failing (but there is another one at the moment), moved to the range for, formatted and made some sugar changes.

All failed assertions were resolved. The ability to use old reciprocal implementation for non-x86 platforms was restored. Now we have only one failed test CodeGen/X86/recip-fastmath.ll but that's exactly what we were expecting because I changed the code generation here. I hope that's 99.99% of the patch - please review.

I updated the test related to reciprocal code gen. Now it shows most of possible variants of dividend (not only 1.0 as it was before). Now the patch is ready for final review.

avt77 added reviewers: hfinkel, Gerolf, dsanders.Dec 6 2016, 3:01 AM

Some minor comments, but I'd like to hear other people's thoughts. This might be the first of a number of X86 MC patterns and I want to be sure we're doing this correctly.

include/llvm/CodeGen/MachineInstr.h
1155 ↗	(On Diff #80397)	Explain the purpose of the new 'P' argument?
lib/Target/X86/X86InstrInfo.cpp
8648	Rename this to hasExactlyOneOperand ? AllOnes means 'all bits are set'
8677	Rename to isDividendEactlyOne and update the comment.
8748	Dividend is only used once, merge these? unsigned DividendReg = Root.getOperand(1).getReg();
8749	AllOnes -> ExactlyOne
8814	Can we replace TII->get(Instrs[2]) with MIDesc?
8821	Can we replace TII->get(Instrs[2]) with MIDesc?
8933	Inconsistent braces to the other cases.

Also, please add the additional tests to trunk now with the current codegen so this patch shows the diff.

I fixed all requirements raised by Simon and Alexey. The special test is here as well.

avt77 mentioned this in rL289931: Extra coverage tests to demonstrate fixes in D72618 and D26855.Dec 16 2016, 2:06 AM

I've updated the reciprocal related tests to see difference between old and new code gen more clearly. In fact there is no real difference but the new approach allows to take into account the schedule cost model when we deal with different machine code patterns. This patch should become the first step in the future similar optimizations like rsqrt, etc.

Does anyone else have comments on this please? There has been interest in moving more of such codegen to the MC so that we can use more realistic scheduling costs so I'm keen to see this go in.

test/CodeGen/X86/recip-fastmath2.ll
191 ↗	(On Diff #81735)	Any ideas why we fail to get: movaps {{.*#+}} xmm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00] There are other cases here as well.

A few general remarks:

I'm very much in favor of the MC combiner approach
But I'm getting increasingly concerned about the MC code quality. I felt the FMA/Aarch64 support starts looking crabby (I'm the culprit) and this doesn't look much better I'm afraid. The approach is in need of more automation for better maintainability. This is meant as food for thought. Since I"m too blame for the approach I can't to be harsh in reviews :-)
I'm also concerned about the compile-time (in particular since we don't track x86 specific issues ( or any other backend than Aarch64- or least I'm not aware that anyone is watching closely). Could you share some specific data about your perf gains and compile-time measurements? However, I think this optimization is for fast math only and compile-time is probably less of an issue in that mode. One way to deal with compile-time issues is to wrap some MC under an option.
Perhaps I missed it but I expected the optimization to kick in only under fast math. I saw 'fast' in the test cases, but didn't see a check in the code.

Thanks
Gerolf

lib/Target/X86/X86InstrInfo.cpp
8654	This is an expensive search. There must be a direct simpler way to get that info.
8705	For which types? All FP?
8710	Can you elaborate? eg. on #iterations?
8757	Why?

Gerolf added inline comments.Dec 21 2016, 2:36 PM

include/llvm/CodeGen/MachineInstr.h
1154 ↗	(On Diff #81735)	MBB instead of P? Is this for debugging?
lib/CodeGen/MachineCombiner.cpp
158 ↗	(On Diff #81735)	Please add a comment that explains why default is used. + assert(DefIdx \|\| UseIdx);
lib/CodeGen/MachineInstr.cpp
1712 ↗	(On Diff #81735)	you can use the new parameter and then if (!MBB) MBB=getParent()

So what do you suggest? Can we get this in before refactoring the MC alternative code pattern system or will that cause too much additional complexity?

Whatever, it sounds like we'd be better off doing this all after the 4.0 branch, do you agree?

What do you mean when you speak about "automation"? Do you mean a possibility to describe alternative sequences with tools like TableGen? If yes I'm afraid it'll require some real time to implement. That's why from my point of view the hand-made patterns similar to the given one could be really useful in future. But of course it'd be really interesting to launch such a project. Right?

The current trunk has already changes in Machine::print, etc. similar in my initial patch. Because of that I removed all corresponding changes and did not answer on all corresponding comments.
It seems I fixed all other requirements from Gerolf
But the main question is the same: should we continue with the effort?

In D26855#629715, @avt77 wrote:

But the main question is the same: should we continue with the effort?

Absolutely, but I'm starting to think it make sense to wait until after the 4.0 branch, then get this in as we need an x86 implementation for reference and then begin an investigation into how we want the MachineCombiner to develop.

The "automatic" generation of pattern e.g. with TableGen is on my longer term wish list, not a requirement for this patch. Sorry if my wording was confusing.
Do you have performance numbers?

lib/Target/X86/X86InstrInfo.cpp
8652	-> dividend
8666	Why this special case? just C = C->getSplatValue() would be easier to read.
8758	wey -> why?
8832	Should there be a 2:?

Yes, I've just got the numbers. I created 2 versions of clang compiler: directly from trunk and with my patch applied. Then with help of these compilers I created 2 new compilers with the following configuration:

cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=<trun/patch compiler home>/build/bin/clang -DCMAKE_CXX_COMPILER=<trunk/patch compiler home>/build/bin/clang++ -DCMAKE_C_FLAGS="-O3 -ffast-math" -DCMAKE_CXX_FLAGS="-O3 -ffast-math" ../llvm

Below you can see the times (I did 2 experiments for every compiler):

Compiler with patch

real 32m10.783s
user 125m19.424s
sys 3m8.456s

real 31m20.432s
user 122m2.012s
sys 3m4.444s

Trunk based compiler

real 31m46.001s
user 123m39.192s
sys 3m10.180s

real 40m6.791s
user 156m5.472s
sys 3m36.476s

Of course it's very rough estimations because I used our server and there a lot of things around. But general picture is clear from my point of view: my patch does not increase the compilation time.
Is it enough or I should do other experiments?

In D26855#630340, @avt77 wrote:

Yes, I've just got the numbers. I created 2 versions of clang compiler: directly from trunk and with my patch applied. Then with help of these compilers I created 2 new compilers with the following configuration:

cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=<trun/patch compiler home>/build/bin/clang -DCMAKE_CXX_COMPILER=<trunk/patch compiler home>/build/bin/clang++ -DCMAKE_C_FLAGS="-O3 -ffast-math" -DCMAKE_CXX_FLAGS="-O3 -ffast-math" ../llvm

Below you can see the times (I did 2 experiments for every compiler):

Compiler with patch

real 32m10.783s
user 125m19.424s
sys 3m8.456s

real 31m20.432s
user 122m2.012s
sys 3m4.444s

Trunk based compiler

real 31m46.001s
user 123m39.192s
sys 3m10.180s

real 40m6.791s
user 156m5.472s
sys 3m36.476s

Of course it's very rough estimations because I used our server and there a lot of things around. But general picture is clear from my point of view: my patch does not increase the compilation time.
Is it enough or I should do other experiments?

I'll let Simon decide but these numbers are iffy. I can't necessarily conclude your patch increases compile time but I can't conclude anything else either. In particular, the stock clang measurement have a variance of 20% between consecutive runs, so I have very little faith in the numbers collected.
Rafael recent('ish)ly published a set of suggestions/knob to turn on to get relatively stable numbers on a Linux machine. I'm also pretty sure the topic of how to get {reliable numbers, numbers you can have faith in} has been discussed multiple times (look at the archives, Sean has generally pretty informative posts/insights on the topic).

Two more comments:

I did not update the patch accordingly to the latest Gerolf comments: I'll do it asap
Gerolf asked: "Perhaps I missed it but I expected the optimization to kick in only under fast math. I saw 'fast' in the test cases, but didn't see a check in the code."

I check fast-math in "static bool getFDIVPatterns"

switch (TLI->getRecipEstimateDivEnabled(VT, *MF)) { this line checks per-function option case TLI->ReciprocalEstimate::Disabled: return false; case TLI->ReciprocalEstimate::Unspecified: if (Root.getParent()->getParent()->getTarget().Options.UnsafeFPMath) and here I check command line option if there is no per-function code

In D26855#630341, @davide wrote:

I'll let Simon decide but these numbers are iffy. I can't necessarily conclude your patch increases compile time but I can't conclude anything else either. In particular, the stock clang measurement have a variance of 20% between consecutive runs, so I have very little faith in the numbers collected.
Rafael recent('ish)ly published a set of suggestions/knob to turn on to get relatively stable numbers on a Linux machine. I'm also pretty sure the topic of how to get {reliable numbers, numbers you can have faith in} has been discussed multiple times (look at the archives, Sean has generally pretty informative posts/insights on the topic).

Could you give me a reference to the info? Of course I'll try to find it myself but with your help it could be faster.

In D26855#630344, @avt77 wrote:

In D26855#630341, @davide wrote:

I'll let Simon decide but these numbers are iffy. I can't necessarily conclude your patch increases compile time but I can't conclude anything else either. In particular, the stock clang measurement have a variance of 20% between consecutive runs, so I have very little faith in the numbers collected.
Rafael recent('ish)ly published a set of suggestions/knob to turn on to get relatively stable numbers on a Linux machine. I'm also pretty sure the topic of how to get {reliable numbers, numbers you can have faith in} has been discussed multiple times (look at the archives, Sean has generally pretty informative posts/insights on the topic).

Could you give me a reference to the info? Of course I'll try to find it myself but with your help it could be faster.

http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20161017/398831.html

I made new experiments but now I use a dedicated computer for it:

atischenko@ip-172-31-21-62:~/workspaces$ cat time.log
cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/home/atischenko/workspaces/step3/build/bin/clang -DCMAKE_CXX_COMPILER=/home/atischenko/workspaces/step3/build/bin/clang++ -DCMAKE_C_FLAGS="-O3 -ffast-math" -DCMAKE_CXX_FLAGS="-O3 -ffast-math" ../llvm
time ninja -j4

Compiler with patch

real 32m10.783s
user 125m19.424s
sys 3m8.456s

real 31m20.432s
user 122m2.012s
sys 3m4.444s

On dedicated computer

real 31m23.564s
user 122m13.120s
sys 3m5.340s

real 31m28.115s
user 122m30.596s
sys 3m6.588s

real 31m22.861s
user 122m8.920s
sys 3m7.236s

Trunk based compiler

cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/home/atischenko/workspaces/bootstrap-trunk/build/bin/clang -DCMAKE_CXX_COMPILER=/home/atischenko/workspaces/bootstrap-trunk/build/bin/clang++ -DCMAKE_C_FLAGS="-O3 -ffast-math" -DCMAKE_CXX_FLAGS="-O3 -ffast-math" ../llvm
time ninja -j4

real 31m46.001s
user 123m39.192s
sys 3m10.180s

real 40m6.791s
user 156m5.472s
sys 3m36.476s

On dedicated computer

real 31m24.634s
user 122m14.912s
sys 3m8.080s

real 31m22.833s
user 122m9.836s
sys 3m5.676s

real 31m22.795s
user 122m5.924s
sys 3m8.588s

Hope now we can trust my results.

I fixed the last issues raised by Gerolf except one related to special case of "if" because the suggested change breaks the current logic.

From my perspective the implementation is close and only requires a few minor changes.

The compile-time numbers I've seen so far are meaningless (wide variation, unclear if/how many times your code actually fires etc), but I'm not too concerned about O3 fast-math ct and would give it benefit of the doubt.

I did ask the question about performance benefits twice to no avail and admit I'm still curious. I assume to get these numbers you do set your machines into perf mode rather than using some servers running some random load.

Thanks
Gerolf

lib/Target/X86/X86InstrInfo.cpp
8648	Add comment what the function does before the function
8652	it's -> it is
8720	input to this function are 7 parameters, the comment only lists 6.
8757	I keep stumbling over this comment and every time i read it: did you mean to say something like //Execute at least one iteration. Iternations = Max(1, Iterations);
8760	Where is that input sequence documented?

I fixed everything except one comment (see below). And I collected new perf numbers. Now I used the following command for bootstrap building:

time make -j 1

As result the reproducing is very well from my point of view. In addition I tried to get numbers accordingly to description in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20161017/398831.html . The reproducing is almost the same but the required time is even longer (about 2 hours for every test). Because of that I kept my numbers. The test itself is very simple: I created 2 versions of compiler: the first one was built directly from trunk and the second one was built after applying of my patch. Then with help of every compiler I created 4 bootstraps. The results are below:

The trunk version of the compiler builds the bootstrap like here (there were 4 starts with "time make -j 1"):

real  91m7.998s
real  90m23.861s
real  90m26.154s
real  90m31.533s

The version of the compiler with my patch builds the bootstrap like here (there were the same 4 starts with "time make -j 1"):

real  90m43.970s
real 90m7.257s
real  90m6.671s
real  90m11.733s

Obviously, the compilation time does not depend on my patch.

avt77 added inline comments.Jan 16 2017, 3:22 AM

lib/Target/X86/X86InstrInfo.cpp
8720	I did not understand this comment: what should I do here?

RKSimon added inline comments.Jan 16 2017, 1:41 PM

lib/Target/X86/X86InstrInfo.cpp
8720	I think it means that while ArrayRef<int> Instrs have 7 instructions listed, the codegen in the comment only shows 6 instructions

avt77 added inline comments.Jan 18 2017, 1:03 AM

lib/Target/X86/X86InstrInfo.cpp
8720	But in fact all 7 instructions are shown but from index 0 to index 6 (maybe in "strange" order: 0,2,1,3,4,5,6). If you'd like I could change the order and/or start numbering from 1. Gerolf, should I do it or we fixed everything?

In D26855#647135, @avt77 wrote:

I fixed everything except one comment (see below). And I collected new perf numbers. Now I used the following command for bootstrap building:

time make -j 1

As result the reproducing is very well from my point of view. In addition I tried to get numbers accordingly to description in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20161017/398831.html . The reproducing is almost the same but the required time is even longer (about 2 hours for every test). Because of that I kept my numbers. The test itself is very simple: I created 2 versions of compiler: the first one was built directly from trunk and the second one was built after applying of my patch. Then with help of every compiler I created 4 bootstraps. The results are below:

So, I'm mostly lurking, but I want to point out a serious issue here: Clang and LLVM have as little floating point as we could manage in them. So I would expect them to be quite uninteresting in testing the compile time impact of a patch that is only concerned with floating point code....

And there still are no numbers around the improvement here...

I suspect you'll need to provide benchmark data from at least SPEC and/or the LLVM test suite that shows this is an improvement and compile time numbers to show that the improvement doesn't cost too much... At least, that would be my expectation. Those at least do include some floating point code. You might also try running benchmarks from the Eigen project which has a very large amount of floating point code. However, they usually don't build with any unsafe math flags, so correctness issues may dominate.

It'd also be great to hear from others invested in LLVM's FP lowering like Hal, Steve, etc...

What is "Eigen project"? Could you point me to it?

I'm leaning towards a LGTM since you addressed basically all my issues, but more people mushroomed and are curious about your performance data. So I think can't dodge that question anymore and need to share some good data for the benchmark(s) you are tuning for before you get the nod.

Cheers
Gerolf

lib/Target/X86/X86InstrInfo.cpp
8720	In the comments I only count 6 instructions from vmovss to vaddss, but the code handles 7. What am I missing?

Unless the change is trivially and obviously an improvement on inspection of the result, I think you need data before making it. =] I've not looked at this particular change in enough detail to say whether it satisfies that. Maybe Gerolf or Craig could.

If there are compile time concerns (which actually came up) and they deal with floating point and unsafe floating point in particular, then I don't think a bootstrap is a useful way to measure anything one way or the other (there just isn't enough floating point).

In D26855#649578, @avt77 wrote:

What is "Eigen project"? Could you point me to it?

http://eigen.tuxfamily.org/

In D26855#650115, @chandlerc wrote:

Unless the change is trivially and obviously an improvement on inspection of the result, I think you need data before making it. =] I've not looked at this particular change in enough detail to say whether it satisfies that. Maybe Gerolf or Craig could.

Another way besides benchmark data to show runtime improvements are tools like IACA. This is how we curated most of the vector shuffle improvements over the last few years for example.

Chandler, in fact this patch should not show any improvement in generating code. If you look in changes made in tests you'll see that the newly generated code is almost identical to the previous one (only some names, order of instructions, etc.). The idea of the patch is moving of such kind of optimization from the rather high level (DAGCombiner) to the really low level (MachineCombiner), Here we see real target machine instructions and as result we can use real cost model to estimate the real cost of possible transformation (in the given case the transformation is the replacement of one instruction (div) with some sequence of instructions). The transformation itself already exists inside Clang but the patch suggests to implement it in another place and that's it. If we agree with this new place of implementation then it will be the base for future possible similar optimizations like rsqrt, etc. And in addition this (and follow up) patch(es) will allow us to remove 'fake' subtarget features like FeatureFastScalarFSQRT / FeatureFastVectorFSQRT, etc. The question from Gerolf was not about the quality of the generated code (it's the same like we have now) but about the compilation time only.

Of course I'll try to collect the required perf numbers but they should be the same like we have now. Do we really need it?

Hi All,
I found a really "stress" test for div operations (see the attachment)

spill_fdiv.ll644 KBDownload

(tnx to Sanjay Patel). The test shows maybe the worst case of the possible degradation because of this patch. I used the following command with 2 different compilers:

cmake -E time ./llc spill_div.ll -o /dev/null -enable-unsafe-fp-math

For "pure" trunk compiler I got: Elapsed time 2 s.
For compiler with patch I got: Elapsed time 18 s.

(I launched the test several times with the same results.)
What's now? Is it acceptable? Should I try to optimize the patch? Should I try other benchmarks?
(I tried both LNT and Eigen but unfortunately they don't work for me at the moment because of unpredictable runtime issues.)

In D26855#657287, @avt77 wrote:

Hi All,
I found a really "stress" test for div operations (see the attachment)
spill_fdiv.ll644 KBDownload
(tnx to Sanjay Patel). The test shows maybe the worst case of the possible degradation because of this patch. I used the following command with 2 different compilers:

.......

For "pure" trunk compiler I got: Elapsed time 2 s.
For compiler with patch I got: Elapsed time 18 s.

Do you have any profiling info on where the time is going please? @Gerolf might then be able to advise whether we can improve the MCCombiner mechanism before/after this patch has gone in.

In D26855#657464, @RKSimon wrote:

In D26855#657287, @avt77 wrote:

Hi All,
I found a really "stress" test for div operations (see the attachment)
spill_fdiv.ll644 KBDownload
(tnx to Sanjay Patel). The test shows maybe the worst case of the possible degradation because of this patch. I used the following command with 2 different compilers:

.......

For "pure" trunk compiler I got: Elapsed time 2 s.
For compiler with patch I got: Elapsed time 18 s.

Do you have any profiling info on where the time is going please? @Gerolf might then be able to advise whether we can improve the MCCombiner mechanism before/after this patch has gone in.

I'll jump in here because I supplied this (hopefully degenerate and worst) case based on my earlier reassociation transforms for MachineCombiner (see D10321 where I first mentioned the potential compile-time problem). When I looked into that, the time was all going into MachineTraceMetrics::Ensemble::computeInstrDepths() and MachineTraceMetrics::Ensemble::
computeInstrHeights(). Those got very expensive for long sequences of instructions. Possible fixes would be to improve how those are computed, cache those results, and/or eliminate how often we ask for those values.

We were ok with some additional potential compile-time cost because reassociation opportunities do not appear to be that common and were limited to -ffast-math compiles . We can make similar arguments for the recip transforms in this patch?

But it is worth noting that since the time of D10321, the number of reassociation candidate opcodes that x86 matches has grown to ~200 (X86InstrInfo::isAssociativeAndCommutative()) and includes integer ops. We're probably overdue on measuring and improving the perf of MachineCombiner.

I just re-read some of the comments in this review, and back on Nov 30 I explained the possible compile-time hit. I requested a note to llvm-dev at that time. Based on the confusion in the subsequent comments (people think this patch will affect execution perf), I am making the suggestion to post to llvm-dev again. Moving from DAG heuristics to MI-scheduler-based transforms is a change in strategy for all targets and explaining that to a wider audience is an opportunity to get good feedback.

In D26855#657577, @spatel wrote:

In D26855#657464, @RKSimon wrote:

In D26855#657287, @avt77 wrote:

Hi All,
I found a really "stress" test for div operations (see the attachment)
spill_fdiv.ll644 KBDownload
(tnx to Sanjay Patel). The test shows maybe the worst case of the possible degradation because of this patch. I used the following command with 2 different compilers:

.......

For "pure" trunk compiler I got: Elapsed time 2 s.
For compiler with patch I got: Elapsed time 18 s.

Do you have any profiling info on where the time is going please? @Gerolf might then be able to advise whether we can improve the MCCombiner mechanism before/after this patch has gone in.

I'll jump in here because I supplied this (hopefully degenerate and worst) case based on my earlier reassociation transforms for MachineCombiner (see D10321 where I first mentioned the potential compile-time problem). When I looked into that, the time was all going into MachineTraceMetrics::Ensemble::computeInstrDepths() and MachineTraceMetrics::Ensemble::
computeInstrHeights(). Those got very expensive for long sequences of instructions. Possible fixes would be to improve how those are computed, cache those results, and/or eliminate how often we ask for those values.

Certainly seems worthwhile exploring whether those can be cached (if I understand what they're doing, we do essentially cache very-similar values during MI scheduling). This worst-case hit is definitely undesirable, and we can certainly run into lots of machine-generated straight-line code, so hitting these kinds of cases in the wild is not unthinkable.

We were ok with some additional potential compile-time cost because reassociation opportunities do not appear to be that common and were limited to -ffast-math compiles . We can make similar arguments for the recip transforms in this patch?

But it is worth noting that since the time of D10321, the number of reassociation candidate opcodes that x86 matches has grown to ~200 (X86InstrInfo::isAssociativeAndCommutative()) and includes integer ops. We're probably overdue on measuring and improving the perf of MachineCombiner.

I think the only issue that needs to be addressed is (finally!) sharing perf data. This has been raised at least 3 times. The possible compile-time implication, the speciality of the application (fast-math) etc are well understood.

Gerolf

In D26855#657735, @Gerolf wrote:

I think the only issue that needs to be addressed is (finally!) sharing perf data. This has been raised at least 3 times. The possible compile-time implication, the speciality of the application (fast-math) etc are well understood.

Gerolf

As I understand it the idea is that by moving this to the MC, then these alternative patterns will only be used if (1) the fast-math code permits it and (2) that the target cpu's scheduler model indicates that its quicker? So what you are asking is that we time the two versions of the code on specific cpus to check if in each case the correct decision is made?

This probably means that the tests should be updated to check against a couple of specific target cpus as well - we're limited by what x86 schedulers we have as but I know Jaguar (btver2) should use the rcpps version in all cases and expect Haswell should use divps.

A quick look at the SandyBridge scheduler model suggests its latency for FDIV is too low (especially ymm as it only has a 128-bit div alu) so that will select divps when it probably shouldn't....

In D26855#657735, @Gerolf wrote:

I think the only issue that needs to be addressed is (finally!) sharing perf data. This has been raised at least 3 times. The possible compile-time implication, the speciality of the application (fast-math) etc are well understood.

Gerolf

Sorry but I don't understand what means sharing in this case? I put all perf numbers here in comments. Is not it enough for sharing? If not where should I share it? Or maybe my perf numbers are not perf numbers from your point of view? Please, clarify.

And next question about profiling data. Should I collect it? I've already started the process but now I'm not sure if it's interesting for somebody.

I got the first profiling data. In fact it's the same that was described by Sanjay:

Samples: 1M of event 'cycles:pp', Event count (approx.): 1180464
Overhead Command Source Shared Object Source Symbol

15,65%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::computeInstrDepths
15,18%  llc      llc                   [.] getDataDeps
 9,23%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::computeInstrHeights
 8,29%  llc      llc                   [.] pushDepHeight
 8,15%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::invhalidate
 5,64%  llc      llc                   [.] llvm::TargetInstrInfo::defaultDefLatency
 4,89%  llc      llc                   [.] llvm::MachineTraceMetrics::getResources
 2,44%  llc      llc                   [.] llvm::X86InstrInfo::isHighLatencyDef

Should I try to cash the metrics or it's a question of a special patch?

I updated recip-fastmath2.ll test accordingly to Simon recommendations. Now it includes special checks for different CPUs: SandyBridge, Haswell and btver2. These new checks demonstrate that alternative sequence of instructions is being selected when it's really cheaper than the single fdiv instruction. (Obviously we should change cost numbers for SandyBridge because they are too small.)

In D26855#658647, @avt77 wrote:
I got the first profiling data. In fact it's the same that was described by Sanjay:

Samples: 1M of event 'cycles:pp', Event count (approx.): 1180464
Overhead Command Source Shared Object Source Symbol
15,65%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::computeInstrDepths
15,18%  llc      llc                   [.] getDataDeps
 9,23%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::computeInstrHeights
 8,29%  llc      llc                   [.] pushDepHeight
 8,15%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::invhalidate
 5,64%  llc      llc                   [.] llvm::TargetInstrInfo::defaultDefLatency
 4,89%  llc      llc                   [.] llvm::MachineTraceMetrics::getResources
 2,44%  llc      llc                   [.] llvm::X86InstrInfo::isHighLatencyDef
Should I try to cash the metrics or it's a question of a special patch?

I think you should look at caching them, or limiting their depth (beyond a certain point, the exact answer might not matter), or both.

In D26855#658576, @avt77 wrote:

In D26855#657735, @Gerolf wrote:

I think the only issue that needs to be addressed is (finally!) sharing perf data. This has been raised at least 3 times. The possible compile-time implication, the speciality of the application (fast-math) etc are well understood.

Gerolf

Sorry but I don't understand what means sharing in this case? I put all perf numbers here in comments. Is not it enough for sharing? If not where should I share it? Or maybe my perf numbers are not perf numbers from your point of view? Please, clarify.

You shared both compile-time numbers and runtime numbers by building clang, which is "insensitive" to floating point optimization. So it was asked to motivate better your change with benchmarks that can show codegen improvements in practice.

Chandler, in fact this patch should not show any improvement in generating code. [...] The idea of the patch is moving of such kind of optimization from the rather high level (DAGCombiner) to the really low level (MachineCombiner), [....] if we agree with this new place of implementation then it will be the base for future possible similar optimizations like rsqrt, etc. And in addition this (and follow up) patch(es) will allow us to remove 'fake' subtarget features like FeatureFastScalarFSQRT / FeatureFastVectorFSQRT, etc.

At this point, without an example showing that you can outperform the DAG approach, it is quite hypothetical ("believe me it is better!").
However GlobalISel could be a motivation? It'll need to reproduce all the combine from the DAG there isn't it?

The question from Gerolf was not about the quality of the generated code (it's the same like we have now) but about the compilation time only.

I think he was asking both :)

guyblank added a subscriber: guyblank.Jan 30 2017, 9:02 AM

guyblank added inline comments.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15331 ↗	(On Diff #86042)	sorry for joining the discussion so late. IIUC this affects all X86 CPUs, but I didn't see handling for all possible types (as stated in a TODO below). specifically, does this affect AVX-512 code?

avt77 added inline comments.Jan 31 2017, 1:46 AM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15331 ↗	(On Diff #86042)	At the moment we support only limited set of types (see X86InstrInfo::genAlternativeCodeSequence below) but we are ready to extend it if it is necessary.

guyblank added inline comments.Feb 1 2017, 1:29 AM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15331 ↗	(On Diff #86042)	you support the machine combiner approach for a limited set of types/instructions, but you've disabled the DAG combine approach for ALL types/instructions. I ran CodeGen/X86/recip-fastmath.ll on knl with your changes. the output for @f32_one_step (for example) changed from vrcp14ss %xmm0, %xmm0, %xmm1 vfnmadd213ss .LCPI1_0(%rip), %xmm1, %xmm0 vfmadd132ss %xmm1, %xmm1, %xmm0 to vmovss .LCPI1_0(%rip), %xmm1 vdivss %xmm0, %xmm1, %xmm0

AsafBadouh added a subscriber: AsafBadouh.Feb 1 2017, 1:44 AM

I fixed the issue with compile time increasing - see usage of MinInstr->getTrace(MBB). Now we're getting the trace when we really need it only. As result the executing profile was totally changed and the compiling time is now even less than it was in DAG Combiner - about 1.5 s on my laptop (I'm speaking about our worst case test only).

avt77 added inline comments.Feb 3 2017, 3:03 AM

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15331 ↗	(On Diff #86042)	I did not do anything with knl and as result you are right: this code generation was lost. Do you know other examples like that? If the problem in knl only then I could easily support it. But of course I could add special option like "-use_machine-combiner" or "-use-dag-combiner". What's better?

Please separate out the compile time improvements into a new patch for review - now that we have it the "compile time question" shouldn't hold up this patch any longer so we should be able to get this in as soon as the avx512 (knl/skx) issue is answered.

As for AVX512, I am ok with excluding it specifically in DAGCombiner.

But I'm also concerned about other cpus for which we don't have an accurate scheduler model (broadwell, skylake), should these be excluded as well?

lib/Target/X86/X86InstrInfo.cpp
8897	currently in Haswell and newer cpus, the generated sequence is using fma instructions. this should be taken into account here in the patterns, right?
test/CodeGen/X86/recip-fastmath.ll
3 ↗	(On Diff #86947)	please add RUN commands for specific cpus (as in the other test + AVX512 target) also commit the test changes and rebase your patch on them so we can see the output changes in these new runs.

RKSimon mentioned this in rL294128: [X86][SSE] Add target cpu specific reciprocal tests.Feb 5 2017, 10:37 AM

I've added the target cpu specific tests to trunk (rL294128) to help us track AVX512 (SKX/KNL) perf

In D26855#667181, @guyblank wrote:

But I'm also concerned about other cpus for which we don't have an accurate scheduler model (broadwell, skylake), should these be excluded as well?

If the cpu is using an older scheduler model then its already not necessarily being optimally scheduled - I don't see this patch being any different.

lib/Target/X86/X86InstrInfo.cpp
8897	Yes this should be taken into account, but it does mean that the number of codegen patterns is going to start to balloon if we're not careful.

In D26855#667256, @RKSimon wrote:

In D26855#667181, @guyblank wrote:

But I'm also concerned about other cpus for which we don't have an accurate scheduler model (broadwell, skylake), should these be excluded as well?

If the cpu is using an older scheduler model then its already not necessarily being optimally scheduled - I don't see this patch being any different.

Currently the older schedule "only" affects the schedule, but with this patch (and future ones using the machine combiner framework) it will affect which instructions we emit. This could possibly lead to generating worse code than we do at the moment.

mkuper added a subscriber: mkuper.Feb 6 2017, 12:58 AM

RKSimon mentioned this in rL294587: [X86][SSE] Added extra FMA/NO-FMA reciprocal test cases for D26855.Feb 9 2017, 6:25 AM

In D26855#667484, @guyblank wrote:

In D26855#667256, @RKSimon wrote:

In D26855#667181, @guyblank wrote:

But I'm also concerned about other cpus for which we don't have an accurate scheduler model (broadwell, skylake), should these be excluded as well?

If the cpu is using an older scheduler model then its already not necessarily being optimally scheduled - I don't see this patch being any different.

Currently the older schedule "only" affects the schedule, but with this patch (and future ones using the machine combiner framework) it will affect which instructions we emit. This could possibly lead to generating worse code than we do at the moment.

Recap:

This (and hopefully an equivalent sqrt/rsqrt future patch) are the stand out examples of cases that affect the actual result depending on MC and the scheduler model - minor code changes could cause it to switch between full precision divps and rcpps+nr, but this only happens with the necessary fast/unsafe flags enabled which means the user knows what to expect.

For most other possibly MC cases (e.g. constant rematerialization, shuffles, slow-vs-fast path instructions) there will be differences in the instructions used but not the final result.

RKSimon mentioned this in rL294750: [X86][SSE] Added chained FDIV test cases for D26855.Feb 10 2017, 7:07 AM

I fixed all known issues:

AVX512 is now again supported by DAGCombiner
FMA instructions are being used when FMA is enabled

This version clearly shows the advantage of sched model usage: it selects reciprocal code when it's profit only (e.g. compare v8f32_one_step and v8f32_one_step_2_divs, etc.)

In addition I removed the speedup related code from this patch because it was included into the special review: D29627

Herald added a subscriber: igorb. · View Herald TranscriptFeb 10 2017, 8:31 AM

ABataev resigned from this revision.Feb 13 2017, 11:42 AM

Hi All,
Do we except anything more here?
It seems I fixed all requirements. Maybe it's time for LGTM?

Guy Blank found a problem with PIC relocation model on SLM architecture. I fixed it and added the corresponding test.

avt77 mentioned this in rL296746: Added special test covering a problem with PIC relocation model on SLM….Mar 2 2017, 5:59 AM

I committed PIC-related test in trunk and updated this patch to be able to compare it with new code generation.

RKSimon added inline comments.Mar 9 2017, 11:27 AM

lib/CodeGen/MachineTraceMetrics.cpp
521 ↗	(On Diff #90329)	Is this still true after D29627?

avt77 added inline comments.Mar 10 2017, 12:09 AM

lib/CodeGen/MachineTraceMetrics.cpp

521 ↗

(On Diff #90329)

Yes, the profile shows that now (after D29627) this feature eats more time than any other

58,19%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::invalidate
 3,44%  llc      llc                   [.] (anonymous namespace)::TwoAddressInstructionPass::scanUses
 3,19%  llc      llc                   [.] llvm::ScheduleDAGSDNodes::ClusterNeighboringLoads
 1,59%  llc      llc                   [.] llvm::SparseMultiSet<llvm::VReg2SUnit, llvm::VirtReg2IndexFunctor, unsigned char>::find
 1,08%  llc      llc                   [.] llvm::X86InstrInfo::areLoadsFromSameBasePtr

RKSimon mentioned this in D38318: [X86][SSE] Match PSHUFLW/PSHUFHW + PSHUFD vXi16 shuffle patterns (PR34686).Oct 8 2017, 6:11 AM

@avt77 As we discussed offline, please can you strip out the debug changes and put them into a new patch?

avt77 mentioned this in D43813: [Machine Combiner] Valid use of OptSize.Feb 28 2018, 4:51 AM

RKSimon resigned from this revision.Feb 19 2019, 10:30 AM

Herald added a subscriber: jdoerfert. · View Herald TranscriptFeb 19 2019, 10:30 AM

Revision Contents

Path

Size

include/

llvm/

CodeGen/

MachineCombinerPattern.h

6 lines

lib/

Target/

X86/

X86InstrInfo.h

15 lines

X86InstrInfo.cpp

173 lines

Diff 78520

include/llvm/CodeGen/MachineCombinerPattern.h

Show First 20 Lines • Show All 65 Lines • ▼ Show 20 Lines	enum class MachineCombinerPattern {
FMLAv4i32_indexed_OP2,		FMLAv4i32_indexed_OP2,
FMLSv1i32_indexed_OP2,		FMLSv1i32_indexed_OP2,
FMLSv1i64_indexed_OP2,		FMLSv1i64_indexed_OP2,
FMLSv2i32_indexed_OP2,		FMLSv2i32_indexed_OP2,
FMLSv2i64_indexed_OP2,		FMLSv2i64_indexed_OP2,
FMLSv2f32_OP2,		FMLSv2f32_OP2,
FMLSv2f64_OP2,		FMLSv2f64_OP2,
FMLSv4i32_indexed_OP2,		FMLSv4i32_indexed_OP2,
FMLSv4f32_OP2		FMLSv4f32_OP2,

		// These are FDIV-RECIP patterns matched by X86 machine combiner
		DivSxRecip,
		VDivSxRecip
};		};

} // end namespace llvm		} // end namespace llvm

#endif		#endif

lib/Target/X86/X86InstrInfo.h

Show First 20 Lines • Show All 494 Lines • ▼ Show 20 Lines	bool hasHighOperandLatency(const TargetSchedModel &SchedModel,
const MachineInstr &DefMI, unsigned DefIdx,		const MachineInstr &DefMI, unsigned DefIdx,
const MachineInstr &UseMI,		const MachineInstr &UseMI,
unsigned UseIdx) const override;		unsigned UseIdx) const override;

bool useMachineCombiner() const override {		bool useMachineCombiner() const override {
return true;		return true;
}		}

		/// When getMachineCombinerPatterns() finds patterns, this function generates
		/// the instructions that could replace the original code sequence
		void genAlternativeCodeSequence(
		MachineInstr &Root, MachineCombinerPattern Pattern,
		SmallVectorImpl<MachineInstr *> &InsInstrs,
		SmallVectorImpl<MachineInstr *> &DelInstrs,
		DenseMap<unsigned, unsigned> &InstrIdxForVirtReg) const override;

		/// Return true when there is potentially a faster code sequence
		/// for an instruction chain ending in <Root>. All potential patterns are
		/// listed in the <Patterns> array.
		bool getMachineCombinerPatterns(
		MachineInstr &Root,
		SmallVectorImpl<MachineCombinerPattern> &Patterns) const override;

bool isAssociativeAndCommutative(const MachineInstr &Inst) const override;		bool isAssociativeAndCommutative(const MachineInstr &Inst) const override;

bool hasReassociableOperands(const MachineInstr &Inst,		bool hasReassociableOperands(const MachineInstr &Inst,
const MachineBasicBlock *MBB) const override;		const MachineBasicBlock *MBB) const override;

void setSpecialOperandAttr(MachineInstr &OldMI1, MachineInstr &OldMI2,		void setSpecialOperandAttr(MachineInstr &OldMI1, MachineInstr &OldMI2,
MachineInstr &NewMI1,		MachineInstr &NewMI1,
MachineInstr &NewMI2) const override;		MachineInstr &NewMI2) const override;
▲ Show 20 Lines • Show All 94 Lines • Show Last 20 Lines

lib/Target/X86/X86InstrInfo.cpp

This file is larger than 256 KB, so syntax highlighting is disabled by default.

Show First 20 Lines • Show All 8,639 Lines • ▼ Show 20 Lines	bool X86InstrInfo::isAssociativeAndCommutative(const MachineInstr &Inst) const {
case X86::VMULSDZrr:		case X86::VMULSDZrr:
case X86::VMULSSZrr:		case X86::VMULSSZrr:
return Inst.getParent()->getParent()->getTarget().Options.UnsafeFPMath;		return Inst.getParent()->getParent()->getTarget().Options.UnsafeFPMath;
default:		default:
return false;		return false;
}		}
}		}

		/// genFDivReciprocal - Generates reciprocal estimate instructions instead
		RKSimonUnsubmitted Not Done Reply Inline Actions Rename this to hasExactlyOneOperand ? AllOnes means 'all bits are set' RKSimon: Rename this to hasExactlyOneOperand ? AllOnes means 'all bits are set'
		GerolfUnsubmitted Not Done Reply Inline Actions Add comment what the function does before the function Gerolf: Add comment what the function does before the function
		/// of the given fdiv instruction
		///
		/// Example (-x86-asm-syntax=intel): instead of
		RKSimonUnsubmitted Not Done Reply Inline Actions Use a range iterator if you can, same for the other cases below RKSimon: Use a range iterator if you can, same for the other cases below
		///
		GerolfUnsubmitted Not Done Reply Inline Actions -> dividend Gerolf: -> dividend
		GerolfUnsubmitted Not Done Reply Inline Actions it's -> it is Gerolf: it's -> it is
		/// vmovss xmm1, dword ptr [rip + .LCPI0_0] # xmm1 = mem[0],zero,zero,zero
		ABataevUnsubmitted Not Done Reply Inline Actions I believe this can be replaced by a range-based loop ABataev: I believe this can be replaced by a range-based loop
		/// vdivss xmm0, xmm1, xmm0
		GerolfUnsubmitted Not Done Reply Inline Actions This is an expensive search. There must be a direct simpler way to get that info. Gerolf: This is an expensive search. There must be a direct simpler way to get that info.
		///
		/// we're generating
		///
		/// vmovss xmm1, dword ptr [rip + .LCPI0_0] # xmm1 = mem[0],zero,zero,zero
		ABataevUnsubmitted Not Done Reply Inline Actions auto &? ABataev: auto &?
		/// vrcpss xmm2, xmm0, xmm0
		/// vmulss xmm0, xmm0, xmm2
		/// vsubss xmm0, xmm1, xmm0
		RKSimonUnsubmitted Not Done Reply Inline Actions This should be C->isExactlyValue(1.0) RKSimon: This should be C->isExactlyValue(1.0)
		/// vmulss xmm0, xmm0, xmm2
		/// vaddss xmm0, xmm0, xmm2

		static MachineInstr *
		genFDivReciprocal(MachineFunction &MF, MachineRegisterInfo &MRI,
		GerolfUnsubmitted Not Done Reply Inline Actions Why this special case? just C = C->getSplatValue() would be easier to read. Gerolf: Why this special case? just C = C->getSplatValue() would be easier to read.
		const TargetInstrInfo *TII, MachineInstr &Root,
		SmallVectorImpl<MachineInstr *> &InsInstrs,
		DenseMap<unsigned, unsigned> &InstrIdxForVirtReg) {
		RKSimonUnsubmitted Not Done Reply Inline Actions This should be isDividendExactlyOne. We should be calling it Dividend not Divident as well. RKSimon: This should be isDividendExactlyOne. We should be calling it Dividend not Divident as well.

		unsigned ResultReg = Root.getOperand(0).getReg();
		unsigned SrcReg1 = Root.getOperand(1).getReg();
		bool Src1IsKill = Root.getOperand(1).isKill();
		unsigned SrcReg2 = Root.getOperand(2).getReg();
		bool Src2IsKill = Root.getOperand(2).isKill();

		const TargetRegisterInfo *TRI = MF.getSubtarget().getRegisterInfo();
		RKSimonUnsubmitted Not Done Reply Inline Actions Rename to isDividendEactlyOne and update the comment. RKSimon: Rename to isDividendEactlyOne and update the comment.
		const TargetRegisterClass *RC = Root.getRegClassConstraint(0, TII, TRI);
		ABataevUnsubmitted Not Done Reply Inline Actions Should these loops be range-based loops? ABataev: Should these loops be range-based loops?

		if (TargetRegisterInfo::isVirtualRegister(ResultReg))
		MRI.constrainRegClass(ResultReg, RC);
		if (TargetRegisterInfo::isVirtualRegister(SrcReg1))
		MRI.constrainRegClass(SrcReg1, RC);
		ABataevUnsubmitted Not Done Reply Inline Actions Range-based loop? ABataev: Range-based loop?
		if (TargetRegisterInfo::isVirtualRegister(SrcReg2))
		MRI.constrainRegClass(SrcReg2, RC);

		// vrcpss
		unsigned NewVR = MRI.createVirtualRegister(RC);
		RKSimonUnsubmitted Not Done Reply Inline Actions Embed hasAllOnesOperand here. RKSimon: Embed hasAllOnesOperand here.
		InstrIdxForVirtReg.insert(std::make_pair(NewVR, 0));
		ABataevUnsubmitted Not Done Reply Inline Actions No braces required ABataev: No braces required
		MachineInstrBuilder Rcp =
		BuildMI(MF, Root.getDebugLoc(), TII->get(llvm::X86::VRCPSSr), NewVR)
		.addReg(SrcReg2, getKillRegState(Src2IsKill))
		.addReg(SrcReg2, getKillRegState(Src2IsKill));
		// vmulss
		unsigned NewVR1 = MRI.createVirtualRegister(RC);
		InstrIdxForVirtReg.insert(std::make_pair(NewVR1, 0));
		MachineInstrBuilder Mul =
		BuildMI(MF, Root.getDebugLoc(), TII->get(llvm::X86::VMULSSrr), NewVR1)
		.addReg(SrcReg2, getKillRegState(Src2IsKill))
		.addReg(NewVR);
		// vsubss
		unsigned NewVR2 = MRI.createVirtualRegister(RC);
		InstrIdxForVirtReg.insert(std::make_pair(NewVR2, 0));
		MachineInstrBuilder Sub =
		BuildMI(MF, Root.getDebugLoc(), TII->get(llvm::X86::VSUBSSrr), NewVR2)
		GerolfUnsubmitted Not Done Reply Inline Actions For which types? All FP? Gerolf: For which types? All FP?
		.addReg(SrcReg1, getKillRegState(Src1IsKill))
		.addReg(NewVR1);
		// vmulss
		unsigned NewVR3 = MRI.createVirtualRegister(RC);
		InstrIdxForVirtReg.insert(std::make_pair(NewVR3, 0));
		GerolfUnsubmitted Not Done Reply Inline Actions Can you elaborate? eg. on #iterations? Gerolf: Can you elaborate? eg. on #iterations?
		MachineInstrBuilder Mul2 =
		BuildMI(MF, Root.getDebugLoc(), TII->get(llvm::X86::VMULSSrr), NewVR3)
		.addReg(NewVR2)
		.addReg(NewVR);
		// vaddss
		MachineInstrBuilder Add =
		BuildMI(MF, Root.getDebugLoc(), TII->get(llvm::X86::VADDSSrr), ResultReg)
		.addReg(NewVR3)
		.addReg(NewVR);
		InsInstrs.push_back(Rcp);
		GerolfUnsubmitted Not Done Reply Inline Actions input to this function are 7 parameters, the comment only lists 6. Gerolf: input to this function are 7 parameters, the comment only lists 6.
		avt77AuthorUnsubmitted Not Done Reply Inline Actions I did not understand this comment: what should I do here? avt77: I did not understand this comment: what should I do here?
		RKSimonUnsubmitted Not Done Reply Inline Actions I think it means that while ArrayRef<int> Instrs have 7 instructions listed, the codegen in the comment only shows 6 instructions RKSimon: I think it means that while ArrayRef<int> Instrs have 7 instructions listed, the codegen in the…
		avt77AuthorUnsubmitted Not Done Reply Inline Actions But in fact all 7 instructions are shown but from index 0 to index 6 (maybe in "strange" order: 0,2,1,3,4,5,6). If you'd like I could change the order and/or start numbering from 1. Gerolf, should I do it or we fixed everything? avt77: But in fact all 7 instructions are shown but from index 0 to index 6 (maybe in "strange" order…
		GerolfUnsubmitted Not Done Reply Inline Actions In the comments I only count 6 instructions from vmovss to vaddss, but the code handles 7. What am I missing? Gerolf: In the comments I only count 6 instructions from vmovss to vaddss, but the code handles 7. What…
		InsInstrs.push_back(Mul);
		InsInstrs.push_back(Sub);
		RKSimonUnsubmitted Not Done Reply Inline Actions Replace SmallVector<int, 7> with ArrayRef<int> RKSimon: Replace SmallVector<int, 7> with ArrayRef<int>
		InsInstrs.push_back(Mul2);
		InsInstrs.push_back(Add);
		return Rcp;
		}
		RKSimonUnsubmitted Not Done Reply Inline Actions This needs refactoring to support scalar and packed versions of SSE and AVX opcodes if possible. RKSimon: This needs refactoring to support scalar and packed versions of SSE and AVX opcodes if possible.

		/// When getMachineCombinerPatterns() finds potential patterns,
		/// this function generates the instructions that could replace the
		/// original code sequence
		void X86InstrInfo::genAlternativeCodeSequence(
		MachineInstr &Root, MachineCombinerPattern Pattern,
		SmallVectorImpl<MachineInstr *> &InsInstrs,
		SmallVectorImpl<MachineInstr *> &DelInstrs,
		DenseMap<unsigned, unsigned> &InstrIdxForVirtReg) const {
		MachineBasicBlock &MBB = *Root.getParent();
		MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();
		MachineFunction &MF = *MBB.getParent();
		const TargetInstrInfo *TII = MF.getSubtarget().getInstrInfo();

		switch (Pattern) {
		default:
		// Reassociate instructions.
		TargetInstrInfo::genAlternativeCodeSequence(Root, Pattern, InsInstrs,
		DelInstrs, InstrIdxForVirtReg);
		return;
		case MachineCombinerPattern::DivSxRecip:
		DEBUG(dbgs() << "The pattern found for DivSxRecip\n");
		RKSimonUnsubmitted Not Done Reply Inline Actions Dividend is only used once, merge these? unsigned DividendReg = Root.getOperand(1).getReg(); RKSimon: Dividend is only used once, merge these? ``` unsigned DividendReg = Root.getOperand(1).getReg…
		break;
		RKSimonUnsubmitted Not Done Reply Inline Actions AllOnes -> ExactlyOne RKSimon: AllOnes -> ExactlyOne
		case MachineCombinerPattern::VDivSxRecip:
		DEBUG(dbgs() << "The pattern found for VDivSxRecip\n");
		genFDivReciprocal(MF, MRI, TII, Root, InsInstrs, InstrIdxForVirtReg);
		DelInstrs.push_back(&Root); // Record FDiv for deletion
		// MUL = genFusedMultiply(MF, MRI, TII, Root, InsInstrs, 1, Opc, RC);
		break;
		}
		}
		GerolfUnsubmitted Not Done Reply Inline Actions Why? Gerolf: Why?
		GerolfUnsubmitted Not Done Reply Inline Actions I keep stumbling over this comment and every time i read it: did you mean to say something like //Execute at least one iteration. Iternations = Max(1, Iterations); Gerolf: I keep stumbling over this comment and every time i read it: did you mean to say something…

		GerolfUnsubmitted Not Done Reply Inline Actions wey -> why? Gerolf: wey -> why?
		/// Find instructions that can be turned into recip
		static bool getFDIVPatterns(MachineInstr &Root,
		GerolfUnsubmitted Not Done Reply Inline Actions Where is that input sequence documented? Gerolf: Where is that input sequence documented?
		SmallVectorImpl<MachineCombinerPattern> &Patterns) {
		// TODO: we should use function level of options see getRecipEstimateForFunc()
		if (!Root.getParent()->getParent()->getTarget().Options.UnsafeFPMath)
		ABataevUnsubmitted Not Done Reply Inline Actions reformat this ABataev: reformat this
		return false;
		unsigned Opc = Root.getOpcode();
		// MachineBasicBlock &MBB = *Root.getParent();
		bool Found = false;

		switch (Opc) {
		default:
		break;
		// TODO: at the moment we support FP scalar types only
		case X86::DIVSSrr: // f32
		case X86::DIVSDrr: // f64
		RKSimonUnsubmitted Not Done Reply Inline Actions We don't do reciprocal estimates for double types - these all need removing. RKSimon: We don't do reciprocal estimates for double types - these all need removing.
		assert(Root.getOperand(1).isReg() && Root.getOperand(2).isReg() &&
		"DIVSxrr does not have register operands");
		Patterns.push_back(MachineCombinerPattern::DivSxRecip);
		Found = true;
		break;
		// case X86::DIVPSrr: // v4f32
		// case X86::DIVPDrr: // v2f64
		// case X86::VDIVPSrr: // f4f32
		// case X86::VDIVPDrr: // v2f64
		case X86::VDIVSSrr: // f32
		case X86::VDIVSDrr: // f64
		assert(Root.getOperand(1).isReg() && Root.getOperand(2).isReg() &&
		Root.getOperand(2).isReg() &&
		"VDIVSxrrr does not have register operands");
		Patterns.push_back(MachineCombinerPattern::VDivSxRecip);
		Found = true;
		break;
		// case X86::VDIVPSYrr: // v8f32
		// case X86::VDIVPDYrr: // v4f64
		// assert(Root.getOperand(1).isReg() && Root.getOperand(2).isReg() &&
		// "FDIV_Yrr does not have register operands");
		// Patterns.push_back(MachineCombinerPattern::FDivRecip2);
		// Found = true;
		// break;
		}
		return Found;
		}

		/// Return true when there is potentially a faster code sequence for an
		/// instruction chain ending in \p Root. All potential patterns are listed in
		/// the \p Pattern vector. Pattern should be sorted in priority order since the
		/// pattern evaluator stops checking as soon as it finds a faster sequence.

		bool X86InstrInfo::getMachineCombinerPatterns(
		MachineInstr &Root,
		SmallVectorImpl<MachineCombinerPattern> &Patterns) const {
		// FDIV patterns
		if (getFDIVPatterns(Root, Patterns))
		return true;
		// FSQRT patterns
		RKSimonUnsubmitted Not Done Reply Inline Actions Can we replace TII->get(Instrs[2]) with MIDesc? RKSimon: Can we replace TII->get(Instrs[2]) with MIDesc?
		// if (getFSQRTPatterns(Root, Patterns))
		// return true;
		RKSimonUnsubmitted Not Done Reply Inline Actions Replace this with a basic TODO comment for future FSQRT pattern support - don't comment out code (especially when it doesn't exist). RKSimon: Replace this with a basic TODO comment for future FSQRT pattern support - don't comment out…

		return TargetInstrInfo::getMachineCombinerPatterns(Root, Patterns);
		}

/// This is an architecture-specific helper function of reassociateOps.		/// This is an architecture-specific helper function of reassociateOps.
		RKSimonUnsubmitted Not Done Reply Inline Actions Can we replace TII->get(Instrs[2]) with MIDesc? RKSimon: Can we replace TII->get(Instrs[2]) with MIDesc?
/// Set special operand attributes for new instructions after reassociation.		/// Set special operand attributes for new instructions after reassociation.
void X86InstrInfo::setSpecialOperandAttr(MachineInstr &OldMI1,		void X86InstrInfo::setSpecialOperandAttr(MachineInstr &OldMI1,
MachineInstr &OldMI2,		MachineInstr &OldMI2,
MachineInstr &NewMI1,		MachineInstr &NewMI1,
MachineInstr &NewMI2) const {		MachineInstr &NewMI2) const {
// Integer instructions define an implicit EFLAGS source register operand as		// Integer instructions define an implicit EFLAGS source register operand as
// the third source (fourth total) operand.		// the third source (fourth total) operand.
if (OldMI1.getNumOperands() != 4 \|\| OldMI2.getNumOperands() != 4)		if (OldMI1.getNumOperands() != 4 \|\| OldMI2.getNumOperands() != 4)
return;		return;

assert(NewMI1.getNumOperands() == 4 && NewMI2.getNumOperands() == 4 &&		assert(NewMI1.getNumOperands() == 4 && NewMI2.getNumOperands() == 4 &&
		GerolfUnsubmitted Not Done Reply Inline Actions Should there be a 2:? Gerolf: Should there be a 2:?
"Unexpected instruction type for reassociation");		"Unexpected instruction type for reassociation");

MachineOperand &OldOp1 = OldMI1.getOperand(3);		MachineOperand &OldOp1 = OldMI1.getOperand(3);
MachineOperand &OldOp2 = OldMI2.getOperand(3);		MachineOperand &OldOp2 = OldMI2.getOperand(3);
		RKSimonUnsubmitted Not Done Reply Inline Actions This isn't necessary if DividentIsAllOnes == true RKSimon: This isn't necessary if DividentIsAllOnes == true
MachineOperand &NewOp1 = NewMI1.getOperand(3);		MachineOperand &NewOp1 = NewMI1.getOperand(3);
MachineOperand &NewOp2 = NewMI2.getOperand(3);		MachineOperand &NewOp2 = NewMI2.getOperand(3);

assert(OldOp1.isReg() && OldOp1.getReg() == X86::EFLAGS && OldOp1.isDead() &&		assert(OldOp1.isReg() && OldOp1.getReg() == X86::EFLAGS && OldOp1.isDead() &&
"Must have dead EFLAGS operand in reassociable instruction");		"Must have dead EFLAGS operand in reassociable instruction");
assert(OldOp2.isReg() && OldOp2.getReg() == X86::EFLAGS && OldOp2.isDead() &&		assert(OldOp2.isReg() && OldOp2.getReg() == X86::EFLAGS && OldOp2.isDead() &&
"Must have dead EFLAGS operand in reassociable instruction");		"Must have dead EFLAGS operand in reassociable instruction");

▲ Show 20 Lines • Show All 44 Lines • ▼ Show 20 Lines	static const std::pair<unsigned, const char *> TargetFlags[] = {
{MO_TLVP_PIC_BASE, "x86-tlvp-pic-base"},		{MO_TLVP_PIC_BASE, "x86-tlvp-pic-base"},
{MO_SECREL, "x86-secrel"}};		{MO_SECREL, "x86-secrel"}};
return makeArrayRef(TargetFlags);		return makeArrayRef(TargetFlags);
}		}

bool X86InstrInfo::isTailCall(const MachineInstr &Inst) const {		bool X86InstrInfo::isTailCall(const MachineInstr &Inst) const {
switch (Inst.getOpcode()) {		switch (Inst.getOpcode()) {
case X86::TCRETURNdi:		case X86::TCRETURNdi:
case X86::TCRETURNmi:		case X86::TCRETURNmi:
		guyblankUnsubmitted Not Done Reply Inline Actions currently in Haswell and newer cpus, the generated sequence is using fma instructions. this should be taken into account here in the patterns, right? guyblank: currently in Haswell and newer cpus, the generated sequence is using fma instructions. this…
		RKSimonUnsubmitted Not Done Reply Inline Actions Yes this should be taken into account, but it does mean that the number of codegen patterns is going to start to balloon if we're not careful. RKSimon: Yes this should be taken into account, but it does mean that the number of codegen patterns is…
case X86::TCRETURNri:		case X86::TCRETURNri:
case X86::TCRETURNdi64:		case X86::TCRETURNdi64:
case X86::TCRETURNmi64:		case X86::TCRETURNmi64:
case X86::TCRETURNri64:		case X86::TCRETURNri64:
case X86::TAILJMPd:		case X86::TAILJMPd:
case X86::TAILJMPm:		case X86::TAILJMPm:
case X86::TAILJMPr:		case X86::TAILJMPr:
case X86::TAILJMPd64:		case X86::TAILJMPd64:
Show All 19 Lines	bool runOnMachineFunction(MachineFunction &MF) override {
static_cast<const X86TargetMachine *>(&MF.getTarget());		static_cast<const X86TargetMachine *>(&MF.getTarget());
const X86Subtarget &STI = MF.getSubtarget<X86Subtarget>();		const X86Subtarget &STI = MF.getSubtarget<X86Subtarget>();

// Don't do anything if this is 64-bit as 64-bit PIC		// Don't do anything if this is 64-bit as 64-bit PIC
// uses RIP relative addressing.		// uses RIP relative addressing.
if (STI.is64Bit())		if (STI.is64Bit())
return false;		return false;

// Only emit a global base reg in PIC mode.		// Only emit a global base reg in PIC mode.
		RKSimonUnsubmitted Not Done Reply Inline Actions Inconsistent braces to the other cases. RKSimon: Inconsistent braces to the other cases.
if (!TM->isPositionIndependent())		if (!TM->isPositionIndependent())
return false;		return false;

X86MachineFunctionInfo *X86FI = MF.getInfo<X86MachineFunctionInfo>();		X86MachineFunctionInfo *X86FI = MF.getInfo<X86MachineFunctionInfo>();
unsigned GlobalBaseReg = X86FI->getGlobalBaseReg();		unsigned GlobalBaseReg = X86FI->getGlobalBaseReg();

// If we didn't need a GlobalBaseReg, don't insert code.		// If we didn't need a GlobalBaseReg, don't insert code.
if (GlobalBaseReg == 0)		if (GlobalBaseReg == 0)
▲ Show 20 Lines • Show All 160 Lines • Show Last 20 Lines