
New unsafe-fp-math implementation for X86 target
Needs Review · Public

Authored by avt77 on Nov 18 2016, 7:24 AM.

Details

Summary

The current fast-math implementation is based on DAGCombiner. One disadvantage of that is that it ignores any cost model (throughput and/or length). Another disadvantage is that a target-specific optimization is implemented at an improper level of the transformation pipeline. The introduced version moves the implementation lower, into MachineCombiner. As a result we use the existing transformation mechanism, namely getMachineCombinerPatterns and genAlternativeCodeSequence. In addition, both the throughput and the length cost models are applied automatically.
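
Roughly, the MachineCombiner asks the target for candidate patterns rooted at an instruction and then for a replacement sequence; the sketch below only shows the shape of the X86 hook for the FDIV case (the pattern enum value is hypothetical and the opcode list is illustrative, not the exact contents of this patch):

    // Sketch only: recognize a scalar fast-math FDIV that may be rewritten as a
    // reciprocal estimate plus refinement. The MachineCombiner later costs the
    // alternative sequence against the original instruction.
    bool X86InstrInfo::getMachineCombinerPatterns(
        MachineInstr &Root,
        SmallVectorImpl<MachineCombinerPattern> &Patterns) const {
      switch (Root.getOpcode()) {
      case X86::DIVSSrr:
      case X86::VDIVSSrr:
        // Hypothetical pattern kind; the real patch defines its own entries.
        Patterns.push_back(MachineCombinerPattern::DIV_ESTIMATE);
        return true;
      default:
        return TargetInstrInfo::getMachineCombinerPatterns(Root, Patterns);
      }
    }

genAlternativeCodeSequence then emits the reciprocal-estimate instructions for the chosen pattern, and the combiner keeps them only if the scheduler model says the new critical path and resource length are no worse.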

This patch is only an initial step to demonstrate the intention. It implements only one type of transformation: reciprocal-estimate code instead of the vdivss instruction. I'm going to support other types of reciprocal optimizations as well, but I'd like to get initial comments on this work first. The code compiles cleanly and produces working output.
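
For reference, the reciprocal-estimate rewrite computes 1/a from a hardware estimate x0 ≈ 1/a refined by one Newton-Raphson step; a minimal scalar sketch of that step (not the patch's code) is:

    // One Newton-Raphson refinement of a reciprocal estimate x0 ~= 1/a:
    //   x1 = x0 * (2 - a * x0)
    // which is what the rcpss + mul/sub (or FMA) sequence computes instead of
    // issuing vdivss.
    float recip_one_step(float a, float x0) {
      return x0 * (2.0f - a * x0);
    }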

Diff Detail

Event Timeline

Gerolf added inline comments.Dec 21 2016, 2:36 PM
include/llvm/CodeGen/MachineInstr.h
1154 ↗(On Diff #81735)

MBB instead of P? Is this for debugging?

lib/CodeGen/MachineCombiner.cpp
160

Please add a comment that explains why default is used.

+ assert(DefIdx || UseIdx);

lib/CodeGen/MachineInstr.cpp
1712 ↗(On Diff #81735)

you can use the new parameter and then if (!MBB) MBB=getParent()

So what do you suggest? Can we get this in before refactoring the MC alternative code pattern system or will that cause too much additional complexity?

Either way, it sounds like we'd be better off doing this all after the 4.0 branch - do you agree?

avt77 added a comment.Dec 22 2016, 4:38 AM

What do you mean when you speak about "automation"? Do you mean the possibility of describing alternative sequences with tools like TableGen? If so, I'm afraid it will require some real time to implement. That's why, from my point of view, hand-made patterns similar to the given one could be really useful in the future. But of course it would be really interesting to launch such a project. Right?

avt77 updated this revision to Diff 82342.Dec 22 2016, 8:12 AM
avt77 edited edge metadata.
  1. The current trunk already has changes in Machine::print, etc. similar to those in my initial patch. Because of that I removed all the corresponding changes and did not answer the corresponding comments.
  2. It seems I fixed all other requirements from Gerolf.
  3. But the main question is the same: should we continue with the effort?
  1. But the main question is the same: should we continue with the effort?

Absolutely, but I'm starting to think it makes sense to wait until after the 4.0 branch, then get this in, as we need an x86 implementation for reference, and then begin an investigation into how we want the MachineCombiner to develop.

The "automatic" generation of pattern e.g. with TableGen is on my longer term wish list, not a requirement for this patch. Sorry if my wording was confusing.
Do you have performance numbers?

lib/Target/X86/X86InstrInfo.cpp
10109

-> dividend

10123

Why this special case? just C = C->getSplatValue() would be easier to read.

10215

wey -> why?

10289

Should there be a 2:?

avt77 added a comment.Dec 23 2016, 8:34 AM

Yes, I've just got the numbers. I created 2 versions of the clang compiler: one directly from trunk and one with my patch applied. Then, with the help of these compilers, I created 2 new compilers with the following configuration:

cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=<trun/patch compiler home>/build/bin/clang -DCMAKE_CXX_COMPILER=<trunk/patch compiler home>/build/bin/clang++ -DCMAKE_C_FLAGS="-O3 -ffast-math" -DCMAKE_CXX_FLAGS="-O3 -ffast-math" ../llvm

Below you can see the times (I did 2 experiments for every compiler):


Compiler with patch


real 32m10.783s
user 125m19.424s
sys 3m8.456s

real 31m20.432s
user 122m2.012s
sys 3m4.444s


Trunk based compiler


real 31m46.001s
user 123m39.192s
sys 3m10.180s

real 40m6.791s
user 156m5.472s
sys 3m36.476s

Of course these are very rough estimates because I used our server and there is a lot going on around it. But the general picture is clear from my point of view: my patch does not increase the compilation time.
Is that enough, or should I do other experiments?

davide added a subscriber: davide.Dec 23 2016, 8:50 AM

I'll let Simon decide, but these numbers are iffy. I can't necessarily conclude your patch increases compile time, but I can't conclude anything else either. In particular, the stock clang measurements have a variance of 20% between consecutive runs, so I have very little faith in the numbers collected.
Rafael recent(ish)ly published a set of suggestions/knobs to turn on to get relatively stable numbers on a Linux machine. I'm also pretty sure the topic of how to get {reliable numbers, numbers you can have faith in} has been discussed multiple times (look at the archives; Sean has generally pretty informative posts/insights on the topic).

avt77 added a comment.Dec 23 2016, 8:53 AM

Two more comments:

  1. I did not update the patch according to the latest Gerolf comments; I'll do it ASAP.
  2. Gerolf asked: "Perhaps I missed it but I expected the optimization to kick in only under fast math. I saw 'fast' in the test cases, but didn't see a check in the code."

    I check fast-math in "static bool getFDIVPatterns"

    switch (TLI->getRecipEstimateDivEnabled(VT, *MF)) {   // this line checks the per-function option
    case TLI->ReciprocalEstimate::Disabled:
      return false;
    case TLI->ReciprocalEstimate::Unspecified:
      if (Root.getParent()->getParent()->getTarget().Options.UnsafeFPMath)
        // and here I check the command-line option if there is no per-function code
avt77 added a comment.Dec 23 2016, 9:01 AM

I'll let Simon decide, but these numbers are iffy. I can't necessarily conclude your patch increases compile time, but I can't conclude anything else either. In particular, the stock clang measurements have a variance of 20% between consecutive runs, so I have very little faith in the numbers collected.
Rafael recent(ish)ly published a set of suggestions/knobs to turn on to get relatively stable numbers on a Linux machine. I'm also pretty sure the topic of how to get {reliable numbers, numbers you can have faith in} has been discussed multiple times (look at the archives; Sean has generally pretty informative posts/insights on the topic).

Could you give me a reference to the info? Of course I'll try to find it myself but with your help it could be faster.

I'll let Simon decide, but these numbers are iffy. I can't necessarily conclude your patch increases compile time, but I can't conclude anything else either. In particular, the stock clang measurements have a variance of 20% between consecutive runs, so I have very little faith in the numbers collected.
Rafael recent(ish)ly published a set of suggestions/knobs to turn on to get relatively stable numbers on a Linux machine. I'm also pretty sure the topic of how to get {reliable numbers, numbers you can have faith in} has been discussed multiple times (look at the archives; Sean has generally pretty informative posts/insights on the topic).

Could you give me a reference to the info? Of course I'll try to find it myself but with your help it could be faster.

http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20161017/398831.html

avt77 added a comment.Dec 24 2016, 6:35 AM

I made new experiments, but now I used a dedicated computer for them:

atischenko@ip-172-31-21-62:~/workspaces$ cat time.log
cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/home/atischenko/workspaces/step3/build/bin/clang -DCMAKE_CXX_COMPILER=/home/atischenko/workspaces/step3/build/bin/clang++ -DCMAKE_C_FLAGS="-O3 -ffast-math" -DCMAKE_CXX_FLAGS="-O3 -ffast-math" ../llvm
time ninja -j4


Compiler with patch


real 32m10.783s
user 125m19.424s
sys 3m8.456s

real 31m20.432s
user 122m2.012s
sys 3m4.444s


On dedicated computer

real 31m23.564s
user 122m13.120s
sys 3m5.340s

real 31m28.115s
user 122m30.596s
sys 3m6.588s

real 31m22.861s
user 122m8.920s
sys 3m7.236s


Trunk based compiler


cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/home/atischenko/workspaces/bootstrap-trunk/build/bin/clang -DCMAKE_CXX_COMPILER=/home/atischenko/workspaces/bootstrap-trunk/build/bin/clang++ -DCMAKE_C_FLAGS="-O3 -ffast-math" -DCMAKE_CXX_FLAGS="-O3 -ffast-math" ../llvm
time ninja -j4

real 31m46.001s
user 123m39.192s
sys 3m10.180s

real 40m6.791s
user 156m5.472s
sys 3m36.476s


On dedicated computer

real 31m24.634s
user 122m14.912s
sys 3m8.080s

real 31m22.833s
user 122m9.836s
sys 3m5.676s

real 31m22.795s
user 122m5.924s
sys 3m8.588s

I hope we can now trust my results.

avt77 updated this revision to Diff 82605.Dec 28 2016, 9:35 AM

I fixed the last issues raised by Gerolf, except one related to the special case of "if", because the suggested change breaks the current logic.

Gerolf added a comment.Jan 2 2017, 1:49 PM

From my perspective the implementation is close and only requires a few minor changes.

The compile-time numbers I've seen so far are meaningless (wide variation, unclear if/how many times your code actually fires, etc.), but I'm not too concerned about O3 fast-math compile time and would give it the benefit of the doubt.

I did ask the question about performance benefits twice, to no avail, and admit I'm still curious. I assume that to get these numbers you set your machines into perf mode rather than using servers running some random load.

Thanks
Gerolf

lib/Target/X86/X86InstrInfo.cpp
10105

Add a comment before the function explaining what it does.

10109

it's -> it is

10177

This function takes 7 parameters as input, but the comment only lists 6.

10214

I keep stumbling over this comment every time I read it: did you mean to say something like
//Execute at least one iteration.
Iterations = Max(1, Iterations);

10217

Where is that input sequence documented?

avt77 updated this revision to Diff 84535.Jan 16 2017, 3:18 AM

I fixed everything except one comment (see below). And I collected new perf numbers. Now I used the following command for bootstrap building:

time make -j 1

As a result the reproducibility is very good from my point of view. In addition I tried to get numbers according to the description in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20161017/398831.html; the reproducibility is almost the same but the required time is even longer (about 2 hours for every test). Because of that I kept my numbers. The test itself is very simple: I created 2 versions of the compiler: the first one was built directly from trunk and the second one was built after applying my patch. Then with the help of each compiler I created 4 bootstraps. The results are below:

The trunk version of the compiler builds the bootstrap like this (there were 4 runs with "time make -j 1"):

real  91m7.998s
real  90m23.861s
real  90m26.154s
real  90m31.533s

The version of the compiler with my patch builds the bootstrap like this (the same 4 runs with "time make -j 1"):

real  90m43.970s
real  90m7.257s
real  90m6.671s
real  90m11.733s

Obviously, the compilation time does not depend on my patch.

avt77 added inline comments.Jan 16 2017, 3:22 AM
lib/Target/X86/X86InstrInfo.cpp
10177

I did not understand this comment: what should I do here?

RKSimon added inline comments.Jan 16 2017, 1:41 PM
lib/Target/X86/X86InstrInfo.cpp
10177

I think it means that while ArrayRef<int> Instrs has 7 instructions listed, the codegen in the comment only shows 6 instructions.

avt77 added inline comments.Jan 18 2017, 1:03 AM
lib/Target/X86/X86InstrInfo.cpp
10177

But in fact all 7 instructions are shown, just indexed from 0 to 6 (maybe in a "strange" order: 0,2,1,3,4,5,6). If you'd like, I could change the order and/or start numbering from 1. Gerolf, should I do that, or have we fixed everything?

chandlerc added a subscriber: chandlerc.

So, I'm mostly lurking, but I want to point out a serious issue here: Clang and LLVM have as little floating point in them as we could manage. So I would expect them to be quite uninteresting for testing the compile-time impact of a patch that is only concerned with floating-point code....

And there still are no numbers around the improvement here...

I suspect you'll need to provide benchmark data from at least SPEC and/or the LLVM test suite that shows this is an improvement, and compile-time numbers to show that the improvement doesn't cost too much... At least, that would be my expectation. Those at least do include some floating-point code. You might also try running benchmarks from the Eigen project, which has a very large amount of floating-point code. However, they usually don't build with any unsafe math flags, so correctness issues may dominate.

It'd also be great to hear from others invested in LLVM's FP lowering like Hal, Steve, etc...

avt77 added a comment.Jan 18 2017, 9:26 AM

What is "Eigen project"? Could you point me to it?

I'm leaning towards an LGTM since you addressed basically all my issues, but more people have chimed in and are curious about your performance data. So I think you can't dodge that question anymore and need to share some good data for the benchmark(s) you are tuning for before you get the nod.

Cheers
Gerolf

lib/Target/X86/X86InstrInfo.cpp
10177

In the comments I only count 6 instructions from vmovss to vaddss, but the code handles 7. What am I missing?

Unless the change is trivially and obviously an improvement on inspection of the result, I think you need data before making it. =] I've not looked at this particular change in enough detail to say whether it satisfies that. Maybe Gerolf or Craig could.

If there are compile time concerns (which actually came up) and they deal with floating point and unsafe floating point in particular, then I don't think a bootstrap is a useful way to measure anything one way or the other (there just isn't enough floating point).

What is "Eigen project"? Could you point me to it?

http://eigen.tuxfamily.org/

Unless the change is trivially and obviously an improvement on inspection of the result, I think you need data before making it. =] I've not looked at this particular change in enough detail to say whether it satisfies that. Maybe Gerolf or Craig could.

Another way besides benchmark data to show runtime improvements is to use tools like IACA. This is how we curated most of the vector shuffle improvements over the last few years, for example.
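
For reference, IACA analyzes a compiled object between marker macros; the following is only a hedged sketch of how the candidate sequence could be examined (IACA_START/IACA_END come from the iacaMarks.h header shipped with IACA, the function itself is illustrative, and the command line is approximate):

    // Sketch: let IACA report throughput/latency for the refinement step
    // instead of timing a full benchmark.
    #include "iacaMarks.h"
    float recip_newton(float a, float x0) {
      IACA_START
      float x1 = x0 * (2.0f - a * x0);  // sequence under analysis
      IACA_END
      return x1;
    }
    // Then analyze the object file, roughly: iaca -arch HSW recip_newton.o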

avt77 added a comment.Jan 19 2017, 3:04 AM

Chandler, in fact this patch should not show any improvement in the generated code. If you look at the changes made in the tests you'll see that the newly generated code is almost identical to the previous one (only some names, the order of instructions, etc.). The idea of the patch is to move this kind of optimization from a rather high level (DAGCombiner) down to a really low level (MachineCombiner). Here we see real target machine instructions, and as a result we can use the real cost model to estimate the real cost of a possible transformation (in the given case the transformation is the replacement of one instruction (div) with a sequence of instructions). The transformation itself already exists inside Clang, but the patch suggests implementing it in another place, and that's it. If we agree with this new place of implementation then it will be the base for future similar optimizations like rsqrt, etc. In addition, this (and follow-up) patch(es) will allow us to remove 'fake' subtarget features like FeatureFastScalarFSQRT / FeatureFastVectorFSQRT, etc. The question from Gerolf was not about the quality of the generated code (it's the same as we have now) but about the compilation time only.
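
To make the cost-model point concrete, here is a toy (non-LLVM) sketch of the decision the MachineCombiner makes with the scheduler model; the latencies are made up for illustration, not taken from any real model:

    // Toy illustration only: replace the divide when the dependent
    // estimate + refinement chain is shorter than the division's latency
    // on the CPU being targeted.
    #include <cstdio>
    int main() {
      const int DivLatency = 19;            // e.g. a slow scalar divide
      const int RecipChain[] = {5, 5, 5};   // estimate + two refinement ops
      int ChainLatency = 0;
      for (int L : RecipChain)
        ChainLatency += L;                  // fully dependent sequence
      std::printf("replace fdiv: %s\n",
                  ChainLatency < DivLatency ? "yes" : "no");
      return 0;
    }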

Of course I'll try to collect the required perf numbers, but they should be the same as what we have now. Do we really need them?

Hi All,
I found a really "stress" test for div operations (see the attachment)

(thanks to Sanjay Patel). The test shows perhaps the worst case of the possible degradation caused by this patch. I used the following command with 2 different compilers:

cmake -E time ./llc spill_div.ll -o /dev/null -enable-unsafe-fp-math

For "pure" trunk compiler I got: Elapsed time 2 s.
For compiler with patch I got: Elapsed time 18 s.

(I launched the test several times with the same results.)
What now? Is it acceptable? Should I try to optimize the patch? Should I try other benchmarks?
(I tried both LNT and Eigen but unfortunately they don't work for me at the moment because of unpredictable runtime issues.)

Do you have any profiling info on where the time is going please? @Gerolf might then be able to advise whether we can improve the MCCombiner mechanism before/after this patch has gone in.

I'll jump in here because I supplied this (hopefully degenerate and worst) case based on my earlier reassociation transforms for MachineCombiner (see D10321 where I first mentioned the potential compile-time problem). When I looked into that, the time was all going into MachineTraceMetrics::Ensemble::computeInstrDepths() and MachineTraceMetrics::Ensemble::computeInstrHeights(). Those got very expensive for long sequences of instructions. Possible fixes would be to improve how those are computed, cache those results, and/or eliminate how often we ask for those values.

We were ok with some additional potential compile-time cost because reassociation opportunities do not appear to be that common and were limited to -ffast-math compiles . We can make similar arguments for the recip transforms in this patch?

But it is worth noting that since the time of D10321, the number of reassociation candidate opcodes that x86 matches has grown to ~200 (X86InstrInfo::isAssociativeAndCommutative()) and includes integer ops. We're probably overdue on measuring and improving the perf of MachineCombiner.

I just re-read some of the comments in this review, and back on Nov 30 I explained the possible compile-time hit. I requested a note to llvm-dev at that time. Based on the confusion in the subsequent comments (people think this patch will affect execution perf), I am making the suggestion to post to llvm-dev again. Moving from DAG heuristics to MI-scheduler-based transforms is a change in strategy for all targets and explaining that to a wider audience is an opportunity to get good feedback.

hfinkel edited edge metadata.Jan 26 2017, 8:29 AM

Certainly seems worthwhile exploring whether those can be cached (if I understand what they're doing, we do essentially cache very-similar values during MI scheduling). This worst-case hit is definitely undesirable, and we can certainly run into lots of machine-generated straight-line code, so hitting these kinds of cases in the wild is not unthinkable.

We were ok with some additional potential compile-time cost because reassociation opportunities do not appear to be that common and were limited to -ffast-math compiles . We can make similar arguments for the recip transforms in this patch?

But it is worth noting that since the time of D10321, the number of reassociation candidate opcodes that x86 matches has grown to ~200 (X86InstrInfo::isAssociativeAndCommutative()) and includes integer ops. We're probably overdue on measuring and improving the perf of MachineCombiner.

I think the only issue that needs to be addressed is (finally!) sharing perf data. This has been raised at least 3 times. The possible compile-time implication, the speciality of the application (fast-math) etc are well understood.

Gerolf

As I understand it, the idea is that by moving this to the MC, these alternative patterns will only be used if (1) the fast-math flags permit it and (2) the target CPU's scheduler model indicates that it's quicker? So what you are asking is that we time the two versions of the code on specific CPUs to check that in each case the correct decision is made?

This probably means that the tests should be updated to check against a couple of specific target CPUs as well - we're limited by what x86 scheduler models we have, but I know Jaguar (btver2) should use the rcpps version in all cases, and I expect Haswell should use divps.

A quick look at the SandyBridge scheduler model suggests its latency for FDIV is too low (especially for ymm, as it only has a 128-bit div ALU), so that will select divps when it probably shouldn't....

avt77 added a comment.Jan 27 2017, 1:07 AM

I think the only issue that needs to be addressed is (finally!) sharing perf data. This has been raised at least 3 times. The possible compile-time implication, the speciality of the application (fast-math) etc are well understood.

Gerolf

Sorry, but I don't understand what "sharing" means in this case. I put all the perf numbers here in the comments. Is that not enough for sharing? If not, where should I share them? Or maybe my perf numbers are not perf numbers from your point of view? Please clarify.

And the next question is about profiling data. Should I collect it? I've already started the process, but now I'm not sure if it's of interest to anybody.

avt77 added a comment.Jan 27 2017, 3:45 AM

I got the first profiling data. In fact it's the same as what Sanjay described:

Samples: 1M of event 'cycles:pp', Event count (approx.): 1180464
Overhead Command Source Shared Object Source Symbol

15,65%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::computeInstrDepths
15,18%  llc      llc                   [.] getDataDeps
 9,23%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::computeInstrHeights
 8,29%  llc      llc                   [.] pushDepHeight
 8,15%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::invalidate
 5,64%  llc      llc                   [.] llvm::TargetInstrInfo::defaultDefLatency
 4,89%  llc      llc                   [.] llvm::MachineTraceMetrics::getResources
 2,44%  llc      llc                   [.] llvm::X86InstrInfo::isHighLatencyDef

Should I try to cache the metrics, or is that a question for a separate patch?

avt77 updated this revision to Diff 86042.Jan 27 2017, 4:39 AM

I updated the recip-fastmath2.ll test according to Simon's recommendations. Now it includes special checks for different CPUs: SandyBridge, Haswell and btver2. These new checks demonstrate that the alternative sequence of instructions is selected only when it's really cheaper than the single fdiv instruction. (Obviously we should change the cost numbers for SandyBridge because they are too small.)

I think you should look at caching them, or limiting their depth (beyond a certain point, the exact answer might not matter), or both.
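
A minimal sketch of the caching idea (plain C++, not MachineTraceMetrics itself): keep the computed per-block depths and recompute only when the block has been invalidated, instead of rebuilding the trace for every candidate combine.

    // Illustrative only: cache depths keyed by block identity plus a version
    // counter that callers bump whenever they modify the block.
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Block {
      std::vector<int> InstrLatency;  // one entry per instruction
      unsigned Version = 0;           // bumped on any modification
    };

    struct DepthCache {
      std::unordered_map<const Block *, std::pair<unsigned, std::vector<int>>> Cached;

      const std::vector<int> &depths(const Block &B) {
        auto &Entry = Cached[&B];
        if (Entry.second.size() != B.InstrLatency.size() || Entry.first != B.Version) {
          Entry.first = B.Version;
          Entry.second.clear();
          int D = 0;
          for (int L : B.InstrLatency) {  // straight-line dependence chain
            Entry.second.push_back(D);
            D += L;
          }
        }
        return Entry.second;
      }
    };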

Sorry, but I don't understand what "sharing" means in this case. I put all the perf numbers here in the comments. Is that not enough for sharing? If not, where should I share them? Or maybe my perf numbers are not perf numbers from your point of view? Please clarify.

You shared both compile-time numbers and runtime numbers by building clang, which is "insensitive" to floating point optimization. So you were asked to better motivate your change with benchmarks that can show codegen improvements in practice.

Chandler, in fact this patch should not show any improvement in generating code. [...] The idea of the patch is moving of such kind of optimization from the rather high level (DAGCombiner) to the really low level (MachineCombiner), [....] if we agree with this new place of implementation then it will be the base for future possible similar optimizations like rsqrt, etc. And in addition this (and follow up) patch(es) will allow us to remove 'fake' subtarget features like FeatureFastScalarFSQRT / FeatureFastVectorFSQRT, etc.

At this point, without an example showing that you can outperform the DAG approach, it is quite hypothetical ("believe me, it is better!").
However, GlobalISel could be a motivation? It'll need to reproduce all the combines from the DAG there, won't it?

The question from Gerolf was not about the quality of the generated code (it's the same like we have now) but about the compilation time only.

I think he was asking both :)

guyblank added inline comments.
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15331 ↗(On Diff #86042)

sorry for joining the discussion so late.

IIUC this affects all X86 CPUs, but I didn't see handling for all possible types (as stated in a TODO below).

specifically, does this affect AVX-512 code?

avt77 added inline comments.Jan 31 2017, 1:46 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15331 ↗(On Diff #86042)

At the moment we support only a limited set of types (see X86InstrInfo::genAlternativeCodeSequence below), but we are ready to extend it if necessary.

guyblank added inline comments.Feb 1 2017, 1:29 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15331 ↗(On Diff #86042)

you support the machine combiner approach for a limited set of types/instructions, but you've disabled the DAG combine approach for ALL types/instructions.

I ran CodeGen/X86/recip-fastmath.ll on knl with your changes.
the output for @f32_one_step (for example) changed from

vrcp14ss %xmm0, %xmm0, %xmm1
vfnmadd213ss .LCPI1_0(%rip), %xmm1, %xmm0
vfmadd132ss %xmm1, %xmm1, %xmm0

to
vmovss .LCPI1_0(%rip), %xmm1
vdivss %xmm0, %xmm1, %xmm0
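
For context, the three-instruction sequence above is one Newton-Raphson step written with FMAs; a scalar sketch of what it computes (assuming the loaded constant is 1.0) is:

    // e  = 1 - a*x0      (vfnmadd213ss with the memory constant)
    // x1 = x0*e + x0     (vfmadd132ss)
    // so x1 = x0 * (2 - a*x0), the usual refinement of x0 ~= 1/a.
    #include <cmath>
    float recip_one_step_fma(float a, float x0) {
      float e = std::fma(-a, x0, 1.0f);
      return std::fma(x0, e, x0);
    }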

avt77 updated this revision to Diff 86947.Feb 3 2017, 3:02 AM

I fixed the issue with the compile-time increase - see the usage of MinInstr->getTrace(MBB). Now we get the trace only when we really need it. As a result the execution profile changed completely and the compile time is now even less than it was with the DAG Combiner - about 1.5 s on my laptop (I'm speaking about our worst-case test only).

avt77 added inline comments.Feb 3 2017, 3:03 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15331 ↗(On Diff #86042)

I did not do anything with knl, and as a result you are right: this code generation was lost. Do you know of other examples like that? If the problem is with knl only then I could easily support it. But of course I could add a special option like "-use-machine-combiner" or "-use-dag-combiner". Which is better?

Please separate out the compile time improvements into a new patch for review - now that we have it the "compile time question" shouldn't hold up this patch any longer so we should be able to get this in as soon as the avx512 (knl/skx) issue is answered.

As for AVX512, I am ok with excluding it specifically in DAGCombiner.

But I'm also concerned about other cpus for which we don't have an accurate scheduler model (broadwell, skylake), should these be excluded as well?

lib/Target/X86/X86InstrInfo.cpp
10354

currently in Haswell and newer cpus, the generated sequence is using fma instructions.
this should be taken into account here in the patterns, right?

test/CodeGen/X86/recip-fastmath.ll
2–10

please add RUN commands for specific cpus (as in the other test + AVX512 target)
also commit the test changes and rebase your patch on them so we can see the output changes in these new runs.

I've added the target cpu specific tests to trunk (rL294128) to help us track AVX512 (SKX/KNL) perf

But I'm also concerned about other cpus for which we don't have an accurate scheduler model (broadwell, skylake), should these be excluded as well?

If the cpu is using an older scheduler model then it's already not necessarily being optimally scheduled - I don't see this patch being any different.

lib/Target/X86/X86InstrInfo.cpp
10354

Yes this should be taken into account, but it does mean that the number of codegen patterns is going to start to balloon if we're not careful.

But I'm also concerned about other cpus for which we don't have an accurate scheduler model (broadwell, skylake), should these be excluded as well?

If the cpu is using an older scheduler model then it's already not necessarily being optimally scheduled - I don't see this patch being any different.

Currently the older scheduler model "only" affects the schedule, but with this patch (and future ones using the machine combiner framework) it will affect which instructions we emit. This could possibly lead to generating worse code than we do at the moment.

mkuper added a subscriber: mkuper.Feb 6 2017, 12:58 AM

Recap:

This (and hopefully an equivalent future sqrt/rsqrt patch) is the stand-out example of a case where the actual result depends on the MC and the scheduler model - minor code changes could cause it to switch between full-precision divps and rcpps+nr, but this only happens with the necessary fast/unsafe flags enabled, which means the user knows what to expect.

For most other possible MC cases (e.g. constant rematerialization, shuffles, slow-vs-fast path instructions) there will be differences in the instructions used but not in the final result.

avt77 updated this revision to Diff 88003.Feb 10 2017, 8:31 AM

I fixed all known issues:

  • AVX512 is now again supported by DAGCombiner
  • FMA instructions are being used when FMA is enabled

This version clearly shows the advantage of using the scheduler model: it selects the reciprocal code only when it's profitable (e.g. compare v8f32_one_step and v8f32_one_step_2_divs, etc.)

In addition I removed the speedup-related code from this patch because it was included in the separate review D29627.

ABataev resigned from this revision.Feb 13 2017, 11:42 AM
avt77 added a comment.Feb 14 2017, 4:20 AM

Hi All,
Do we expect anything more here?
It seems I fixed all the requirements. Maybe it's time for an LGTM?

avt77 updated this revision to Diff 89241.Feb 21 2017, 10:29 AM

Guy Blank found a problem with the PIC relocation model on the SLM architecture. I fixed it and added the corresponding test.

avt77 updated this revision to Diff 90329.Mar 2 2017, 6:53 AM

I committed the PIC-related test to trunk and updated this patch so it can be compared with the new code generation.

RKSimon added inline comments.Mar 9 2017, 11:27 AM
lib/CodeGen/MachineTraceMetrics.cpp
521

Is this still true after D29627?

avt77 added inline comments.Mar 10 2017, 12:09 AM
lib/CodeGen/MachineTraceMetrics.cpp
521

Yes, the profile shows that now (after D29627) this feature eats more time than any other

58,19%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::invalidate
 3,44%  llc      llc                   [.] (anonymous namespace)::TwoAddressInstructionPass::scanUses
 3,19%  llc      llc                   [.] llvm::ScheduleDAGSDNodes::ClusterNeighboringLoads
 1,59%  llc      llc                   [.] llvm::SparseMultiSet<llvm::VReg2SUnit, llvm::VirtReg2IndexFunctor, unsigned char>::find
 1,08%  llc      llc                   [.] llvm::X86InstrInfo::areLoadsFromSameBasePtr
RKSimon edited edge metadata.Nov 18 2017, 9:57 AM

@avt77 As we discussed offline, please can you strip out the debug changes and put them into a new patch?

RKSimon resigned from this revision.Feb 19 2019, 10:30 AM