This is an archive of the discontinued LLVM Phabricator instance.

New unsafe-fp-math implementation for X86 target
Needs Review · Public

Authored by avt77 on Nov 18 2016, 7:24 AM.

Details

Summary

The current fast-math implementation is based on DAGCombiner. One disadvantage of that approach is that it ignores any cost model (throughput and/or length). Another is that target-specific optimizations are implemented at an improper level of transformation. The introduced version moves the implementation lower, into MachineCombiner. As a result we reuse the existing transformation mechanism, namely getMachineCombinerPatterns and genAlternativeCodeSequence. In addition, both the throughput and length cost models are applied automatically.
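For readers coming to this later: a rough sketch of the two hooks involved (the FDivRecip enumerator here is hypothetical; the patch defines its own pattern names):

bool X86InstrInfo::getMachineCombinerPatterns(
    MachineInstr &Root,
    SmallVectorImpl<MachineCombinerPattern> &Patterns) const {
  // Offer a reciprocal-estimate expansion as an alternative to a scalar
  // SSE divide; MachineCombiner keeps it only if the cost model agrees.
  if (Root.getOpcode() == X86::VDIVSSrr) {
    Patterns.push_back(MachineCombinerPattern::FDivRecip); // hypothetical name
    return true;
  }
  return TargetInstrInfo::getMachineCombinerPatterns(Root, Patterns);
}
// genAlternativeCodeSequence() then materializes the instructions for a
// chosen pattern, and MachineCombiner compares the old vs. new cost.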

This patch is only an initial step to demonstrate the intention. It implements only one type of transformation: reciprocal-estimate code in place of the vdivss instruction. I'm going to support the other types of reciprocal optimizations as well, but I'd like to get initial comments on this work first. The code compiles cleanly and produces working output.
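For reference, a minimal standalone sketch (plain SSE intrinsics, not the patch's MachineCombiner code) of the one-step refinement that "reciprocal estimated code" refers to:

#include <xmmintrin.h>

// rcpss gives roughly 12 bits of precision; one Newton-Raphson step
// (x1 = x0 * (2 - b*x0)) brings the estimate close to full single precision.
static float recip_one_step(float b) {
  float x0 = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(b))); // hardware estimate
  return x0 * (2.0f - b * x0);                         // refined ~1/b
}

// Under unsafe-fp-math, a / b can then be emitted as a * recip_one_step(b).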

Diff Detail

Event Timeline

avt77 updated this revision to Diff 78520.Nov 18 2016, 7:24 AM
avt77 retitled this revision from to New unsafe-fp-math implementation for X86 target.
avt77 updated this object.
avt77 added reviewers: RKSimon, spatel, ABataev.
RKSimon added inline comments.Nov 21 2016, 7:20 AM
lib/Target/X86/X86InstrInfo.cpp
8726

This needs refactoring to support scalar and packed versions of SSE and AVX opcodes if possible.

8774

We don't do reciprocal estimates for double types - these all need removing.

8816

Replace this with a basic TODO comment for future FSQRT pattern support - don't comment out code (especially when it doesn't exist).

avt77 updated this revision to Diff 79408.Nov 28 2016, 9:04 AM

The new reciprocal implementation is more than 97% done: we still don't have public tests, but the code changes are almost complete. Please review and send me your comments.

spatel edited edge metadata.Nov 29 2016, 1:22 PM

I applied the patch to trunk, but every example that I tried after that crashed:
Assertion failed: (i < getNumOperands() && "getOperand() out of range!"), function getOperand, file llvm/include/llvm/CodeGen/MachineInstr.h, line 280.

Is it correct that this patch will allow us to remove 'fake' subtarget features like FeatureFastScalarFSQRT / FeatureFastVectorFSQRT ?

Sorry for the crashes - they are the result of my last-minute changes and missing tests. I hope to fix it today or tomorrow.
About FeatureFastScalarFSQRT / FeatureFastVectorFSQRT: I have not looked at these features yet, but it seems the answer is YES, we'll be able to remove them, because the corresponding functionality should appear automatically. The idea of the patch is that the throughput (latency) should be calculated automatically for every specific alternative code sequence (e.g. recip instead of div), but I have not yet checked whether that works at the moment. And we should check other similar features if they exist (FeatureFastLZCNT, etc.). Does that make sense?
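In outline, the decision MachineCombiner makes is a latency comparison along the critical path; a simplified illustration only (the real check in MachineCombiner.cpp also accounts for slack and resource length):

// Accept the alternative sequence only if the new root's depth plus
// latency does not exceed the old root's, i.e. the critical path through
// the combined instructions does not get longer.
static bool improvesCriticalPath(unsigned NewDepth, unsigned NewLatency,
                                 unsigned OldDepth, unsigned OldLatency) {
  return NewDepth + NewLatency <= OldDepth + OldLatency;
}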

BTW, I have a question: this is the first patch in a possible series of patches. This patch introduces the suggested approach using the FDiv -> Recip implementation as an example. Should I complete this patch with FDIV only, or should I extend it with rsqrt? The patch is rather big, which is why I suggest making it really work with fdiv only and committing that. After that I'll be able to make the other similar patches faster and more safely. Is that OK?

avt77 added a comment.Nov 30 2016, 4:04 AM

In fact, if you try something like

llc < .../llvm/test/CodeGen/X86/recip-fastmath.ll -mtriple=x86_64-unknown-unknown -mattr=+avx

it should already work: up to now I was working with this test only.

I think this is the right direction, because we want to make use of the actual CPU scheduler model instead of estimates, but implementing this will almost certainly lead to unintended consequences that need to be worked around. For example, one thing I noticed when adding the reassociation patterns to the machine combiner: we could significantly increase compile time, because calculating critical path/resource height with MachineTraceMetrics can be expensive.

You should send a note to llvm-dev about this plan, the motivation, the benefits, etc. and reference this patch as the starting point. The changes you are proposing could be adopted by all targets, so there should be feedback from various backend people to avoid duplicated effort.

ABataev edited edge metadata.Nov 30 2016, 7:13 AM

Is this patch an NFC patch? If not, tests must be added.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
14950 ↗(On Diff #79408)

Why is this line commented out? If it is not required (though I rather doubt that), you should just remove it.

lib/Target/X86/X86InstrInfo.cpp
8651–8653

I believe this can be replaced by a range-based loop

8658

auto &?

8674–8678

Should these loops be range-based loops?

8681–8683

Range-based loop?

8687–8689

No braces required

8762–8763

reformat this

RKSimon edited edge metadata.Nov 30 2016, 7:38 AM

Still looking through this, but here are some initial comments.

lib/Target/X86/X86InstrInfo.cpp
8651

Use a range iterator if you can, same for the other cases below

8661

This should be C->isExactlyValue(1.0)

8669

This should be isDividendExactlyOne. We should be calling it Dividend, not Divident, as well.

8688

Embed hasAllOnesOperand here.

8722

Replace SmallVector<int, 7> with ArrayRef<int>
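The suggestion, sketched (the function and array names are illustrative, not from the patch):

// ArrayRef<int> is a non-owning view, so callers can pass a plain C array,
// a SmallVector, or a std::vector without copying:
static void emitRecipSequence(ArrayRef<int> Instrs);

static const int RecipOpcodes[] = {0 /* opcode list elided */};
// ...
emitRecipSequence(RecipOpcodes); // implicit conversion from the C array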

8836

This isn't necessary if DividentIsAllOnes == true

avt77 updated this revision to Diff 79748.Nov 30 2016, 8:33 AM
avt77 edited edge metadata.

The new version for further discussion.

avt77 updated this revision to Diff 79900.Dec 1 2016, 5:53 AM

I fixed one failing assertion (but there is another one at the moment), moved to range-based for loops, reformatted, and made some cosmetic changes.

avt77 updated this revision to Diff 80065.Dec 2 2016, 6:53 AM

All failing assertions have been resolved. The ability to use the old reciprocal implementation for non-x86 platforms was restored. Now we have only one failing test, CodeGen/X86/recip-fastmath.ll, but that's exactly what we were expecting, because I changed the code generation here. I hope that's 99.99% of the patch - please review.

avt77 updated this revision to Diff 80397.Dec 6 2016, 2:58 AM

I updated the test related to reciprocal codegen. Now it covers most of the possible dividend variants (not only 1.0 as before). The patch is now ready for final review.

Some minor comments, but I'd like to hear other people's thoughts. This might be the first of a number of X86 MC patterns and I want to be sure we're doing this correctly.

include/llvm/CodeGen/MachineInstr.h
1155 ↗(On Diff #80397)

Explain the purpose of the new 'P' argument?

lib/Target/X86/X86InstrInfo.cpp
8648

Rename this to hasExactlyOneOperand? AllOnes means 'all bits are set'.

8677

Rename to isDividendExactlyOne and update the comment.

8748

Dividend is only used once, merge these?

unsigned DividendReg = Root.getOperand(1).getReg();
8749

AllOnes -> ExactlyOne

8814

Can we replace TII->get(Instrs[2]) with MIDesc?

8821

Can we replace TII->get(Instrs[2]) with MIDesc?

8933

Inconsistent braces compared to the other cases.

Also, please add the additional tests to trunk now with the current codegen so this patch shows the diff.

avt77 updated this revision to Diff 80583.Dec 7 2016, 5:41 AM

I addressed all the comments raised by Simon and Alexey. The dedicated test is included as well.

avt77 updated this revision to Diff 81735.Dec 16 2016, 4:38 AM

I've updated the reciprocal-related tests to show the difference between the old and new codegen more clearly. In fact there is no real difference, but the new approach allows us to take the scheduler cost model into account when we deal with different machine-code patterns. This patch should become the first step toward future similar optimizations like rsqrt, etc.

Does anyone else have comments on this please? There has been interest in moving more of this kind of codegen to the MachineCombiner so that we can use more realistic scheduling costs, so I'm keen to see this go in.

test/CodeGen/X86/recip-fastmath2.ll
191 ↗(On Diff #81735)

Any ideas why we fail to get:

movaps {{.*#+}} xmm1 = [1.000000e+00,1.000000e+00,1.000000e+00,1.000000e+00]

There are other cases here as well.

Gerolf edited edge metadata.Dec 21 2016, 2:36 PM

A few general remarks:

  • I'm very much in favor of the MC combiner approach
  • But I'm getting increasingly concerned about the MachineCombiner code quality. I felt the FMA/AArch64 support was starting to look crabby (I'm the culprit) and this doesn't look much better, I'm afraid. The approach is in need of more automation for better maintainability. This is meant as food for thought. Since I'm to blame for the approach, I can't be harsh in reviews :-)
  • I'm also concerned about the compile time, in particular since we don't track x86-specific issues (or any backend other than AArch64 - at least I'm not aware that anyone is watching closely). Could you share some specific data about your perf gains and compile-time measurements? However, I think this optimization is for fast-math only, and compile time is probably less of an issue in that mode. One way to deal with compile-time issues is to wrap some MC under an option.
  • Perhaps I missed it, but I expected the optimization to kick in only under fast math. I saw 'fast' in the test cases, but didn't see a check in the code.

Thanks
Gerolf

lib/Target/X86/X86InstrInfo.cpp
8654

This is an expensive search. There must be a simpler, more direct way to get that info.

8705

For which types? All FP?

8710

Can you elaborate, e.g. on the number of iterations?

8757

Why?

Gerolf added inline comments.Dec 21 2016, 2:36 PM
include/llvm/CodeGen/MachineInstr.h
1154 ↗(On Diff #81735)

MBB instead of P? Is this for debugging?

lib/CodeGen/MachineCombiner.cpp
158 ↗(On Diff #81735)

Please add a comment that explains why default is used.

+ assert(DefIdx || UseIdx);

lib/CodeGen/MachineInstr.cpp
1712 ↗(On Diff #81735)

You can use the new parameter and then: if (!MBB) MBB = getParent();

So what do you suggest? Can we get this in before refactoring the MC alternative code pattern system, or will that cause too much additional complexity?

Either way, it sounds like we'd be better off doing this all after the 4.0 branch - do you agree?

avt77 added a comment.Dec 22 2016, 4:38 AM

What do you mean when you speak about "automation"? Do you mean the possibility of describing alternative sequences with tools like TableGen? If so, I'm afraid it'll require some real time to implement. That's why, from my point of view, hand-made patterns similar to the given one could be really useful in the future. But of course it'd be really interesting to launch such a project. Right?

avt77 updated this revision to Diff 82342.Dec 22 2016, 8:12 AM
avt77 edited edge metadata.
  1. The current trunk already has changes in MachineInstr::print, etc. similar to those in my initial patch. Because of that I removed all the corresponding changes and did not answer the corresponding comments.
  2. It seems I fixed all the other issues raised by Gerolf.
  3. But the main question is the same: should we continue with this effort?

Absolutely, but I'm starting to think it makes sense to wait until after the 4.0 branch, then get this in, as we need an x86 implementation for reference, and then begin an investigation into how we want the MachineCombiner to develop.

The "automatic" generation of pattern e.g. with TableGen is on my longer term wish list, not a requirement for this patch. Sorry if my wording was confusing.
Do you have performance numbers?

lib/Target/X86/X86InstrInfo.cpp
8652

-> dividend

8666

Why this special case? Just C = C->getSplatValue() would be easier to read.

8758

wey -> why?

8832

Should there be a 2:?

avt77 added a comment.Dec 23 2016, 8:34 AM

Yes, I've just got the numbers. I created 2 versions of the clang compiler: one directly from trunk and one with my patch applied. Then, with the help of these compilers, I built 2 new compilers with the following configuration:

cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=<trunk/patch compiler home>/build/bin/clang -DCMAKE_CXX_COMPILER=<trunk/patch compiler home>/build/bin/clang++ -DCMAKE_C_FLAGS="-O3 -ffast-math" -DCMAKE_CXX_FLAGS="-O3 -ffast-math" ../llvm

Below you can see the times (I ran 2 experiments for each compiler):


Compiler with patch


real 32m10.783s
user 125m19.424s
sys 3m8.456s

real 31m20.432s
user 122m2.012s
sys 3m4.444s


Trunk based compiler


real 31m46.001s
user 123m39.192s
sys 3m10.180s

real 40m6.791s
user 156m5.472s
sys 3m36.476s

Of course these are very rough estimates, because I used our shared server and there is a lot running on it. But the general picture is clear from my point of view: my patch does not increase compilation time.
Is that enough, or should I run other experiments?

davide added a subscriber: davide.Dec 23 2016, 8:50 AM

I'll let Simon decide, but these numbers are iffy. I can't necessarily conclude your patch increases compile time, but I can't conclude anything else either. In particular, the stock clang measurements have a variance of 20% between consecutive runs, so I have very little faith in the numbers collected.
Rafael recent(ish)ly published a set of suggestions/knobs to turn on to get relatively stable numbers on a Linux machine. I'm also pretty sure the topic of how to get {reliable numbers, numbers you can have faith in} has been discussed multiple times (look at the archives; Sean has generally pretty informative posts/insights on the topic).

avt77 added a comment.Dec 23 2016, 8:53 AM

Two more comments:

  1. I did not update the patch according to the latest comments from Gerolf: I'll do it ASAP.
  2. Gerolf asked: "Perhaps I missed it but I expected the optimization to kick in only under fast math. I saw 'fast' in the test cases, but didn't see a check in the code."

    I check fast-math in "static bool getFDIVPatterns":

    switch (TLI->getRecipEstimateDivEnabled(VT, *MF)) { // this line checks the per-function option
    case TLI->ReciprocalEstimate::Disabled:
      return false;
    case TLI->ReciprocalEstimate::Unspecified:
      // and here I check the command-line option if there is no per-function attribute
      if (Root.getParent()->getParent()->getTarget().Options.UnsafeFPMath)
        ...
avt77 added a comment.Dec 23 2016, 9:01 AM

Could you give me a reference to that info? Of course I'll try to find it myself, but with your help it would be faster.

http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20161017/398831.html

avt77 added a comment.Dec 24 2016, 6:35 AM

I ran new experiments, this time on a dedicated computer:

atischenko@ip-172-31-21-62:~/workspaces$ cat time.log
cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/home/atischenko/workspaces/step3/build/bin/clang -DCMAKE_CXX_COMPILER=/home/atischenko/workspaces/step3/build/bin/clang++ -DCMAKE_C_FLAGS="-O3 -ffast-math" -DCMAKE_CXX_FLAGS="-O3 -ffast-math" ../llvm
time ninja -j4


Compiler with patch


real 32m10.783s
user 125m19.424s
sys 3m8.456s

real 31m20.432s
user 122m2.012s
sys 3m4.444s


On the dedicated computer

real 31m23.564s
user 122m13.120s
sys 3m5.340s

real 31m28.115s
user 122m30.596s
sys 3m6.588s

real 31m22.861s
user 122m8.920s
sys 3m7.236s


Trunk based compiler


cmake -G "Ninja" -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=/home/atischenko/workspaces/bootstrap-trunk/build/bin/clang -DCMAKE_CXX_COMPILER=/home/atischenko/workspaces/bootstrap-trunk/build/bin/clang++ -DCMAKE_C_FLAGS="-O3 -ffast-math" -DCMAKE_CXX_FLAGS="-O3 -ffast-math" ../llvm
time ninja -j4

real 31m46.001s
user 123m39.192s
sys 3m10.180s

real 40m6.791s
user 156m5.472s
sys 3m36.476s


On the dedicated computer

real 31m24.634s
user 122m14.912s
sys 3m8.080s

real 31m22.833s
user 122m9.836s
sys 3m5.676s

real 31m22.795s
user 122m5.924s
sys 3m8.588s

I hope we can now trust my results.

avt77 updated this revision to Diff 82605.Dec 28 2016, 9:35 AM

I fixed the last issues raised by Gerolf except one, related to the special case of an "if", because the suggested change breaks the current logic.

Gerolf added a comment.Jan 2 2017, 1:49 PM

From my perspective the implementation is close and only requires a few minor changes.

The compile-time numbers I've seen so far are meaningless (wide variation, unclear if/how many times your code actually fires, etc.), but I'm not too concerned about O3 fast-math compile time and would give it the benefit of the doubt.

I did ask the question about performance benefits twice, to no avail, and admit I'm still curious. I assume that to get these numbers you set your machines into perf mode rather than using servers running some random load.

Thanks
Gerolf

lib/Target/X86/X86InstrInfo.cpp
8648

Add a comment before the function explaining what it does.

8652

it's -> it is

8720

The input to this function is 7 parameters; the comment only lists 6.

8757

I keep stumbling over this comment every time I read it: did you mean to say something like
// Execute at least one iteration.
Iterations = Max(1, Iterations);

8760

Where is that input sequence documented?

avt77 updated this revision to Diff 84535.Jan 16 2017, 3:18 AM

I fixed everything except one comment (see below). And I collected new perf numbers. Now I used the following command for bootstrap building:

time make -j 1

As a result, reproducibility is very good from my point of view. In addition, I tried to get numbers according to the description in http://lists.llvm.org/pipermail/llvm-commits/Week-of-Mon-20161017/398831.html but while reproducibility was almost the same, the required time was even longer (about 2 hours for every run). Because of that I kept my numbers. The test itself is very simple: I created 2 versions of the compiler: the first built directly from trunk and the second built after applying my patch. Then with each compiler I built 4 bootstraps. The results are below:

The trunk version of the compiler builds the bootstrap as follows (there were 4 runs of "time make -j 1"):

real  91m7.998s
real  90m23.861s
real  90m26.154s
real  90m31.533s

The version of the compiler with my patch builds the bootstrap as follows (the same 4 runs of "time make -j 1"):

real  90m43.970s
real  90m7.257s
real  90m6.671s
real  90m11.733s

Obviously, the compilation time does not depend on my patch.

avt77 added inline comments.Jan 16 2017, 3:22 AM
lib/Target/X86/X86InstrInfo.cpp
8720

I did not understand this comment: what should I do here?

RKSimon added inline comments.Jan 16 2017, 1:41 PM
lib/Target/X86/X86InstrInfo.cpp
8720

I think it means that while ArrayRef<int> Instrs has 7 instructions listed, the codegen in the comment only shows 6 instructions.

avt77 added inline comments.Jan 18 2017, 1:03 AM
lib/Target/X86/X86InstrInfo.cpp
8720

But in fact all 7 instructions are shown, just indexed from 0 to 6 (maybe in a "strange" order: 0,2,1,3,4,5,6). If you'd like, I could change the order and/or start numbering from 1. Gerolf, should I do that, or have we fixed everything?

chandlerc added a subscriber: chandlerc.

I fixed everything except one comment (see below). And I collected new perf numbers. Now I used the following command for bootstrap building:

time make -j 1

So, I'm mostly lurking, but I want to point out a serious issue here: Clang and LLVM have as little floating point in them as we could manage. So I would expect them to be quite uninteresting for testing the compile-time impact of a patch that is only concerned with floating point code...

And there still are no numbers around the improvement here...

I suspect you'll need to provide benchmark data from at least SPEC and/or the LLVM test suite that shows this is an improvement, and compile-time numbers to show that the improvement doesn't cost too much... At least, that would be my expectation. Those at least do include some floating point code. You might also try running benchmarks from the Eigen project, which has a very large amount of floating point code. However, they usually don't build with any unsafe math flags, so correctness issues may dominate.

It'd also be great to hear from others invested in LLVM's FP lowering like Hal, Steve, etc...

avt77 added a comment.Jan 18 2017, 9:26 AM

What is "Eigen project"? Could you point me to it?

I'm leaning towards an LGTM since you addressed basically all my issues, but more people have mushroomed up and are curious about your performance data. So I think you can't dodge that question anymore and need to share some good data for the benchmark(s) you are tuning for before you get the nod.

Cheers
Gerolf

lib/Target/X86/X86InstrInfo.cpp
8720

In the comments I only count 6 instructions from vmovss to vaddss, but the code handles 7. What am I missing?

Unless the change is trivially and obviously an improvement on inspection of the result, I think you need data before making it. =] I've not looked at this particular change in enough detail to say whether it satisfies that. Maybe Gerolf or Craig could.

If there are compile time concerns (which actually came up) and they deal with floating point and unsafe floating point in particular, then I don't think a bootstrap is a useful way to measure anything one way or the other (there just isn't enough floating point).

What is "Eigen project"? Could you point me to it?

http://eigen.tuxfamily.org/

Unless the change is trivially and obviously an improvement on inspection of the result, I think you need data before making it. =] I've not looked at this particular change in enough detail to say whether it satisfies that. Maybe Gerolf or Craig could.

Another way besides benchmark data to show runtime improvements is tools like IACA. This is how we curated most of the vector shuffle improvements over the last few years, for example.

avt77 added a comment.Jan 19 2017, 3:04 AM

Chandler, in fact this patch should not show any improvement in the generated code. If you look at the changes made to the tests, you'll see that the newly generated code is almost identical to the previous one (only some names, the order of instructions, etc.). The idea of the patch is to move this kind of optimization from a rather high level (DAGCombiner) down to a really low level (MachineCombiner). Here we see real target machine instructions, and as a result we can use a real cost model to estimate the real cost of a possible transformation (in the given case, the transformation is the replacement of one instruction (div) with a sequence of instructions). The transformation itself already exists inside Clang; the patch suggests implementing it in another place, and that's it. If we agree on this new place of implementation, then it will be the base for future similar optimizations like rsqrt, etc. In addition, this (and follow-up) patch(es) will allow us to remove 'fake' subtarget features like FeatureFastScalarFSQRT / FeatureFastVectorFSQRT, etc. The question from Gerolf was not about the quality of the generated code (it's the same as we have now) but about compilation time only.

Of course I'll try to collect the required perf numbers, but they should be the same as we have now. Do we really need them?

Hi All,
I found a real "stress" test for div operations (see the attachment)

(thanks to Sanjay Patel). The test shows perhaps the worst case of the possible degradation caused by this patch. I used the following command with 2 different compilers:

cmake -E time ./llc spill_div.ll -o /dev/null -enable-unsafe-fp-math

For "pure" trunk compiler I got: Elapsed time 2 s.
For compiler with patch I got: Elapsed time 18 s.

(I ran the test several times with the same results.)
What now? Is this acceptable? Should I try to optimize the patch? Should I try other benchmarks?
(I tried both LNT and Eigen, but unfortunately they don't work for me at the moment because of unpredictable runtime issues.)

Do you have any profiling info on where the time is going please? @Gerolf might then be able to advise whether we can improve the MCCombiner mechanism before/after this patch has gone in.

I'll jump in here because I supplied this (hopefully degenerate and worst) case based on my earlier reassociation transforms for MachineCombiner (see D10321, where I first mentioned the potential compile-time problem). When I looked into that, the time was all going into MachineTraceMetrics::Ensemble::computeInstrDepths() and MachineTraceMetrics::Ensemble::computeInstrHeights(). Those got very expensive for long sequences of instructions. Possible fixes would be to improve how those are computed, cache those results, and/or eliminate how often we ask for those values.

We were OK with some additional potential compile-time cost because reassociation opportunities do not appear to be that common and were limited to -ffast-math compiles. We can make similar arguments for the recip transforms in this patch?

But it is worth noting that since the time of D10321, the number of reassociation candidate opcodes that x86 matches has grown to ~200 (X86InstrInfo::isAssociativeAndCommutative()) and includes integer ops. We're probably overdue on measuring and improving the perf of MachineCombiner.
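For context, the hook mentioned there is shaped roughly like this (a heavily abridged sketch, nowhere near the full ~200-opcode list):

bool X86InstrInfo::isAssociativeAndCommutative(const MachineInstr &Inst) const {
  switch (Inst.getOpcode()) {
  case X86::PANDrr: case X86::PORrr:    // integer ops reassociate freely
    return true;
  case X86::ADDSSrr: case X86::MULSSrr: // FP ops only under unsafe FP math
    return Inst.getParent()->getParent()->getTarget().Options.UnsafeFPMath;
  default:
    return false;
  }
}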

I just re-read some of the comments in this review, and back on Nov 30 I explained the possible compile-time hit. I requested a note to llvm-dev at that time. Based on the confusion in the subsequent comments (people think this patch will affect execution perf), I am again suggesting a post to llvm-dev. Moving from DAG heuristics to MI-scheduler-based transforms is a change in strategy for all targets, and explaining that to a wider audience is an opportunity to get good feedback.

hfinkel edited edge metadata.Jan 26 2017, 8:29 AM

Certainly seems worthwhile exploring whether those can be cached (if I understand what they're doing, we do essentially cache very-similar values during MI scheduling). This worst-case hit is definitely undesirable, and we can certainly run into lots of machine-generated straight-line code, so hitting these kinds of cases in the wild is not unthinkable.

I think the only issue that needs to be addressed is (finally!) sharing perf data. This has been raised at least 3 times. The possible compile-time implication, the specialized nature of the application (fast-math), etc. are well understood.

Gerolf

As I understand it, the idea is that by moving this to the MachineCombiner, these alternative patterns will only be used if (1) the fast-math mode permits it and (2) the target CPU's scheduler model indicates that it's quicker. So what you are asking is that we time the two versions of the code on specific CPUs to check that the correct decision is made in each case?

This probably means that the tests should be updated to check against a couple of specific target CPUs as well - we're limited by which x86 scheduler models we have, but I know Jaguar (btver2) should use the rcpps version in all cases, and I expect Haswell should use divps.

A quick look at the SandyBridge scheduler model suggests its latency for FDIV is too low (especially for ymm, as it only has a 128-bit div ALU), so it will select divps when it probably shouldn't...

avt77 added a comment.Jan 27 2017, 1:07 AM

Sorry, but I don't understand what "sharing" means in this case. I put all the perf numbers here in the comments. Is that not enough for sharing? If not, where should I share them? Or maybe my perf numbers are not perf numbers from your point of view? Please clarify.

And a next question, about profiling data. Should I collect it? I've already started the process, but now I'm not sure whether it's interesting to anybody.

avt77 added a comment.Jan 27 2017, 3:45 AM

I got the first profiling data. In fact it's the same as what Sanjay described:

Samples: 1M of event 'cycles:pp', Event count (approx.): 1180464
Overhead Command Source Shared Object Source Symbol

15,65%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::computeInstrDepths
15,18%  llc      llc                   [.] getDataDeps
 9,23%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::computeInstrHeights
 8,29%  llc      llc                   [.] pushDepHeight
 8,15%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::invalidate
 5,64%  llc      llc                   [.] llvm::TargetInstrInfo::defaultDefLatency
 4,89%  llc      llc                   [.] llvm::MachineTraceMetrics::getResources
 2,44%  llc      llc                   [.] llvm::X86InstrInfo::isHighLatencyDef

Should I try to cache the metrics, or is that a question for a separate patch?

avt77 updated this revision to Diff 86042.Jan 27 2017, 4:39 AM

I updated the recip-fastmath2.ll test according to Simon's recommendations. Now it includes dedicated checks for different CPUs: SandyBridge, Haswell, and btver2. These new checks demonstrate that the alternative instruction sequence is selected only when it's really cheaper than the single fdiv instruction. (Obviously we should change the cost numbers for SandyBridge, because they are too small.)

Should I try to cache the metrics, or is that a question for a separate patch?

I think you should look at caching them, or limiting their depth (beyond a certain point, the exact answer might not matter), or both.

Sorry, but I don't understand what "sharing" means in this case. I put all the perf numbers here in the comments. Is that not enough for sharing? If not, where should I share them? Or maybe my perf numbers are not perf numbers from your point of view? Please clarify.

You shared both compile-time numbers and runtime numbers by building clang, which is "insensitive" to floating point optimization. So you were asked to better motivate your change with benchmarks that can show codegen improvements in practice.

Chandler, in fact this patch should not show any improvement in the generated code. [...] The idea of the patch is to move this kind of optimization from a rather high level (DAGCombiner) down to a really low level (MachineCombiner). [...] If we agree on this new place of implementation, then it will be the base for future similar optimizations like rsqrt, etc. In addition, this (and follow-up) patch(es) will allow us to remove 'fake' subtarget features like FeatureFastScalarFSQRT / FeatureFastVectorFSQRT, etc.

At this point, without an example showing that you can outperform the DAG approach, it is quite hypothetical ("believe me, it is better!").
However, GlobalISel could be a motivation? It'll need to reproduce all the combines from the DAG there, won't it?

The question from Gerolf was not about the quality of the generated code (it's the same as we have now) but about compilation time only.

I think he was asking both :)

guyblank added inline comments.
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15331 ↗(On Diff #86042)

Sorry for joining the discussion so late.

IIUC this affects all X86 CPUs, but I didn't see handling for all possible types (as stated in a TODO below).

Specifically, does this affect AVX-512 code?

avt77 added inline comments.Jan 31 2017, 1:46 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15331 ↗(On Diff #86042)

At the moment we support only a limited set of types (see X86InstrInfo::genAlternativeCodeSequence below), but we are ready to extend it if necessary.

guyblank added inline comments.Feb 1 2017, 1:29 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15331 ↗(On Diff #86042)

You support the machine combiner approach for a limited set of types/instructions, but you've disabled the DAG combine approach for ALL types/instructions.

I ran CodeGen/X86/recip-fastmath.ll on knl with your changes.
The output for @f32_one_step (for example) changed from

vrcp14ss %xmm0, %xmm0, %xmm1
vfnmadd213ss .LCPI1_0(%rip), %xmm1, %xmm0
vfmadd132ss %xmm1, %xmm1, %xmm0

to
vmovss .LCPI1_0(%rip), %xmm1
vdivss %xmm0, %xmm1, %xmm0
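For reference, the three-instruction sequence that was lost is a one-step Newton-Raphson refinement; a scalar sketch of the math (assuming .LCPI1_0 holds 1.0):

// x0 = vrcp14ss(b)        -> ~14-bit reciprocal estimate
// e  = 1.0f - b * x0      -> vfnmadd213ss against the 1.0 constant
// x1 = x0 + x0 * e        -> vfmadd132ss, refined estimate of 1/b
static float recip_fma_step(float b, float x0) {
  float e = 1.0f - b * x0;
  return x0 + x0 * e;
}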

avt77 updated this revision to Diff 86947.Feb 3 2017, 3:02 AM

I fixed the issue with the compile-time increase - see the usage of MinInstr->getTrace(MBB). Now we get the trace only when we really need it. As a result the execution profile changed completely, and the compile time is now even lower than it was with DAGCombiner - about 1.5 s on my laptop (I'm speaking about our worst-case test only).
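The shape of that fix, as a rough sketch of the combiner loop (simplified, not the exact diff):

// Ask for candidate patterns first; only when at least one exists do we
// pay for the expensive trace (depth/height) computation.
SmallVector<MachineCombinerPattern, 16> Patterns;
if (!TII->getMachineCombinerPatterns(MI, Patterns))
  continue; // no candidate pattern -> no MachineTraceMetrics query at all
MachineTraceMetrics::Trace BlockTrace = MinInstr->getTrace(MBB);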

avt77 added inline comments.Feb 3 2017, 3:03 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
15331 ↗(On Diff #86042)

I did not do anything with knl, and as a result you are right: that code generation was lost. Do you know of other examples like that? If the problem is with knl only, then I could easily support it. But of course I could also add a special option like "-use-machine-combiner" or "-use-dag-combiner". Which is better?

Please separate out the compile-time improvements into a new patch for review - now that we have them, the "compile time question" shouldn't hold up this patch any longer, so we should be able to get this in as soon as the AVX512 (knl/skx) issue is answered.

As for AVX512, I am ok with excluding it specifically in DAGCombiner.

But I'm also concerned about other CPUs for which we don't have an accurate scheduler model (Broadwell, Skylake) - should these be excluded as well?

lib/Target/X86/X86InstrInfo.cpp
8897

Currently, on Haswell and newer CPUs, the generated sequence uses FMA instructions.
This should be taken into account here in the patterns, right?

test/CodeGen/X86/recip-fastmath.ll
3 ↗(On Diff #86947)

Please add RUN commands for specific CPUs (as in the other test, plus an AVX512 target).
Also commit the test changes and rebase your patch on them so we can see the output changes in these new runs.

I've added the target cpu specific tests to trunk (rL294128) to help us track AVX512 (SKX/KNL) perf

But I'm also concerned about other CPUs for which we don't have an accurate scheduler model (Broadwell, Skylake) - should these be excluded as well?

If the CPU is using an older scheduler model then it's already not necessarily being optimally scheduled - I don't see this patch being any different.

lib/Target/X86/X86InstrInfo.cpp
8897

Yes this should be taken into account, but it does mean that the number of codegen patterns is going to start to balloon if we're not careful.

Currently an older scheduler model "only" affects the schedule, but with this patch (and future ones using the machine combiner framework) it will affect which instructions we emit. This could possibly lead to generating worse code than we do at the moment.

mkuper added a subscriber: mkuper.Feb 6 2017, 12:58 AM

Recap:

This (and hopefully an equivalent sqrt/rsqrt future patch) is the stand-out example of a case where the actual result depends on the MachineCombiner and the scheduler model - minor code changes could cause it to switch between a full-precision divps and rcpps+NR. But this only happens with the necessary fast/unsafe flags enabled, which means the user knows what to expect.

For most other possible MC cases (e.g. constant rematerialization, shuffles, slow-vs-fast-path instructions) there will be differences in the instructions used but not in the final result.

avt77 updated this revision to Diff 88003.Feb 10 2017, 8:31 AM

I fixed all known issues:

  • AVX512 is now again supported by DAGCombiner
  • FMA instructions are being used when FMA is enabled

This version clearly shows the advantage of using the scheduler model: it selects the reciprocal code only when it's profitable (e.g. compare v8f32_one_step and v8f32_one_step_2_divs, etc.).

In addition, I removed the speedup-related code from this patch because it was included in the separate review D29627.

ABataev resigned from this revision.Feb 13 2017, 11:42 AM
avt77 added a comment.Feb 14 2017, 4:20 AM

Hi All,
Do we expect anything more here?
It seems I've addressed all the requirements. Maybe it's time for an LGTM?

avt77 updated this revision to Diff 89241.Feb 21 2017, 10:29 AM

Guy Blank found a problem with the PIC relocation model on the SLM architecture. I fixed it and added the corresponding test.

avt77 updated this revision to Diff 90329.Mar 2 2017, 6:53 AM

I committed the PIC-related test to trunk and updated this patch so it can be compared with the new code generation.

RKSimon added inline comments.Mar 9 2017, 11:27 AM
lib/CodeGen/MachineTraceMetrics.cpp
521 ↗(On Diff #90329)

Is this still true after D29627?

avt77 added inline comments.Mar 10 2017, 12:09 AM
lib/CodeGen/MachineTraceMetrics.cpp
521 ↗(On Diff #90329)

Yes, the profile shows that now (after D29627) this feature eats more time than anything else:

58,19%  llc      llc                   [.] llvm::MachineTraceMetrics::Ensemble::invalidate
 3,44%  llc      llc                   [.] (anonymous namespace)::TwoAddressInstructionPass::scanUses
 3,19%  llc      llc                   [.] llvm::ScheduleDAGSDNodes::ClusterNeighboringLoads
 1,59%  llc      llc                   [.] llvm::SparseMultiSet<llvm::VReg2SUnit, llvm::VirtReg2IndexFunctor, unsigned char>::find
 1,08%  llc      llc                   [.] llvm::X86InstrInfo::areLoadsFromSameBasePtr
RKSimon edited edge metadata.Nov 18 2017, 9:57 AM

@avt77 As we discussed offline, please can you strip out the debug changes and put them into a new patch?

RKSimon resigned from this revision.Feb 19 2019, 10:30 AM