This will have to sit unmerged until we're sure that the unary FNEG has not regressed from the binary pseudo-FNEG. But it's ready to go when we're satisfied. There were no surprises in this patch.
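For reference, here is what the two forms look like in IR (a minimal example; the function names are illustrative):

```llvm
; The binary pseudo-fneg idiom vs. the unary instruction this patch emits:
define float @neg_binary(float %x) {
  %r = fsub float -0.000000e+00, %x   ; old: subtract from -0.0
  ret float %r
}

define float @neg_unary(float %x) {
  %r = fneg float %x                  ; new: dedicated unary instruction
  ret float %r
}
```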
Details
- Reviewers
  spatel, kpn, arsenm, craig.topper, andrew.w.kaylor
- Commits
  rG20b8ed2c2b10: [IRBuilder] Update IRBuilder::CreateFNeg(...) to return a UnaryOperator
  rL374782: [IRBuilder] Update IRBuilder::CreateFNeg(...) to return a UnaryOperator
  rC374782: [IRBuilder] Update IRBuilder::CreateFNeg(...) to return a UnaryOperator
  rG47363a148f1d: [IRBuilder] Update IRBuilder::CreateFNeg(...) to return a UnaryOperator
  rL374240: [IRBuilder] Update IRBuilder::CreateFNeg(...) to return a UnaryOperator
  rC374240: [IRBuilder] Update IRBuilder::CreateFNeg(...) to return a UnaryOperator
Diff Detail
- Repository
- rL LLVM
Event Timeline
Ok, I think we're pretty close to having unary FNeg optimizations in line with binary FNeg optimizations. Is anyone aware of any obvious passes that I've missed?
Also I'm looking for some test-suite guidance. I've benchmarked the change in this patch and the results appear to be in the noise range (I think?):
```
<scrubbed> llvm-project/test-suite-build> ../test-suite/utils/compare.py --filter-short fneg1.json fneg2.json fneg3.json vs stock1.json stock2.json stock3.json
Warning: 'test-suite :: MicroBenchmarks/XRay/FDRMode/fdrmode-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/ReturnReference/retref-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/FDRMode/fdrmode-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/ReturnReference/retref-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/FDRMode/fdrmode-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/ReturnReference/retref-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/FDRMode/fdrmode-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/ReturnReference/retref-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/FDRMode/fdrmode-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/ReturnReference/retref-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/FDRMode/fdrmode-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/ReturnReference/retref-bench.test' has No metrics!
Tests: 546
Short Running: 187 (filtered out)
Remaining: 359
Metric: exec_time

Program                                        lhs     rhs    diff
test-suite...marks/CoyoteBench/lpbench.test    1.19    1.11   -6.5%
test-suite...marks/Misc/matmul_f64_4x4.test    0.68    0.64   -5.8%
test-suite...e/Benchmarks/McGill/chomp.test    0.67    0.63   -5.7%
test-suite...BENCHMARK_ORDERED_DITHER/128/2   60.65   63.84    5.3%
test-suite...HMARK_BICUBIC_INTERPOLATION/16   48.53   50.78    4.6%
test-suite...mbolics-flt/Symbolics-flt.test    0.76    0.73   -4.0%
test-suite...ks/Shootout/Shootout-hash.test    4.60    4.79    4.0%
test-suite...ce/Benchmarks/Olden/bh/bh.test    0.97    0.93   -3.5%
test-suite...oxyApps-C/miniGMG/miniGMG.test    0.65    0.63   -3.3%
test-suite...-flt/LinearDependence-flt.test    1.35    1.40    3.3%
test-suite...rolangs-C++/primes/primes.test    0.75    0.78    3.3%
test-suite...ncils/fdtd-apml/fdtd-apml.test    0.86    0.83   -3.2%
test-suite...s/Fhourstones/fhourstones.test    0.68    0.70    3.2%
test-suite.../Benchmarks/nbench/nbench.test    1.09    1.12    3.1%
test-suite...s/Misc/richards_benchmark.test    1.11    1.08   -3.0%
Geomean difference                                             nan%

              lhs            rhs           diff
count    356.000000     359.000000     356.000000
mean    2207.107388    2187.731742      -0.000983
std    17335.321906   17252.648405       0.011480
min        0.608100       0.602200      -0.065347
25%        1.948518       1.886300      -0.002897
50%        6.446969       5.985600      -0.000070
75%      111.044515     106.027453       0.001283
max   214183.043000  214138.878333       0.052571
```
I've compared assembly for the notable differences, and unless I've made a horrible mistake, there are no generated asm differences. I suspect that the >5% swings are from I/O noise on my shared machine. Any thoughts/insights on using test-suite?
Another thought...
This patch is a change to the main FNeg IRBuilder entry point only! There may be value in a patch which switches *all* the FNeg entry points at once. However, this single patch alone is quite large and I'm not sure if more diff noise is a good thing.
Clang does not use the IRBuilder fneg entry point. When do we plan to switch clang over?
Sorry for the delay, Craig. I missed this comment...
So Clang does use the IRBuilder fneg entry point (at least in some places). That's why all these Clang tests have differences:
```
M clang/test/CodeGen/aarch64-v8.2a-fp16-intrinsics.c (2 lines)
M clang/test/CodeGen/avx512f-builtins.c (56 lines)
M clang/test/CodeGen/avx512vl-builtins.c (24 lines)
M clang/test/CodeGen/complex-math.c (6 lines)
M clang/test/CodeGen/fma-builtins.c (8 lines)
M clang/test/CodeGen/fma4-builtins.c (10 lines)
```
There may be more places that need to be updated though, I'm not sure.
Sorry I wasn't very clear there. There are certainly places that use it, but the code for the unary minus operator does not use CreateFNeg. It asks for the constant value to use for negation and then makes an fsub or sub, depending on whether the type is FP or integer.
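To illustrate Craig's point, here is a minimal sketch of the IR that pattern produces for `-x` (illustrative function names, not Clang's actual codegen output):

```llvm
define float @neg_fp(float %x) {
  %r = fsub float -0.000000e+00, %x   ; FP case: subtract from -0.0
  ret float %r
}

define i32 @neg_int(i32 %x) {
  %r = sub i32 0, %x                  ; integer case: subtract from 0
  ret i32 %r
}
```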
Ah, ok. I probably should've dug deeper into the Clang code. That explains why this change didn't trigger many differences in the test-suite, and why only a select few Clang tests showed IR differences.
I'll check it out...
Thanks for pointing this out, Craig. I do see VisitUnaryMinus(...) now. [I'm glad that Clang distinguished between the unary and binary fneg -- that could've been trouble. ;)]
Kind of a larger question: Do we want to include the VisitUnaryMinus change into one mega-Diff? Or should we have several separate Diffs for each of the individual CreateFNeg(...) changes?
One mega-diff would probably be better for shaking out any performance problems. But it will also make the changeset much larger with test differences.
Never mind, we don't have a choice. Calling CreateFNeg(...) as-is would just create a binary FNeg.
Updated patch to generate unary FNeg in Clang.
New perf numbers:
```
llvm-project/test-suite-build> ../test-suite/utils/compare.py --filter-short stock1.json stock2.json stock3.json stock4.json stock5.json vs fneg1.json fneg2.json fneg3.json fneg4.json fneg5.json
Warning: 'test-suite :: MicroBenchmarks/XRay/FDRMode/fdrmode-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/ReturnReference/retref-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/FDRMode/fdrmode-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/ReturnReference/retref-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/FDRMode/fdrmode-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/ReturnReference/retref-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/FDRMode/fdrmode-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/ReturnReference/retref-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/FDRMode/fdrmode-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/ReturnReference/retref-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/FDRMode/fdrmode-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/ReturnReference/retref-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/FDRMode/fdrmode-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/ReturnReference/retref-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/FDRMode/fdrmode-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/ReturnReference/retref-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/FDRMode/fdrmode-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/ReturnReference/retref-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/FDRMode/fdrmode-bench.test' has No metrics!
Warning: 'test-suite :: MicroBenchmarks/XRay/ReturnReference/retref-bench.test' has No metrics!
Tests: 1151
Short Running: 799 (filtered out)
Remaining: 352
Metric: exec_time

Program                                        lhs     rhs    diff
test-suite...ProxyApps-C++/CLAMR/CLAMR.test    1.01    0.97   -4.3%
test-suite...late.test:BENCHMARK_DILATE/128  104.11  107.70    3.4%
test-suite.../Benchmarks/nbench/nbench.test    1.04    1.01   -3.2%
test-suite...s/MallocBench/cfrac/cfrac.test    0.63    0.61   -3.1%
test-suite.../Trimaran/enc-rc4/enc-rc4.test    0.65    0.66    3.0%
test-suite...e/Benchmarks/Misc/salsa20.test    3.95    3.85   -2.7%
test-suite...ut-C++/Shootout-C++-sieve.test    1.23    1.26    2.5%
test-suite...hootout/Shootout-heapsort.test    1.50    1.46   -2.4%
test-suite.../Applications/spiff/spiff.test    1.01    1.03    2.3%
test-suite...:BM_FIND_FIRST_MIN_LAMBDA/5001    2.88    2.82   -2.2%
test-suite...lications/sqlite3/sqlite3.test    1.50    1.54    2.1%
test-suite...-flt/LinearDependence-flt.test    1.38    1.35   -2.1%
test-suite...s/gramschmidt/gramschmidt.test    1.73    1.76    2.0%
test-suite...test:BM_PRESSURE_CALC_RAW/5001    9.40    9.59    2.0%
test-suite...ications/JM/lencod/lencod.test    2.89    2.94    1.8%
Geomean difference                                             nan%

              lhs            rhs           diff
count    350.000000     351.000000     349.000000
mean    1998.973248    1997.490274     -0.000602
std    16776.472077   16804.318859      0.007581
min        0.604400       0.602700     -0.043298
25%        1.761900       1.753700     -0.001977
50%        6.234300       5.989400     -0.000066
75%      104.102761     104.041111      0.000735
max   215062.448333  215082.720667      0.034431
```
I believe these results are in the noise range for my machine. I examined the assembly for the listed tests, and some others, but found no differences***.
*** There were some IR differences from Value name uniquing, but I believe that's to be expected. That is, there were fewer fsubs and new fnegs, and the missing fsubs appear to shift alloca names and such. E.g.:
```
5239c5239
< %i780 = alloca i32, align 4
---
> %i779 = alloca i32, align 4
```
Not sure if clang / C source has any impact on their targets, but subscribing @mcberg2017 @escha in case this could make a difference for out-of-tree (GPU) hardware.
Just putting a placeholder here: NegateValue in Reassociate.cpp needs to be updated to handle both the binary and unary contexts when examining the uses of the Value V. In fact, the full file should be looked over for other cases as well. I tried out the current patch, and we hit the first case pretty quickly internally.
@mcberg2017 Do you have a test you can share?
I'm aware of at least one place in Reassociate that *may* be making an invalid transform (correctness of the sign bit on a NaN). But I assumed that reassociation would only take place if the reassoc FMF was present, so I put it on the back burner for now.
Cameron, here is an example with the issue:
```llvm
; bin/opt -reassociate repro.ll -S
define float @test1(float %arg, float %arg1, float %arg2, float %arg3) {
  %tmp1 = fsub fast float %arg3, %arg2
  %tmp2 = fadd fast float %arg, -5.000000e+02
  %tmp3 = fsub fast float %tmp1, %tmp2
  %tmp4 = fmul fast float %tmp3, %tmp3
  %tmp5 = fneg fast float %arg
  %tmp6 = fdiv fast float %tmp5, %arg1
  %tmp7 = fadd fast float %tmp6, %tmp4
  ret float %tmp7
}
```
Given that we may still want the binary fneg generated in Reassociate.cpp, this may be enough then, plus what was done in D66612.
Very difficult to tell. Reassociate is tricky (for me at least). :/
I'm throwing some obvious unary FNeg tests at it and they are working fine, compared against the equivalent binary FNeg tests. So I'm feeling fairly confident in general.
I could check in the tests if desired, but they're more like fuzz tests than tests targeting specific features of Reassociate. It would be hard to prove that these new tests aren't duplicating coverage of some existing test.
Ping.
I've been digging through this pass and it seems to be ok AFAICT. OptimizeInst(...) canonicalizes both unary and binary FNegs to -1.0*X, if they are fast and part of a special multiply tree. Other FNegs end up as leaf nodes, so no problem there.
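A minimal illustration of that case, assuming the whole chain carries `fast` (an illustrative test, not one from the patch):

```llvm
; Within a fast multiply tree, Reassociate can treat either form of fneg
; as a multiply by -1.0 so it participates in reassociation.
define float @mul_tree(float %a, float %b, float %c) {
  %neg = fneg fast float %a       ; canonicalized internally to -1.0 * %a
  %m1 = fmul fast float %neg, %b
  %m2 = fmul fast float %m1, %c
  ret float %m2
}
```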
Anyone aware of other situations I should look at?
Do we need to enhance EarlyCSE to see this equivalence:
```llvm
define float @cse_fneg(float %x, i1 %cond) {
  %fneg_unary = fneg float %x
  %fneg_binary = fsub float -0.0, %x
  %r = select i1 %cond, float %fneg_unary, float %fneg_binary
  ret float %r
}
```
The binary fneg has the looser requirement for NaN propagation (IEEE-754 6.3: "this standard does not specify the sign bit of a NaN result [for math ops]"), but that's less important for optimization than knowing that the two values are otherwise equivalent.
On 2nd thought, that doesn't really make sense for CSE. The real question is whether we should canonicalize the binary fneg to unary fneg in instcombine. And should that happen before/after/concurrent with this change to clang?
Tough question...
- The biggest problem I see with canonicalizing binary FNeg to unary FNeg is when DAZ/FTZ are set. With those set, a binary FNeg can be used to zero out insignificant data, whereas a unary FNeg just flips the sign bit, regardless of whether the value is denormal (see the sketch after this list). We have a weather code that uses an explicit binary FNeg to sanitize its noisy input, so any canonicalization there would disturb the results.
Currently, if I'm not mistaken, Clang only enables DAZ/FTZ with -Ofast and -ffast-math, so no problem there. But there's nothing stopping a user from compiling with -O0 and flipping the DAZ/FTZ bits themselves.
Conversely, it would be easy to argue that setting DAZ/FTZ breaks IEEE-754 compliance, so it doesn't matter what we do from that point on. Although, this seems like a weird grey area to me.
- Another concern is whether LLVM intends to respect the hardware-specified sign of a NaN. That is, IEEE-754 does not specify the sign of a NaN result, but the hardware may. E.g. x86 always returns a -NaN, the QNaN floating-point indefinite.
It would be easy to argue that since IEEE-754 doesn't specify the sign, the compiler can do whatever it wants. IMO, we can do better and should let the hardware choose the NaN it wants.
- @arsenm also had a concern about source modifiers on AMD GPUs. I don't know much about those, so I'll leave that discussion to him...
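Here's the sketch of the DAZ concern from the first bullet. Assume %x holds a denormal and the DAZ bit has been set at run time (illustrative functions; the lowering comments describe typical behavior, not a guarantee):

```llvm
define float @flush_input(float %x) {
  %r = fsub float -0.000000e+00, %x   ; executes as a real FP subtract, so
  ret float %r                        ; DAZ treats %x as zero and %r is -0.0
}

define float @flip_sign(float %x) {
  %r = fneg float %x                  ; typically lowers to a sign-bit xor,
  ret float %r                        ; so the denormal magnitude survives
}
```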
Yeah, that sounds like we may get pushback from users depending on hardware target.
> - Another concern is whether LLVM intends to respect the hardware-specified sign of a NaN. That is, IEEE-754 does not specify the sign of a NaN result, but the hardware may. E.g. x86 always returns a -NaN, the QNaN floating-point indefinite.
> It would be easy to argue that since IEEE-754 doesn't specify the sign, the compiler can do whatever it wants. IMO, we can do better and should let the hardware choose the NaN it wants.
I think we would sacrifice target-specific / implementation-defined behavior if it meant code could be optimized better, but we'd want some evidence of that improvement before making a change.
> - @arsenm also had a concern about source modifiers on AMD GPUs. I don't know much about those, so I'll leave that discussion to him...
Ok. I have no objections/comments to this patch then. As long as clang is consistent in creating the unary fneg, the CSE concern is probably moot. LGTM, but give other reviewers another chance (1-2 days?) to comment before committing.
IIUC, the problem is for LLVM targets that are FTZ/DAZ all the time by hardware design, and they may not be in trunk. I don't know if it's explicitly stated anywhere, but we try to support those targets even though they are not IEEE-754 compliant.
For x86 - clang has this -ffast-math hack for Linux:
rL165240
See also:
https://bugs.llvm.org/show_bug.cgi?id=14024
With fast-math, anything goes, so we don't have to worry about that scenario.
And I think the case where a user changes MXCSR bits is UB for C/C++ ( https://bugs.llvm.org/show_bug.cgi?id=8100#c15 ). So x86 never has a problem in theory, but it might in practice because users believe that twiddling MXCSR bits is allowed?
clang/lib/CodeGen/CGExprScalar.cpp:2590 (On Diff #215255)
This if will always evaluate the false branch now, right? So you should be able to get rid of the if statement totally.
We will eventually have to support #pragma STDC FENV_ACCESS, so Richard's reasoning won't hold forever.
Someone could argue, though, that you should just use a constrained strict FSub if you care about DAZ/FTZ. That seems like a valid solution. But that is really treating DAZ/FTZ like a rounding mode, not underflow. It would be a heavy hammer to enforce all of the side-effect concerns when the user really only cares about DAZ/FTZ. In other words, we'd be losing significant performance in an attempt to gain significant performance.
Now that I've said that, I suppose someone could argue that DAZ/FTZ *is* a side-effect. :D
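For reference, a strict negation via the constrained intrinsics would look something like this (a hedged sketch; the metadata arguments and strictfp attribute follow the constrained-FP docs):

```llvm
declare float @llvm.experimental.constrained.fsub.f32(float, float, metadata, metadata)

define float @strict_neg(float %x) #0 {
  ; The rounding-mode and exception-behavior constraints keep this from
  ; being relaxed into a plain sign flip.
  %r = call float @llvm.experimental.constrained.fsub.f32(float -0.000000e+00, float %x, metadata !"round.dynamic", metadata !"fpexcept.strict")
  ret float %r
}

attributes #0 = { strictfp }
```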
Some data points...
Intel is correct at: '-O3 -fp-model fast=2'
MSVC is correct at: '-O3'
GCC is NOT correct at: '-O3'
GCC is correct at: '-O3 -ftrapping-math -fsignaling-nans'
For posterity's sake, @andrew.w.kaylor just suggested adding an 'nftz' fast-math flag:
http://lists.llvm.org/pipermail/llvm-dev/2019-September/135183.html
That would be great, since it would make it clear when the binary->unary FNeg transform is safe to do.
Sorry for the slow response. I've been busy with other projects.
Are there any reservations about merging the Clang unary FNeg change?
Sorry, but this commit broke OCaml tests: http://lab.llvm.org:8011/builders/clang-x86_64-debian-fast/builds/19014
I reverted it in r374354. Please test before re-landing.
> I reverted it in r374354. Please test before re-landing.
Hmm, how do I run those tests? I did not see that failure with check-all.
That's a pretty straightforward failure. It's just a one-line IR change:
```
- fsub float {{.*}}0{{.*}}, %F1
+ fneg float %F1
```
@gribozavr I see that you also reverted @RKSimon's commit for the OCaml/core.ml failure:
```
Author: gribozavr
Date: Thu Oct 10 07:16:58 2019
New Revision: 374357

URL: http://llvm.org/viewvc/llvm-project?rev=374357&view=rev
Log: Revert "Fix OCaml/core.ml fneg check"

This reverts commit r374346. It attempted to fix OCaml tests, but it does not actually fix them.

Modified:
    llvm/trunk/test/Bindings/OCaml/core.ml
```
That appears to be the proper fix. Do you see something wrong with it that I'm missing?
Recommitted this patch with the OCaml test fix. I was not able to find quick-start documentation for the OCaml bindings, so the OCaml failure itself has not been tested locally. That said, the failure mode seems extremely low risk. I will monitor the buildbots for problems...
Hi, it looks like this patch has caused a test failure under asan:
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-fast/builds/35922/steps/check-llvm%20asan/logs/stdio
Can you please take a look?
Both the ASan build:
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-fast/builds/35926
and the OCaml build:
http://lab.llvm.org:8011/builders/clang-x86_64-debian-fast/builds/19255
have completed successfully with the latest patches. Will continue to monitor the bots for further problems.
I apologize again for the noise...