Page MenuHomePhabricator

[flang] Rework host runtime folding and enable REAL(2) folding with it.
ClosedPublic

Authored by jeanPerier on Oct 7 2020, 9:59 AM.

Details

Summary
  • Rework the host runtime table so that it is constexpr to avoid having to construct it and to store/propagate it.
  • Make the interface simpler (remove many templates and a file)
  • Enable 16bits float folding using 32bits float host runtime
  • Move StaticMultimapView into its own header to use it for host folding

Diff Detail

Event Timeline

jeanPerier created this revision.Oct 7 2020, 9:59 AM
Herald added a project: Restricted Project. · View Herald Transcript
jeanPerier requested review of this revision.Oct 7 2020, 9:59 AM
lebedev.ri retitled this revision from Rework host runtime folding and enable REAL(2) folding with it. to [flang] Rework host runtime folding and enable REAL(2) folding with it..Oct 7 2020, 10:00 AM
klausler accepted this revision.Oct 8 2020, 10:00 AM

There's a lot of work here and it's quite well done; thank you.

I suggest that you add meinersbur as a reviewer to check that this code doesn't run afoul of some MSVC bug.

flang/include/flang/Common/static-multimap-view.h
43

The argument could be const std::array<V,N> &array.

48

This could be a constexpr check in the constructor.

51

We use GetRange naming in Common.

Could GetRange be private?

86

C++20 has constexpr versions of std::equal_range, std::lower_bound, std::binary_search, &c.

If you could use them now, would this new class be necessary?

If we will eventually use them, maybe what's needed here would be just a place-holding constexpr function or two, possibly conditional on the C++ language standard version.

flang/include/flang/Evaluate/common.h
11–12

Does removing this member from the folding context make them cheap to construct again?

flang/lib/Evaluate/fold-implementation.h
88

return std::nullopt; is a little more clear, I think.

flang/lib/Evaluate/intrinsics-library.cpp
142

Please capitalize the names of these functions; thanks.

flang/lib/Lower/IntrinsicCall.cpp
507–508

I know that this isn't part of this change, but maybe this argument should be a const std::array<> &.

flang/unittests/Evaluate/folding.cpp
55–56

I wonder if this constant should be checked; I worry about C++ compilation on a target machine that likes to flush subnormals to zero. Maybe a hexadecimal floating point constant should be used, or even a reinterpreted 0x00400000.

(The value should be 5.877471754111438E-39 if you want a double constant that converts to and back from the desired value; the value you have above is slightly too large and converts to a double with a lower bit set that's lost in the conversion to float in most rounding modes, but why take chances.)

This revision is now accepted and ready to land.Oct 8 2020, 10:00 AM
PeteSteinfeld accepted this revision.Oct 9 2020, 8:30 AM

Thanks for working on this, Jean. I have to confess that I don't understand most of the code. But it all built and tested correctly for me. It would be nice to have some constant folding tests that fold REAL(2) operands.

Where do things stand on 80 bit and 128 bit REALs?

  • Simplify StaticMultimapView (use std::equal_range and removed unused member functions)
  • Add REAL(2) and REAL(3) folding test
  • Enable X87 float folding with IEEE 128 bits float runtime if available on the host.
  • Use RealKindForPrecision instead of TypeOf
  • Add a compile time verfication that StaticMultimapView is sorted
  • Capitalize some functions. std::nullopt instead of {}.
  • Use raw subnormal in flushing test instead of decimal input.
jeanPerier marked 2 inline comments as done.

Remove static_assert from StaticMultimapView

The last update added a compile time assert that the array
is sorted before constructing a StaticMultimapView. It worked well with clang
but was actually most likely illegal C++ and failed with g++ since there is
no way to guarantee that the content of the array passed in template argument
is constexpr (only its address is guaranteed to be so). Clang++ seems to be
clever enough to propagate the constexpr aspect of the array address it gets
in template argument, but this is not standard C++.
Instead Verify() has to be called next to every instantiation of the map
where it is visible to the compiler that the map was built at compile time...

Also avoid the early return since constexpr functions should only have one return.

Thanks for the reviews, I have updated the patch where I could.

flang/include/flang/Common/static-multimap-view.h
43

I do not know a way to do that without having the tables declared as std::array which I find annoying because you cannot nicely delegate the array length to the compiler (e.g. T x[]{...} vs std::array<T, /* need to compute/ guess that */> x{...}).

48

Thanks, I tried hard to make that happen in the ctor/a static Create helper, but there is no way I could find to have it constexpr check that way since the constexpr aspect does not actually guarantee that this will be called with a constexpr array at compile time, so static_assert are rejected.
Instead I added checks after each instantiation. That's still better than nothing and actually caught a bessel_y0/y1 not sorted correctly in pgmath header.

51

GetRange was removed.

86

What is mainly missing for me here is a constexpr std::is_storted (also C++20), as well as something like std::span (C++20) that (from what I understand) can "erase" the lenght from the base container type and still allow constexpr iteration on it.
GetRange used to be called in constexpr in lowering to get the runtime but that is not the case anymore, so I already switched to std::equal_range and remove some other member functions that removed ~20 lines of boilerplate.

flang/unittests/Evaluate/folding.cpp
55–56

Thanks for the tip, I used the raw evaluate::Real constructor to avoid any issue with C++ compilers here.

jeanPerier marked 5 inline comments as done.Oct 13 2020, 8:54 AM

Where do things stand on 80 bit and 128 bit REALs?

I have updated the patch to allow using IEEE 128bits to fold 80 bits (X87) floats when IEEE 128 bits is available but not X87.
However 80/128 bits folding that require host runtime is currently not expected to work on all host at all (hence I did not add tests yet).
The folding code is designed to take advantage of when 80/128bits can be safely folded using long double without assuming too much about the platform. It is also designed to allow platform where 80/128bits cannot be safely folded to still compile/run flang normally and only refuse to compile programs that require such host folding. More precisely:

  • 80 bits folding works if the host long double is the X87 or IEEE 128bits float (for this last case we would convert 80bits to 128bits before folding).
  • 128 bits folding works if the host long double is the IEEE 128 bits.
  • When the host long double type does not allow folding 80bits/128bits, Fold() will return the original expressions, and a warning message (non fatal) saying " intrinsic acos(real(kind=16)) cannot be folded on host". If the expressions must be a constant, compilation should then failed complaining it did not get a constant after Fold() calls. Otherwise, compilation will continue normally (leaving the expression unfolded since we do not have to, but warning the user that we could not fold it due to host limitation).

Hosts that needs to fold 80bits/128bits will need to modify host.h to map some non standard types (e.g. __float128) in host.h provided the c++ standard math library works with them (or they will need to provide an ad-hoc runtime with it in intrinsics-library.cpp).

I tested that 80 bits folding works OK on X86-64 linux. I expect 80bits and 128bits folding to work on Aarch64, but did not test it. On Power8/9, I think that depends on how the OS is compiled there. By default long double maps to IBM double double format, but the OS/libc/gcc... can also be compiled to map long double to IEEE 128bits in which case folding 80bits and 128bits should work. As you see, all this is very dependent on the platform.

Note that this only applies to folding intrinsics for which there is no abstract implementation in include/flang/Evaluate/real.h and for which we have to resort to host runtime, otherwise all REALs can be folded regardless of the platform.

@Meinersbur, I have added you as a reviewer since this patch is touching a lot of template/constexpr features so it is likely it will hit MSVC errors. If you have the opportunity to test it ahead that would be great, otherwise beware that it may lead to MSVC regressions.

The patch compiles successfully with msvc (with a patch to trunk that I still need to upload a patch for).

The patch compiles successfully with msvc (with a patch to trunk that I still need to upload a patch for).

Thanks for testing this @Meinersbur !

flang/include/flang/Evaluate/common.h
11–12

Yes, FoldingContext are 100 times cheaper to construct according to my measurements. This improves fcvs f18 -fparse-only time by on 12% on average.

FoldingContext ctor 100x speedup
With f18 compiled with gcc 8.3 in release mode on an Intel Xeon Gold 6148, I measured 0.05ms per FoldingContext construction before vs 0.00005ms with this patch (average of 10000 ctor calls in one run. I reproduced runs 10times and got stable results). Measurement were done by instrumenting the code (https://github.com/jeanPerier/llvm-project/commit/f511284b54805aa314c1316f9143d0d0cbaa522d).

Given FoldingContext are constructed for every function call check when an explicit interface that can translate in x4 speed-up one time f18 -fparse-only on carefully designed tests like:

 real, parameter :: x = 0.5

 ! Each following line semantic analysis end-up in 3 FoldingContext ctor call 
 real, parameter :: y1 = acos(x)
 real, parameter :: y2 = acos(x)
 ! ... repeated 9997 times 
 real, parameter :: y10000 = acos(x)
end

I measured 2s before vs 0.5s with this patch (time f18 -fparse-only real time).

Host folding 1.2x slowdown
However, there is a 20% time penalty with this patch per fold with host runtime (most likely due to the added encapsulation/decapsulation of Scalar to/from Expr<SomeExpr> in the folder). I measured the time spent in Evaluate/fold-real.cpp FoldIntrinsicFunction on the test file above. We spent 1.3usec per fold before vs 1.6usec with this patch (average of the 10000 folds, repeated 10 times). Given a for this is at the usec level, it is negligible on scalar fold since we create 3 FoldingContext per expressions. For array expressions, that can lead to overall slowdown in the compilation (that will never be bigger than 20%). For instance I could measure a 1% overall slow-down in a program folding acos( a_10000_element_array) (93ms before vs 94 now).

Conclusion: 12% overall parsing+semantics speed-up on real code
Regarding fcvs time f18 -fparse-only fm*.f real time went from 4.3s to 3.8s (ten run average). So this has a visible impact on real code.
Since scalar folding is much more widespread than huge array folding, the patch seems a win to me.

This revision was landed with ongoing or failed builds.Oct 14 2020, 7:41 AM
This revision was automatically updated to reflect the committed changes.