Page MenuHomePhabricator

[libcxx/variant] Introduce `switch`-based mechanism for `std::visit`.
ClosedPublic

Authored by mpark on Aug 6 2020, 3:31 AM.

Details

Summary

This patch introduces mechanism for std::visit backed by switch.
The switch is structured such that it's a flattened manual vtable (an n-ary array).
The switch mechanism is enabled if (1 * ... * vs.size()) < 256.

The following are performance numbers from the benchmarks added in D85419, tested on my 2017 Macbook Pro.

Before: using only the manual vtable mechanism.

$ ./projects/libcxx/benchmarks/variant_visit_1.libcxx.out
2020-08-09 23:55:14
Running ./projects/libcxx/benchmarks/variant_visit_1.libcxx.out
Run on (8 X 3100 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 8388K (x1)
Load Average: 2.03, 2.36, 2.43
------------------------------------------------------------
Benchmark                 Time             CPU   Iterations
------------------------------------------------------------
BM_Visit<1, 1>        0.260 ns        0.260 ns   1000000000
BM_Visit<1, 2>         1.56 ns         1.56 ns    435925220
BM_Visit<1, 3>         1.55 ns         1.55 ns    444416228
BM_Visit<1, 4>         1.57 ns         1.57 ns    427951336
BM_Visit<1, 5>         1.57 ns         1.56 ns    444766371
BM_Visit<1, 6>         1.70 ns         1.68 ns    446639358
BM_Visit<1, 7>         1.64 ns         1.64 ns    400441630
BM_Visit<1, 8>         1.56 ns         1.56 ns    430729471
BM_Visit<1, 9>         1.58 ns         1.58 ns    449894596
BM_Visit<1, 10>        1.54 ns         1.54 ns    449660506
BM_Visit<1, 20>        1.56 ns         1.56 ns    450813074
BM_Visit<1, 30>        1.59 ns         1.59 ns    440032940
BM_Visit<1, 40>        1.59 ns         1.59 ns    443731656
BM_Visit<1, 50>        1.56 ns         1.56 ns    444709859
BM_Visit<1, 60>        1.59 ns         1.58 ns    439527320
BM_Visit<1, 70>        1.57 ns         1.57 ns    438450890
BM_Visit<1, 80>        1.58 ns         1.58 ns    443001525
BM_Visit<1, 90>        1.63 ns         1.62 ns    448456349
BM_Visit<1, 100>       1.57 ns         1.57 ns    445740630

$ ./projects/libcxx/benchmarks/variant_visit_2.libcxx.out
2020-08-09 23:59:35
Running ./projects/libcxx/benchmarks/variant_visit_2.libcxx.out
Run on (8 X 3100 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 8388K (x1)
Load Average: 1.40, 1.94, 2.22
-----------------------------------------------------------
Benchmark                Time             CPU   Iterations
-----------------------------------------------------------
BM_Visit<2, 1>       0.261 ns        0.260 ns   1000000000
BM_Visit<2, 2>        1.55 ns         1.54 ns    432844219
BM_Visit<2, 3>        1.30 ns         1.30 ns    532529974
BM_Visit<2, 4>        1.54 ns         1.54 ns    446055910
BM_Visit<2, 5>        1.31 ns         1.31 ns    531099680
BM_Visit<2, 6>        1.56 ns         1.56 ns    443203475
BM_Visit<2, 7>        1.29 ns         1.29 ns    526478087
BM_Visit<2, 8>        1.56 ns         1.56 ns    439000834
BM_Visit<2, 9>        1.30 ns         1.30 ns    528756817
BM_Visit<2, 10>       1.56 ns         1.55 ns    442923039
BM_Visit<2, 20>       1.35 ns         1.35 ns    517021072
BM_Visit<2, 30>       1.60 ns         1.59 ns    419724661
BM_Visit<2, 40>       1.45 ns         1.44 ns    472137163
BM_Visit<2, 50>       1.65 ns         1.65 ns    421389743

$ ./projects/libcxx/benchmarks/variant_visit_3.libcxx.out
2020-08-10 00:01:32
Running ./projects/libcxx/benchmarks/variant_visit_3.libcxx.out
Run on (8 X 3100 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 8388K (x1)
Load Average: 2.20, 2.01, 2.21
-----------------------------------------------------------
Benchmark                Time             CPU   Iterations
-----------------------------------------------------------
BM_Visit<3, 1>       0.272 ns        0.271 ns   1000000000
BM_Visit<3, 2>        1.87 ns         1.86 ns    361858090
BM_Visit<3, 3>        1.77 ns         1.77 ns    391192579
BM_Visit<3, 4>        1.84 ns         1.84 ns    374694223
BM_Visit<3, 5>        1.75 ns         1.75 ns    408270392
BM_Visit<3, 6>        1.88 ns         1.88 ns    378759185
BM_Visit<3, 7>        1.79 ns         1.79 ns    395498102
BM_Visit<3, 8>        1.85 ns         1.85 ns    371660366
BM_Visit<3, 9>        1.80 ns         1.80 ns    386872851
BM_Visit<3, 10>       1.84 ns         1.84 ns    362367606
BM_Visit<3, 15>       1.77 ns         1.77 ns    392060220
BM_Visit<3, 20>       1.85 ns         1.85 ns    379157188

After: using the switch-based mechanism.

$ ./projects/libcxx/benchmarks/variant_visit_1.libcxx.out
2020-08-12 04:17:33
Running ./projects/libcxx/benchmarks/variant_visit_1.libcxx.out
Run on (8 X 3100 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 8388K (x1)
Load Average: 1.96, 2.26, 3.49
------------------------------------------------------------
Benchmark                 Time             CPU   Iterations
------------------------------------------------------------
BM_Visit<1, 1>        0.280 ns        0.279 ns   1000000000
BM_Visit<1, 2>        0.285 ns        0.282 ns   1000000000
BM_Visit<1, 3>        0.281 ns        0.279 ns   1000000000
BM_Visit<1, 4>        0.278 ns        0.277 ns   1000000000
BM_Visit<1, 5>        0.283 ns        0.282 ns   1000000000
BM_Visit<1, 6>        0.279 ns        0.278 ns   1000000000
BM_Visit<1, 7>        0.279 ns        0.278 ns   1000000000
BM_Visit<1, 8>        0.273 ns        0.272 ns   1000000000
BM_Visit<1, 9>        0.275 ns        0.274 ns   1000000000
BM_Visit<1, 10>       0.274 ns        0.273 ns   1000000000
BM_Visit<1, 20>       0.277 ns        0.276 ns   1000000000
BM_Visit<1, 30>       0.278 ns        0.277 ns   1000000000
BM_Visit<1, 40>       0.269 ns        0.268 ns   1000000000
BM_Visit<1, 50>       0.271 ns        0.271 ns   1000000000
BM_Visit<1, 60>       0.272 ns        0.270 ns   1000000000
BM_Visit<1, 70>       0.275 ns        0.274 ns   1000000000
BM_Visit<1, 80>       0.278 ns        0.276 ns   1000000000
BM_Visit<1, 90>       0.272 ns        0.271 ns   1000000000
BM_Visit<1, 100>      0.269 ns        0.268 ns   1000000000

$ ./projects/libcxx/benchmarks/variant_visit_2.libcxx.out
2020-08-12 04:09:59
Running ./projects/libcxx/benchmarks/variant_visit_2.libcxx.out
Run on (8 X 3100 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 8388K (x1)
Load Average: 2.17, 4.20, 4.78
-----------------------------------------------------------
Benchmark                Time             CPU   Iterations
-----------------------------------------------------------
BM_Visit<2, 1>       0.302 ns        0.301 ns   1000000000
BM_Visit<2, 2>       0.297 ns        0.295 ns   1000000000
BM_Visit<2, 3>       0.353 ns        0.351 ns   1000000000
BM_Visit<2, 4>       0.276 ns        0.276 ns   1000000000
BM_Visit<2, 5>       0.285 ns        0.283 ns   1000000000
BM_Visit<2, 6>       0.290 ns        0.287 ns   1000000000
BM_Visit<2, 7>       0.282 ns        0.280 ns   1000000000
BM_Visit<2, 8>       0.290 ns        0.287 ns   1000000000
BM_Visit<2, 9>       0.291 ns        0.285 ns   1000000000
BM_Visit<2, 10>      0.293 ns        0.287 ns   1000000000
BM_Visit<2, 20>       1.70 ns         1.68 ns    391400375
BM_Visit<2, 30>       1.64 ns         1.63 ns    418925874
BM_Visit<2, 40>       1.63 ns         1.62 ns    423623677
BM_Visit<2, 50>       1.68 ns         1.67 ns    411687212

$ ./projects/libcxx/benchmarks/variant_visit_3.libcxx.out
2020-08-12 04:10:43
Running ./projects/libcxx/benchmarks/variant_visit_3.libcxx.out
Run on (8 X 3100 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 8388K (x1)
Load Average: 1.57, 3.76, 4.59
-----------------------------------------------------------
Benchmark                Time             CPU   Iterations
-----------------------------------------------------------
BM_Visit<3, 1>       0.271 ns        0.270 ns   1000000000
BM_Visit<3, 2>       0.344 ns        0.334 ns   1000000000
BM_Visit<3, 3>       0.347 ns        0.336 ns   1000000000
BM_Visit<3, 4>       0.300 ns        0.296 ns   1000000000
BM_Visit<3, 5>       0.290 ns        0.286 ns   1000000000
BM_Visit<3, 6>       0.272 ns        0.271 ns   1000000000
BM_Visit<3, 7>        1.72 ns         1.71 ns    415765841
BM_Visit<3, 8>        1.73 ns         1.72 ns    408909555
BM_Visit<3, 9>        2.16 ns         2.04 ns    380898485
BM_Visit<3, 10>       2.45 ns         2.40 ns    295714256
BM_Visit<3, 15>       1.92 ns         1.85 ns    375990332
BM_Visit<3, 20>       1.66 ns         1.65 ns    414456233

Diff Detail

Event Timeline

mpark created this revision.Aug 6 2020, 3:31 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 6 2020, 3:31 AM
Herald added a reviewer: Restricted Project. · View Herald Transcript
mpark requested review of this revision.Aug 6 2020, 3:31 AM
xbolva00 added inline comments.
libcxx/include/variant
472

_LIBCPP_VARIANT_DISPATCH(__index);

?

mpark edited the summary of this revision. (Show Details)Aug 6 2020, 4:38 AM
mpark edited the summary of this revision. (Show Details)
mpark added inline comments.Aug 6 2020, 4:40 AM
libcxx/include/variant
472

No, we need it to be a compile-time value.

mpark planned changes to this revision.Aug 7 2020, 5:22 AM
mpark updated this revision to Diff 284259.Aug 9 2020, 11:50 PM

Used a different switch-based approach.

mpark edited the summary of this revision. (Show Details)Aug 10 2020, 12:07 AM
mpark edited the summary of this revision. (Show Details)
mpark updated this revision to Diff 284275.Aug 10 2020, 1:16 AM

Minor cleanup update.

mpark updated this revision to Diff 284280.Aug 10 2020, 1:34 AM

Use _LIBCPP_CONCAT.

ldionne requested changes to this revision.Aug 11 2020, 9:23 AM

Geez, that's some nice speedup for smaller variants! Do you have an explanation for why the code is so much better?

libcxx/include/variant
495

I don't think the ability to disable the switch implementation is useful -- it'll just lead to an unused customization point. Or do you have a use case for it? That could provide interesting information about this patch.

libcxx/test/libcxx/utilities/variant/variant.visit/visit_without_switch.pass.cpp
2 ↗(On Diff #284280)

Is there any difference between this file and libcxx/test/std/utilities/variant/variant.visit/visit.pass.cpp beyond the #define _LIBCPP_DISABLE_SWITCH_VISIT_IN_VARIANT bit? If not, we should structure the tests to avoid such duplication IMO.

This revision now requires changes to proceed.Aug 11 2020, 9:23 AM

The improvements are from enabling compilers to inline the invocation by using a switch.
Compilers currently can't seem to inline through function pointers even if it's constexpr.

libcxx/include/variant
495

The only real use case is fully testing the manual vtable approach. Perhaps we can do that through a different mechanism?

libcxx/test/libcxx/utilities/variant/variant.visit/visit_without_switch.pass.cpp
2 ↗(On Diff #284280)

No, there's no difference. I was wondering if there's something I could do with ADDITONAL_COMPILE_FLAGS or something but I couldn't find anything. Do you have a suggestion?

mpark marked an inline comment as done.Aug 11 2020, 1:08 PM

The improvements are from enabling compilers to inline the invocation by using a switch.
Compilers currently can't seem to inline through function pointers even if it's constexpr.

Interesting.

libcxx/include/variant
495

I think if we test the overall std::visit with all sizes of variants, then we'll be exercising both the switch and the manual vtable implementations to the extent they are used. I believe that should be sufficient, and it would also remove the duplication in the tests IIUC. What do you think about that?

mpark added inline comments.Aug 11 2020, 1:54 PM
libcxx/include/variant
495

Yep, fine with me.

mpark updated this revision to Diff 285033.Aug 12 2020, 4:12 AM

Addressed ldionne's comments.

mpark edited the summary of this revision. (Show Details)Aug 12 2020, 4:18 AM
mpark edited the summary of this revision. (Show Details)
ldionne added inline comments.Aug 12 2020, 9:09 AM
libcxx/test/std/utilities/variant/variant.visit/visit.pass.cpp
347

Do we have tests with sufficiently large variants?

mpark added inline comments.Aug 12 2020, 9:27 AM
libcxx/test/std/utilities/variant/variant.visit/visit.pass.cpp
347

Yeah, I adjusted the # of switch cases from 1024 to 256 (see the adjusted numbers). This test needs 5^4 = 625 cases, so it uses the vtable approach.

mpark updated this revision to Diff 285726.Aug 14 2020, 12:19 PM

Add inline _LIBCPP_INLINE_VISIBILITY to several functions.

mpark updated this revision to Diff 285730.Aug 14 2020, 12:24 PM

s/std::integral_constant/integral_constant/

ldionne accepted this revision.Aug 14 2020, 12:30 PM
This revision is now accepted and ready to land.Aug 14 2020, 12:30 PM
This revision was landed with ongoing or failed builds.Aug 14 2020, 12:55 PM
This revision was automatically updated to reflect the committed changes.

I think this change is breaking one of our uses right now:

include/c++/v1/variant:599:16: error: call to deleted constructor of 'std::__u::basic_ostream<char>'
        return __invoke_constexpr(
               ^~~~~~~~~~~~~~~~~~~
include/c++/v1/variant:618:59: note: in instantiation of function template specialization 'std::__u::__variant_detail::__visitation::__base::__visit_alt_impl(index_sequence<0UL>, std::__u::__variant_detail::__visitation::__variant::__value_visitor<YYY &> &&, const std::__u::__variant_detail::__impl<XXX, ZZZ> &)::(anonymous class)::operator()<std::__u::integral_constant<unsigned long, 1>>' requested here
        _LIBCPP_VARIANT_CASES(_LIBCPP_VARIANT_SWITCH_MAX, _LIBCPP_VARIANT_CASE)
                                                          ^
include/c++/v1/variant:586:14: note: in instantiation of function template specialization 'std::__u::__variant_detail::__visitation::__base::__visit_alt_impl<0, std::__u::__variant_detail::__visitation::__variant::__value_visitor<YYY &>, const std::__u::__variant_detail::__impl<XXX, ZZZ> &>' requested here
      return __visit_alt_impl(index_sequence_for<_Vs...>{},
             ^
include/c++/v1/variant:645:20: note: in instantiation of function template specialization 'std::__u::__variant_detail::__visitation::__base::__visit_alt<, std::__u::__variant_detail::__visitation::__variant::__value_visitor<YYY &>, const std::__u::__variant_detail::__impl<XXX, ZZZ> &>' requested here
    return __base::__visit_alt(_VSTD::forward<_Vis>(__vis),
                   ^
include/c++/v1/variant:662:12: note: in instantiation of function template specialization 'std::__u::__variant_detail::__visitation::__variant::__visit_alt<std::__u::__variant_detail::__visitation::__variant::__value_visitor<YYY &>, const std::__u::variant<XXX, ZZZ> &>' requested here
    return __visit_alt(__make_value_visitor(_VSTD::forward<_Vis>(__vis)),
           ^
include/c++/v1/variant:1685:21: note: in instantiation of function template specialization 'std::__u::__variant_detail::__visitation::__variant::__visit_value<YYY &, const std::__u::variant<XXX, ZZZ> &>' requested here
  return __variant::__visit_value(_VSTD::forward<_Vis>(__vis),
                    ^

I don't know if this trace is enough to figure is there is an issue upstream?

Hi Mehdi,

I've submitted https://reviews.llvm.org/D86006 and I believe that will be
the correct fix for this issue.
It was likely incorrectly returning the stream by value as a return type
coming out of visit.

dblaikie added inline comments.
libcxx/include/variant
877

This surprises me a bit - doesn't result need to be mangled (as __result, for instance) as in other places, like __make_vtable_impl?

ldionne added inline comments.Sep 1 2020, 8:01 AM
libcxx/include/variant
877