This is an archive of the discontinued LLVM Phabricator instance.

[SimplifyCFG] Common code sinking: drop profitability check in presence of conditional predecessors (PR30244)
Needs Review, Public

Authored by lebedev.ri on Apr 29 2021, 7:18 AM.

Details

Summary

This profitability check was added back in rL281160, as a means to fix PR30244.

PR30244, in essence, talks about the fact that this extra sinking is causing
register allocation issues, for example in tsan. I've retried the two-stage
reproduction approach specified there, and i can not reproduce said issue.

A lot of time has passed since then. Perhaps this is no longer a problem?
Does anyone's benchmark still say this is bad?

Event Timeline

lebedev.ri created this revision. Apr 29 2021, 7:18 AM
lebedev.ri requested review of this revision. Apr 29 2021, 7:18 AM
fhahn added a comment. Apr 30 2021, 7:30 AM

Does anyone's benchmark still say this is still bad?

I guess that's hard to say. Are there cases where the heuristic hurts optimizations currently? Or asked differently, what's the motivation to remove the heuristic?

lebedev.ri added a comment. Edited Apr 30 2021, 9:14 AM

My main problem with it is that it's pretty arbitrary, and seems pretty dependent on the ordering.
For example, given something like this (pay no special attention to the IR, i just came up with it):

<...>
dispatch:
 switch to %bb0, %bb1, %bb2, %bb3

bb0:
  call foo()
  br %end

bb1:
  call foo()
  br %end

bb2:
  call foo()
  br %end

bb3:
  <...>
  br %dispatch

We'd happily sink foo().
But if the pattern is e.g.

<...>

dispatch:
 switch to %bb0, %bb1, %bb2, %bb3

bb0:
  call foo()
  br %end

bb1:
  call foo()
  br %end

bb2:
  call foo()
  br %end

bb3:
  br %bb4, %bb5

bb4:
  call foo()
  br %end

bb5:
  call foo()
  br %whatever

while we could still happily sink foo() into %end, if we hoist foo() into %bb3 first, we'd end up with

<...>

dispatch:
 switch to %bb0, %bb1, %bb2, %bb3

bb0:
  call foo()
  br %end

bb1:
  call foo()
  br %end

bb2:
  call foo()
  br %end

bb3:
  call foo()
  br %bb4, %bb5

bb4:
  br %end

bb5:
  br %whatever

and after folding away empty blocks we end up with

<...>

dispatch:
 switch to %bb0, %bb1, %bb2, %bb3

bb0:
  call foo()
  br %end

bb1:
  call foo()
  br %end

bb2:
  call foo()
  br %end

bb3:
  call foo()
  br %end, %whatever

and oops, %end has a conditional predecessor.
We visit blocks in the order they appear in the function,
not in reverse post-order or anything like that,
so whether we'd sink here or not depends on whether we first encounter %end or %bb3.
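
To make the "conditional predecessor" condition concrete, here is a minimal Python sketch over a toy CFG. It is not LLVM code: the dictionary encoding and the helper name `conditional_predecessors` are made up for illustration, and the two graphs are simply the first and the last pseudo-IR snippets above.

```python
# Toy CFG model: block name -> list of successor block names.
# The encoding and the helper name are hypothetical, not an LLVM API.

def conditional_predecessors(cfg, block):
    """Return the predecessors of `block` that have more than one successor."""
    return sorted(
        pred for pred, succs in cfg.items()
        if block in succs and len(succs) > 1
    )

# First snippet: %bb0..%bb2 branch unconditionally to %end.
first = {
    "dispatch": ["bb0", "bb1", "bb2", "bb3"],
    "bb0": ["end"],
    "bb1": ["end"],
    "bb2": ["end"],
    "bb3": ["dispatch"],
}

# Last snippet: after hoisting and folding, %bb3 branches to %end or %whatever.
folded = {
    "dispatch": ["bb0", "bb1", "bb2", "bb3"],
    "bb0": ["end"],
    "bb1": ["end"],
    "bb2": ["end"],
    "bb3": ["end", "whatever"],
}

print(conditional_predecessors(first, "end"))   # []
print(conditional_predecessors(folded, "end"))  # ['bb3']
```

In the folded graph %bb3 is the only predecessor of %end that is not an unconditional branch, which is exactly the situation the profitability check reacts to.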

More generally, i'm somewhat interested in making this sinking even more aggressive,
and having the seemingly pretty arbitrary and unstable cut-off makes it seem a non-starter.

@dvyukov Hi! Would it please be possible for you to redo the two-stage reproduction process
you stated in https://bugs.llvm.org/show_bug.cgi?id=30244
to double-check that i'm not simply doing it wrong, and the issue is/isn't there still?

I trust you to re-do the check. It's not that I can do it better or would trust myself 5 years later :)
I see you added some unit tests, so that should be good enough.

Oh well, i was hoping for a sanity check :S
I've already tried it twice locally, and as far as i can tell, this doesn't cause any similar regression.

So how should we proceed here?
Speculatively land and see if some other perf regression is reported against this change?

I tried to figure out if compiler-rt/lib/tsan/check_analyze.sh is still executed as part of testing, I remember some discussions on removing it.
But I didn't find where it's executed, nor where it was removed. But I see that e.g. the check on the number of pops was removed over time. So I am not sure we can still rely on it.

I was mainly talking about https://bugs.llvm.org/show_bug.cgi?id=30244#c1, the reproducible run-time slowdown.

Oh, I see, did not read that far.
Generally nobody looks at this level of detail, so for anything that does not lead to loud runtime crashes, it can well take 10 years for somebody to detect and report a new bug. Relying on that is not a sound strategy :)

I don't have time to redo that test right now. But again I trust your results ;)

xbolva00 added a subscriber: xbolva00. Edited May 6 2021, 11:30 AM

Can you collect some performance stats?

  • testsuite/spec benchmarks/etc?

Speculatively land and see if some other perf regression is reported against this change?

Try to collect some data first

please add more test cases to cover the expected new sinking cases enabled with this change.

Thank you for stating your opinion.
As you can read in my previous comments, i have obviously already tried that locally.
test-suite is just hopelessly noisy, does that match your experience?
On the benchmark this does affect, the results are bizarre to me,
because there are *no* IR or ASM level changes on the codepath being run,
yet there's a ~10% run-time regression. Unless i'm slowly (rapidly) losing it,
i can only attribute it to some weird code alignment issue
caused by the changes this did cause to other code.

please add more test cases to cover the expected new sinking cases enabled with this change.

The test covers it quite well: if we would only be sinking already-speculatable instructions,
and not all predecessors are unconditional, we will now actually sink them.
So could you please be more specific about which tests you would like to see?
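
As a rough illustration of that behaviour change, here is a toy Python model of the decision as described in this thread. It is only a sketch, not the SimplifyCFG implementation; the function name, its parameters, and the boolean encoding of "speculatable" are all hypothetical.

```python
# Toy model of the sinking decision described in this thread.
# Not the SimplifyCFG code; all names here are hypothetical.

def would_sink(speculatable_flags, all_preds_unconditional, keep_check):
    """Decide whether a run of common instructions gets sunk.

    speculatable_flags: one bool per candidate instruction,
        True if that instruction is already safe to speculate.
    all_preds_unconditional: True if every predecessor ends in an
        unconditional branch.
    keep_check: True models the behaviour before this patch,
        False models the behaviour with the check dropped.
    """
    if not speculatable_flags:
        return False  # nothing to sink
    if all_preds_unconditional:
        return True   # the check only matters with conditional predecessors
    if not keep_check:
        return True   # patch: no profitability cut-off
    # Old rule of thumb: require at least one non-speculatable instruction.
    return any(not flag for flag in speculatable_flags)


# Only speculatable instructions, one conditional predecessor:
print(would_sink([True, True], False, keep_check=True))   # False
print(would_sink([True, True], False, keep_check=False))  # True
```

The only case where the two modes differ is the one named above: every candidate is speculatable and at least one predecessor is conditional.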

This comment was removed by xbolva00.

All benchmarks are noisy.
I can only recommend the gentleman's set: disabling CPU autoscaling, killing all background processes, pinning the test process to a single core, and taking the minimum run time of multiple runs (which is not affected by random slowdowns).
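
The "minimum of multiple runs" part can be sketched in a few lines of Python. This is a generic illustration, not a tool used in this review; `min_runtime` and its parameters are hypothetical names, and CPU pinning and frequency scaling still have to be handled outside the script (for example with `taskset` on Linux).

```python
import time

# Hypothetical helper: time a callable several times and keep the fastest run.
def min_runtime(fn, repetitions=9):
    """Run fn() `repetitions` times; return the fastest wall-clock time in seconds."""
    best = float("inf")
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)  # random slowdowns only ever add time
    return best


if __name__ == "__main__":
    # Stand-in workload; replace with the code under test.
    print(min_runtime(lambda: sum(range(100_000))))
```

Taking the minimum rather than the mean is a deliberate choice here: interference from the rest of the system can only make a run slower, so the fastest run is the one least affected by it.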

I've seen this effect a lot. You are not losing your mind :)
But I've seen it within +/-3% on Intel CPUs. But maybe that's combined with the noise you mentioned before.

On this benchmark it is very much not noise:

raw.pixls.us-unique$ /repositories/googlebenchmark/tools/compare.py -a benchmarks ~/rawspeed/build-{old,new}/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_repetitions=9 --benchmark_min_time=1 Nikon/*/*.NEF
RUNNING: /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_repetitions=9 --benchmark_min_time=1 Nikon/D5600/2018-01-20_01-14_0792.NEF Nikon/D5600/2018-01-20_01-14_0793.NEF Nikon/D7200/DSC_0977.NEF Nikon/D7200/DSC_0978.NEF Nikon/D7200/DSC_0979.NEF Nikon/D7200/DSC_0982.NEF --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmpx4wr1xdn
2021-05-07T10:40:14+03:00
Running /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench
Run on (32 X 3599.67 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.43, 0.85, 0.97
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                               Time             CPU   Iterations  CPUTime,s CPUTime/WallTime     Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:32/process_time/real_time_mean          186 ms          186 ms            9   0.186167         0.999979   24.1603M        129.78M        129.777M      5.37163       5.37151    0.18617
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:32/process_time/real_time_median        186 ms          186 ms            9    0.18565          0.99998   24.1603M       130.139M        130.136M      5.38649       5.38639   0.185653
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:32/process_time/real_time_stddev      0.812 ms        0.812 ms            9   812.273u          5.0213u          0       565.779k        565.897k    0.0234177     0.0234226   812.477u
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:32/process_time/real_time_mean          223 ms          223 ms            9   0.223024         0.999975   24.1603M        108.33M        108.327M      4.48382        4.4837    0.22303
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:32/process_time/real_time_median        223 ms          223 ms            9   0.223016         0.999973   24.1603M       108.334M        108.332M      4.48398       4.48391    0.22302
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:32/process_time/real_time_stddev      0.124 ms        0.124 ms            9   124.412u         8.38225u          0       60.4342k        60.4371k     2.50139m      2.50151m   124.425u
Nikon/D7200/DSC_0977.NEF/threads:32/process_time/real_time_mean                       111 ms          111 ms            9    0.11103          0.99997   24.1603M         217.6M        217.594M      9.00654       9.00627   0.111034
Nikon/D7200/DSC_0977.NEF/threads:32/process_time/real_time_median                     111 ms          111 ms            9   0.110997         0.999969   24.1603M       217.666M        217.662M      9.00924       9.00908   0.110999
Nikon/D7200/DSC_0977.NEF/threads:32/process_time/real_time_stddev                   0.062 ms        0.062 ms            9   61.5634u         9.58593u          0       120.589k        121.033k     4.99121m       5.0096m   61.7937u
Nikon/D7200/DSC_0978.NEF/threads:32/process_time/real_time_mean                       107 ms          107 ms            9   0.107127         0.999963   24.1603M       225.529M         225.52M       9.3347       9.33436   0.107131
Nikon/D7200/DSC_0978.NEF/threads:32/process_time/real_time_median                     107 ms          107 ms            9   0.107126         0.999962   24.1603M       225.531M        225.526M      9.33478       9.33459   0.107128
Nikon/D7200/DSC_0978.NEF/threads:32/process_time/real_time_stddev                   0.061 ms        0.061 ms            9   60.9325u         12.2348u          0       128.233k        129.429k     5.30762m      5.35709m   61.5059u
Nikon/D7200/DSC_0979.NEF/threads:32/process_time/real_time_mean                       100 ms          100 ms            9   0.100101         0.999976   24.1603M        241.36M        241.354M      9.98995       9.98971   0.100103
Nikon/D7200/DSC_0979.NEF/threads:32/process_time/real_time_median                     100 ms          100 ms            9     0.1001         0.999976   24.1603M       241.361M        241.355M         9.99       9.98974   0.100103
Nikon/D7200/DSC_0979.NEF/threads:32/process_time/real_time_stddev                   0.028 ms        0.028 ms            9   28.4825u         5.57164u          0       68.6567k        67.9053k     2.84172m      2.81062m   28.1718u
Nikon/D7200/DSC_0982.NEF/threads:32/process_time/real_time_mean                       101 ms          101 ms            9   0.101409         0.999973   24.1603M       238.246M         238.24M      9.86107       9.86081   0.101412
Nikon/D7200/DSC_0982.NEF/threads:32/process_time/real_time_median                     101 ms          101 ms            9   0.101419         0.999972   24.1603M       238.221M        238.217M      9.86005       9.85985   0.101421
Nikon/D7200/DSC_0982.NEF/threads:32/process_time/real_time_stddev                   0.063 ms        0.063 ms            9   62.7106u         6.03352u          0       147.393k        147.329k     6.10066m        6.098m   62.6865u
RUNNING: /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench --benchmark_counters_tabular=true --benchmark_repetitions=9 --benchmark_min_time=1 Nikon/D5600/2018-01-20_01-14_0792.NEF Nikon/D5600/2018-01-20_01-14_0793.NEF Nikon/D7200/DSC_0977.NEF Nikon/D7200/DSC_0978.NEF Nikon/D7200/DSC_0979.NEF Nikon/D7200/DSC_0982.NEF --benchmark_display_aggregates_only=true --benchmark_out=/tmp/tmp663j0gzi
2021-05-07T10:41:31+03:00
Running /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
Run on (32 X 3598.7 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x16)
  L1 Instruction 32 KiB (x16)
  L2 Unified 512 KiB (x16)
  L3 Unified 32768 KiB (x2)
Load Average: 0.92, 0.92, 0.99
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                               Time             CPU   Iterations  CPUTime,s CPUTime/WallTime     Pixels Pixels/CPUTime Pixels/WallTime Raws/CPUTime Raws/WallTime WallTime,s
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:32/process_time/real_time_mean          191 ms          191 ms            9   0.190784         0.999971   24.1603M       126.639M        126.635M      5.24161       5.24145    0.19079
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:32/process_time/real_time_median        190 ms          190 ms            9   0.190332         0.999971   24.1603M       126.937M        126.934M      5.25397       5.25385   0.190336
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:32/process_time/real_time_stddev      0.810 ms        0.810 ms            9   809.897u         7.13539u          0       537.175k        537.475k    0.0222338     0.0222462   810.397u
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:32/process_time/real_time_mean          227 ms          227 ms            9   0.227068         0.999965   24.1603M       106.401M        106.397M      4.40397       4.40382   0.227076
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:32/process_time/real_time_median        227 ms          227 ms            9   0.226982          0.99997   24.1603M       106.441M        106.437M      4.40563       4.40545   0.226991
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:32/process_time/real_time_stddev      0.339 ms        0.337 ms            9   336.924u         13.8542u          0       157.443k        158.348k     6.51661m      6.55405m   338.886u
Nikon/D7200/DSC_0977.NEF/threads:32/process_time/real_time_mean                       116 ms          116 ms            9   0.116157         0.999965   24.1603M       207.998M         207.99M      8.60908       8.60878   0.116161
Nikon/D7200/DSC_0977.NEF/threads:32/process_time/real_time_median                     116 ms          116 ms            9   0.116116          0.99997   24.1603M       208.069M        208.063M      8.61205       8.61179    0.11612
Nikon/D7200/DSC_0977.NEF/threads:32/process_time/real_time_stddev                   0.155 ms        0.155 ms            9    155.32u         13.0087u          0       277.711k         277.79k    0.0114945     0.0114978   155.374u
Nikon/D7200/DSC_0978.NEF/threads:32/process_time/real_time_mean                       114 ms          114 ms            9   0.113982         0.999978   24.1603M       211.965M        211.961M       8.7733       8.77311   0.113985
Nikon/D7200/DSC_0978.NEF/threads:32/process_time/real_time_median                     114 ms          114 ms            9   0.114006         0.999977   24.1603M        211.92M        211.915M      8.77144       8.77123   0.114009
Nikon/D7200/DSC_0978.NEF/threads:32/process_time/real_time_stddev                   0.050 ms        0.050 ms            9   50.2011u         4.69043u          0       93.3813k        93.0933k     3.86508m      3.85316m   50.0482u
Nikon/D7200/DSC_0979.NEF/threads:32/process_time/real_time_mean                       110 ms          110 ms            9   0.110074         0.999979   24.1603M        219.49M        219.486M      9.08477       9.08458   0.110077
Nikon/D7200/DSC_0979.NEF/threads:32/process_time/real_time_median                     110 ms          110 ms            9   0.110077         0.999978   24.1603M       219.484M        219.481M      9.08452       9.08438   0.110079
Nikon/D7200/DSC_0979.NEF/threads:32/process_time/real_time_stddev                   0.016 ms        0.015 ms            9   15.4706u         4.40786u          0       30.8509k        31.0139k     1.27693m      1.28367m   15.5531u
Nikon/D7200/DSC_0982.NEF/threads:32/process_time/real_time_mean                       111 ms          111 ms            9   0.110613         0.999974   24.1603M       218.422M        218.416M      9.04053        9.0403   0.110616
Nikon/D7200/DSC_0982.NEF/threads:32/process_time/real_time_median                     111 ms          111 ms            9   0.110607         0.999976   24.1603M       218.433M        218.429M      9.04103       9.04083   0.110609
Nikon/D7200/DSC_0982.NEF/threads:32/process_time/real_time_stddev                   0.022 ms        0.022 ms            9   22.0003u         5.90713u          0       43.4417k        43.2301k     1.79807m      1.78931m   21.8942u
Comparing /home/lebedevri/rawspeed/build-old/src/utilities/rsbench/rsbench to /home/lebedevri/rawspeed/build-new/src/utilities/rsbench/rsbench
Benchmark                                                                                        Time             CPU      Time Old      Time New       CPU Old       CPU New
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:32/process_time/real_time_pvalue                 0.0004          0.0004      U Test, Repetitions: 9 vs 9
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:32/process_time/real_time_mean                  +0.0248         +0.0248           186           191           186           191
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:32/process_time/real_time_median                +0.0252         +0.0252           186           190           186           190
Nikon/D5600/2018-01-20_01-14_0792.NEF/threads:32/process_time/real_time_stddev                -0.0025         -0.0029             1             1             1             1
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:32/process_time/real_time_pvalue                 0.0004          0.0004      U Test, Repetitions: 9 vs 9
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:32/process_time/real_time_mean                  +0.0181         +0.0181           223           227           223           227
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:32/process_time/real_time_median                +0.0178         +0.0178           223           227           223           227
Nikon/D5600/2018-01-20_01-14_0793.NEF/threads:32/process_time/real_time_stddev                +1.7234         +1.7085             0             0             0             0
Nikon/D7200/DSC_0977.NEF/threads:32/process_time/real_time_pvalue                              0.0004          0.0004      U Test, Repetitions: 9 vs 9
Nikon/D7200/DSC_0977.NEF/threads:32/process_time/real_time_mean                               +0.0462         +0.0462           111           116           111           116
Nikon/D7200/DSC_0977.NEF/threads:32/process_time/real_time_median                             +0.0461         +0.0461           111           116           111           116
Nikon/D7200/DSC_0977.NEF/threads:32/process_time/real_time_stddev                             +1.5145         +1.5236             0             0             0             0
Nikon/D7200/DSC_0978.NEF/threads:32/process_time/real_time_pvalue                              0.0004          0.0004      U Test, Repetitions: 9 vs 9
Nikon/D7200/DSC_0978.NEF/threads:32/process_time/real_time_mean                               +0.0640         +0.0640           107           114           107           114
Nikon/D7200/DSC_0978.NEF/threads:32/process_time/real_time_median                             +0.0642         +0.0642           107           114           107           114
Nikon/D7200/DSC_0978.NEF/threads:32/process_time/real_time_stddev                             -0.1859         -0.1763             0             0             0             0
Nikon/D7200/DSC_0979.NEF/threads:32/process_time/real_time_pvalue                              0.0004          0.0004      U Test, Repetitions: 9 vs 9
Nikon/D7200/DSC_0979.NEF/threads:32/process_time/real_time_mean                               +0.0996         +0.0996           100           110           100           110
Nikon/D7200/DSC_0979.NEF/threads:32/process_time/real_time_median                             +0.0997         +0.0997           100           110           100           110
Nikon/D7200/DSC_0979.NEF/threads:32/process_time/real_time_stddev                             -0.4470         -0.4562             0             0             0             0
Nikon/D7200/DSC_0982.NEF/threads:32/process_time/real_time_pvalue                              0.0004          0.0004      U Test, Repetitions: 9 vs 9
Nikon/D7200/DSC_0982.NEF/threads:32/process_time/real_time_mean                               +0.0908         +0.0908           101           111           101           111
Nikon/D7200/DSC_0982.NEF/threads:32/process_time/real_time_median                             +0.0906         +0.0906           101           111           101           111
Nikon/D7200/DSC_0982.NEF/threads:32/process_time/real_time_stddev                             -0.6509         -0.6493             0             0             0             0

i.e. the U test confirms that these are actual changes, and RSD is <=~1% as-is.

That is why i'm asking if someone else's benchmarks regress, like actually regress, not this weird knock-on effect.
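
For readers who want to check the arithmetic, here is a small Python sketch that recomputes the relative change and the relative standard deviation (RSD) for one file from the CPUTime,s column above. The helper names are hypothetical and this is plain arithmetic, not how compare.py computes its output.

```python
# Hypothetical helpers; plain arithmetic on numbers copied from the log above.

def relative_change(old, new):
    """Fractional change from old to new, e.g. 0.10 means +10%."""
    return (new - old) / old

def relative_stddev(stddev, mean):
    """Relative standard deviation (stddev as a fraction of the mean)."""
    return stddev / mean

# Nikon/D7200/DSC_0979.NEF, CPUTime,s column (mean and stddev rows).
old_mean, old_stddev = 0.100101, 28.4825e-6
new_mean, new_stddev = 0.110074, 15.4706e-6

change = relative_change(old_mean, new_mean)
print(f"relative change: {change:+.4f}")
print(f"RSD old: {relative_stddev(old_stddev, old_mean):.4%}")
print(f"RSD new: {relative_stddev(new_stddev, new_mean):.4%}")
```

The relative change comes out at roughly +0.0996, matching the comparison table, and both RSD values are far below 1%, so the shift between the old and new builds is much larger than the run-to-run spread.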
