Operations that can possibly be vectorized are extended to the next power-of-2 number of elements with UndefValues, which allows regular vector operations to be used.
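
The padding idea can be illustrated with a minimal standalone sketch. It is not the code from this patch and does not use the LLVM API: `Lane`, `nextPowerOf2` and `padToPowerOf2` are hypothetical names, and `std::nullopt` merely stands in for an UndefValue lane.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for an IR value: std::nullopt plays the role of an
// UndefValue lane. This is an illustration only, not the LLVM API.
using Lane = std::optional<int>;

// Returns the smallest power of 2 that is >= n (n must be > 0).
std::size_t nextPowerOf2(std::size_t n) {
  assert(n > 0 && "expected at least one scalar");
  std::size_t p = 1;
  while (p < n)
    p <<= 1;
  return p;
}

// Pads a bundle of scalars with undef lanes up to the next power of 2,
// so that the bundle has a regular vector width.
std::vector<Lane> padToPowerOf2(const std::vector<int> &scalars) {
  std::vector<Lane> lanes(scalars.begin(), scalars.end());
  lanes.resize(nextPowerOf2(scalars.size()), std::nullopt);
  return lanes;
}
```

With this sketch, a bundle of 3 scalars becomes a 4-lane bundle whose last lane is undef, while a bundle of 4 scalars is left unchanged.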
For SPEC CPU2017 the patch gives ~7% perf gain for 526.blender_r (AVX512,
O3+LTO, -march=native), ~2% gain for 538.imagick_r and 638.imagick_s,
~2% gain for 525.x264_r and 625.x264_s, ~2% gain for 526.blender_r
(AVX2, O3+LTO, -march=native), ~11% gain for 526.blender_r, ~3% gain for
544.nab_r and 644.nab_s (AVX512, O3+LTO), ~3% gain
for 526.blender_r, ~2% gain for 544.nab_r and 644.nab_s (AVX2, O3+LTO).
Compile and link times are roughly the same:
AVX512, O3+LTO, -march=native
Metric: compile_time
Geomean difference -0.1% (-1.85 sec)
Metric: link_time
Geomean difference +1.2% (+14.46 sec)
AVX512, O3+LTO
Metric: compile_time
Geomean difference -0.2% (-4.71 sec)
Metric: link_time
Geomean difference -3.6% (-54.53 sec)
AVX2, O3+LTO, -march=native
Metric: compile_time
Geomean difference +0.3% (+10.56 sec)
Metric: link_time
Geomean difference -0.1% (-2.18 sec)
AVX2, O3+LTO
Metric: compile_time
Geomean difference +0.2% (+5.73 sec)
Metric: link_time
Geomean difference -3.4% (-67.45 sec)
Should this be a separate patch?