Use an O(N log N) sorting approach instead of the O(N^2) one (N <= 32), and
do not try to revectorize every possible combination of stores when they
definitely cannot be combined because of memory or data dependencies.
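
The sort-then-scan idea can be illustrated with a minimal standalone sketch.
This is not the vectorizer code from the patch: StoreInfo, findConsecutiveRuns
and the ElemSize parameter are hypothetical names, and stores are modeled only
by a byte offset from a shared base pointer plus their program-order index.
Dependency checks are left out entirely.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a store instruction (assumption, not an LLVM type).
struct StoreInfo {
  std::int64_t Offset; // byte offset from the shared base pointer
  unsigned Index;      // position of the store in program order
};

// Groups stores into runs of consecutive addresses in O(N log N): one sort by
// offset followed by a single linear scan. A pairwise comparison of all
// stores would need O(N^2) checks to find the same runs.
std::vector<std::vector<unsigned>>
findConsecutiveRuns(std::vector<StoreInfo> Stores, std::int64_t ElemSize) {
  assert(ElemSize > 0 && "element size must be positive");
  std::sort(Stores.begin(), Stores.end(),
            [](const StoreInfo &A, const StoreInfo &B) {
              if (A.Offset != B.Offset)
                return A.Offset < B.Offset;
              return A.Index < B.Index; // deterministic order for equal offsets
            });

  std::vector<std::vector<unsigned>> Runs;
  for (std::size_t I = 0; I < Stores.size(); ++I) {
    // A store extends the current run only if it directly follows the
    // previous store in memory; otherwise it starts a new run.
    bool Extends =
        I > 0 && Stores[I].Offset == Stores[I - 1].Offset + ElemSize;
    if (!Extends)
      Runs.emplace_back();
    Runs.back().push_back(Stores[I].Index);
  }
  return Runs;
}
```

Each returned run lists the program-order indices of stores that are adjacent
in memory. Stores that end up in different runs are never consecutive, so no
combination across runs needs to be tried.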
Compile time (O3 + lto, skylake_avx512):
External/SPEC/CINT2006/483.xalancbmk/483.xalancbmk.test 117.15 120.11 2.5%
External/SPEC/CINT2017speed/623.xalancbmk_s/623.xalancbmk_s.test 203.67 207.42 1.8%
External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 232.43 235.01 1.1%
External/SPEC/CINT2017rate/523.xalancbmk_r/523.xalancbmk_r.test 205.49 207.25 0.9%
External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test 310.46 306.23 -1.4%
Link time (O3 + lto, skylake_avx512):
External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 1383.69 1475.94 6.7%
Other changes are too small to be reliable.
size..text:
Program results results0 diff
test-suite :: SingleSource/Regression/C/Regression-C-sumarray.test 392.00 1439.00 267.1%
test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test 394258.00 394818.00 0.1%
test-suite :: MultiSource/Applications/JM/lencod/lencod.test 846355.00 847075.00 0.1%
test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test 782816.00 783360.00 0.1%
test-suite :: External/SPEC/CFP2017rate/508.namd_r/508.namd_r.test 779667.00 779923.00 0.0%
test-suite :: MultiSource/Benchmarks/mafft/pairlocalalign.test 224398.00 224446.00 0.0%
test-suite :: MultiSource/Applications/oggenc/oggenc.test 185019.00 185035.00 0.0%
test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12487610.00 12488010.00 0.0%
test-suite :: MultiSource/Benchmarks/7zip/7zip-benchmark.test 1051772.00 1051804.00 0.0%
test-suite :: MultiSource/Applications/SPASS/SPASS.test 529586.00 529602.00 0.0%
test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test 1084684.00 1084716.00 0.0%
test-suite :: MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4.test 1014245.00 1014261.00 0.0%
test-suite :: MultiSource/Benchmarks/MallocBench/espresso/espresso.test 223494.00 223478.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test 660843.00 660795.00 -0.0%
test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test 660843.00 660795.00 -0.0%
test-suite :: MultiSource/Applications/ClamAV/clamscan.test 568824.00 568760.00 -0.0%
espresso - 2 more stores vectorized
x264 - a small number of changes in 3-4 functions; a few more vector stores
are generated (two 4x zeroinitializer stores + some other small variations).
clamscan - emitted 32xi8 store instead of several scalar stores + several 4x-8x stores.
Please add a description of what is stored in the set.