lowerShuffleAsUNPCKAndPermute requires each shuffle mask element to be
in the same lane in both the input and output vectors. This prevents it from
matching certain patterns, for example the one in GitHub issue 61964.
Removing the lane requirement fixes the issue.
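
To make the lane requirement concrete, here is a minimal standalone sketch of what "same lane in input and output" means for a shuffle mask. It is not the code in the X86 backend: the helper name `crossesLanes` is hypothetical, and it assumes 128-bit lanes and the shufflevector mask convention (values in [0, N) select from the first input, values in [N, 2N) from the second, negative values are undef).

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper for illustration only; not LLVM's implementation.
// Returns true if any mask element reads from a different 128-bit lane
// than the lane it is written to.
bool crossesLanes(const std::vector<int> &Mask, unsigned EltSizeInBits) {
  assert(EltSizeInBits > 0 && 128 % EltSizeInBits == 0 &&
         "element size must divide the 128-bit lane width");
  const int NumElts = static_cast<int>(Mask.size());
  const int EltsPerLane = 128 / static_cast<int>(EltSizeInBits);
  for (int I = 0; I < NumElts; ++I) {
    int M = Mask[I];
    if (M < 0)
      continue; // undef element: imposes no constraint
    // Reduce the index modulo NumElts so that both inputs are treated alike.
    int SrcLane = (M % NumElts) / EltsPerLane;
    int DstLane = I / EltsPerLane;
    if (SrcLane != DstLane)
      return true;
  }
  return false;
}
```

For a v8i32 shuffle, the mask {0, 8, 1, 9, 4, 12, 5, 13} stays within lanes, while {0, 8, 4, 12, 1, 9, 5, 13} does not, because output element 2 (lane 0) reads input element 4 (lane 1). Under the old requirement, a mask of the second kind would be rejected outright.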
The change I'm targeting is in the test llvm/test/CodeGen/X86/pr61964.ll, where
the codegen improves notably with this patch. Elsewhere, it looks like some
broadcast instructions are replaced with unpck and perm. To check for
any other performance change, I ran the llvm-test-suite benchmarks from the
SingleSource, MultiSource, and MicroBenchmarks directories:

    Tests: 2665
    Short Running: 2009 (filtered out)
    Same hash: 140 (filtered out)
    In Blacklist: 513 (filtered out)
    Remaining: 3
    Metric: exec_time

    Program                                                                exec_time
                                                                           lhs   rhs   diff
    test-suite :: MultiSource/Benchmarks/ASCI_Purple/SMG2000/smg2000.test  1.64  1.64  0.1%
    test-suite :: SingleSource/Benchmarks/Adobe-C++/loop_unroll.test       1.06  1.06  0.0%
    test-suite :: MultiSource/Applications/JM/lencod/lencod.test           5.25  5.25  0.0%
    Geomean difference                                                     nan   nan   0.0%

           exec_time
    l/r    lhs       rhs       diff
    count  3.000000  3.000000  3.000000
    mean   2.648300  2.649100  0.000462
    std    2.269035  2.268849  0.000415
    min    1.055500  1.055900  0.000095
    25%    1.349300  1.350250  0.000237
    50%    1.643100  1.644600  0.000379
    75%    3.444700  3.445700  0.000646
    max    5.246300  5.246800  0.000913

The patch only affects three cases, and the result is neutral. (The 513
blacklisted benchmarks are the ones under MicroBenchmarks, for which
--filter-hash does not work; I manually verified that their code did not
change.)
I'm not certain this will work properly for 512-bit vectors?