This is an archive of the discontinued LLVM Phabricator instance.

[SLP][X86] Add 32-bit vector stores to help vectorization opportunities
ClosedPublic

Authored by RKSimon on Jun 12 2022, 10:31 AM.

Details

Summary

Building on the work in D124284, this patch tags v4i8 and v2i16 vector stores as custom, enabling SLP to try to vectorize these types ending in a partial store (using the SSE MOVD instruction). We already do something similar for 64-bit vector types.
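
For illustration only, here is a minimal sketch of the kind of source pattern this is aimed at: four adjacent i8 results written to consecutive bytes, which together form a 32-bit-wide group of stores. The function name add4_u8 is hypothetical and is not taken from the patch or its tests; whether SLP actually vectorizes it depends on the cost model and the target features.

```cpp
#include <cstdint>

// Hypothetical example (not from the patch): compute four i8 sums and
// write them to four consecutive bytes. The four scalar stores cover
// 32 bits in total, i.e. the width of a v4i8 vector store.
void add4_u8(std::uint8_t *dst, const std::uint8_t *a, const std::uint8_t *b) {
  // Load and add first, so the stores below are adjacent and independent.
  const std::uint8_t r0 = static_cast<std::uint8_t>(a[0] + b[0]);
  const std::uint8_t r1 = static_cast<std::uint8_t>(a[1] + b[1]);
  const std::uint8_t r2 = static_cast<std::uint8_t>(a[2] + b[2]);
  const std::uint8_t r3 = static_cast<std::uint8_t>(a[3] + b[3]);
  dst[0] = r0;
  dst[1] = r1;
  dst[2] = r2;
  dst[3] = r3;
}
```

A v2i16 analogue would be two adjacent 16-bit results written to consecutive elements.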

I haven't had time to properly test these (my last testing was with D103925 which attempted something similar), so if anyone has a working test suite instance to hand that'd be very useful!

Event Timeline

RKSimon created this revision. Jun 12 2022, 10:31 AM
Herald added a project: Restricted Project. Jun 12 2022, 10:31 AM
RKSimon requested review of this revision. Jun 12 2022, 10:31 AM
dtemirbulatov added a comment (edited). Jun 13 2022, 9:08 AM

I haven't had time to properly test these (my last testing was with D103925 which attempted something similar), so if anyone has a working test suite instance to hand that'd be very useful!

I have one test-suite instance on an i7-6700HQ with the frequency fixed at 2.6GHz. The following LNT flags were used: "--threads 1 --use-perf=all --cflags '-O3 -mavx2'"
The columns are: 1) Test name 2) Hash of the binary before 3) Hash of the binary after 4) Run-time before 5) Run-time after 6) Cycles before 7) Cycles after

MultiSource/Applications/ALAC/decode/alacconvert-decode 9ea454d8e30a2ae8b90b6083a69455b2 1c522b3ecdb16b62e8031344f0b43a6e from 1.8080 to 1.8010 cycles 46 877 690 to 46 714 665
MultiSource/Applications/ALAC/encode/alacconvert-encode 9ea454d8e30a2ae8b90b6083a69455b2 1c522b3ecdb16b62e8031344f0b43a6e from 3.3780 to 3.3810 cycles 87 608 023 to 87 683 603
MultiSource/Applications/ClamAV/clamscan 1e4ff4bbe17058784446880fb1cbdbc3 6317ae08724b50b102cec3820aec1504 from 12.5110 to 12.5020 cycles 324 175 926 to 323 787 667
MultiSource/Applications/JM/ldecod/ldecod eb714519f2cbc75429558b56fc2526ba ff629bdfd346f7d43f156ed36b51c23e from 5.3510 to 5.3560 cycles 138 764 852 to 138 902 747
MultiSource/Applications/JM/lencod/lencod 271b2504c6fed9507b50267549d4913c 7df1e610fee456e9b6f21203b64bc1f6 from 0.0040 to 0.0040 cycles 10 687 719 213 to 11 102 500 468
MultiSource/Applications/sqlite3/sqlite3 01e54145dfcbb5fcaf59a7b633cd1eb1 8c89b8ab52017c5f7be8fd5a74424d2e from 0.0020 to 0.0020 cycles 6 083 405 311 to 6 086 694 618
MultiSource/Benchmarks/7zip/7zip-benchmark c22a0b720a5d3ae828b967865ed7a5c8 b9be55c3b6675c269fb6a38f7e1d984d from 0.0080 to 0.0080 cycles 22 662 222 432 to 22 768 754 448
MultiSource/Benchmarks/Bullet/bullet 725b4974cd846b7b902e8f0fe602935a 2c46c0aded63f473610748bfec1ef5cd from 0.0030 to 0.0030 cycles 9 137 473 246 to 9 075 514 873
MultiSource/Benchmarks/DOE-ProxyApps-C++/CLAMR/CLAMR 3c995d9e1ed43a9b9343f0bf6cd5d006 3cc8fb2fd96528b9b5fffccb74bdb10e from 0.0010 to 0.0010 cycles 3 395 491 185 to 3 398 163 861
MultiSource/Benchmarks/MallocBench/gs/gs f8bc7dc74490b2b7da6dc4db4219bdc8 d838ca23ddf4b83cebc775d336bd77df from 3.0370 to 2.9790 cycles 78 655 771 to 77 251 300
MultiSource/Benchmarks/MiBench/consumer-jpeg/consumer-jpeg 97aaf98dbf91379e22c223c997d65b18 ec84ca9a186c2343d393c46b834c1df9 from 0.4150 to 0.4160 cycles 10 764 699 to 10 782 001
MultiSource/Benchmarks/MiBench/consumer-typeset/consumer-typeset b686e79fa11d74e19c1e25fc8b1b657d bff9b7302b95e2404e0a4d03776cc004 from 10.7630 to 10.7590 cycles 278 964 258 to 278 907 150
MultiSource/Benchmarks/MiBench/telecomm-gsm/telecomm-gsm 5b5eb8b3110e7e44faba9a39886f7eb9 3576c209ee3a3d7b5e61078ffe8bff8a from 7.5110 to 7.5190 cycles 194 825 194 to 195 024 904
MultiSource/Benchmarks/PAQ8p/paq8p a05ee284cc2aafc353525716ead34a46 271e3657c4254b44130ce1084ea3c31b from 0.0200 to 0.0210 cycles 54 418 140 186 to 54 629 177 762
MultiSource/Benchmarks/Prolangs-C/agrep/agrep 499c7b076644e0f05f509b8c9284ebbb cd4e690f68fb6f1de9180c712e6a1fcf from 0.3180 to 0.3160 cycles 8 222 873 to 8 192 390
MultiSource/Benchmarks/Prolangs-C/assembler/assembler 91ad4757272564f23c7ae5ccc82cf2cb e6ad7cda9bb5205bb3c16fb0c37eff9b from 0.1050 to 0.1050 cycles 2 727 287 to 2 709 236
MultiSource/Benchmarks/mediabench/g721/g721encode/encode 561896ca63930f7d23008d877670acb3 cd5779008318f623938dfa79c927b9d9 from 4.0160 to 3.9910 cycles 104 151 746 to 103 508 623
MultiSource/Benchmarks/mediabench/gsm/toast/toast 3882c68e752c5e4cd5d2ed73124603cb 3bb4cb4aa6ea339948820afbdd63ee66 from 1.0230 to 1.0220 cycles 26 530 366 to 26 480 722
MultiSource/Benchmarks/nbench/nbench 123c00a5e35f543e947426940c3b0f4b 12332b6a280ca58ea23af17223380627 from 0.0010 to 0.0010 cycles 3 427 928 721 to 3 483 881 476
MultiSource/Benchmarks/tramp3d-v4/tramp3d-v4 c1cfe1bbcaa5a44502a2ce1dc8ad1990 a606ae7ed6888288475e7b14b506f698 from 15.2060 to 15.2050 cycles 394 425 741 to 394 390 513
SingleSource/Benchmarks/SmallPT/smallpt a55897d3f1b34a808cf8579ebb82856d 57116ccef43a438580537031893e70cf from 0.0060 to 0.0060 cycles 15 898 258 261 to 15 791 900 523
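
To compare the rows above, the relative change in cycles can be computed from the last two columns. The helper below is a minimal sketch; the name cycles_delta_percent is an assumption for illustration and is not part of LNT or the test suite.

```cpp
#include <cstdint>

// Hypothetical helper: relative change in percent between two cycle counts.
// A negative result means fewer cycles after the patch.
double cycles_delta_percent(std::uint64_t before, std::uint64_t after) {
  return (static_cast<double>(after) - static_cast<double>(before)) * 100.0 /
         static_cast<double>(before);
}
```

For example, the lencod row (10 687 719 213 to 11 102 500 468) works out to roughly +3.9%, and the gs row (78 655 771 to 77 251 300) to roughly -1.8%.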

Matt added a subscriber: Matt. Jun 13 2022, 3:40 PM

@dtemirbulatov thanks, looks neutral.

Maybe @asbirlea has some results as well?

Does anybody see any major regressions?

If not, feel free to land. We can always revert if there are any problems.

This revision is now accepted and ready to land. Jun 29 2022, 5:47 AM
This revision was landed with ongoing or failed builds. Jun 30 2022, 12:46 PM
This revision was automatically updated to reflect the committed changes.