Use load32_zero instead of load32_splat to load the low 32 bits from memory to v128. Test cases are added to cover this change.
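For context, here is a minimal sketch contrasting the two load shapes (not taken from the patch; the function names and memory setup are made up for illustration):

```wat
(module
  (memory 1)
  ;; Broadcast form: the loaded 32-bit value is duplicated into all four lanes.
  (func (export "low32_splat") (param $p i32) (result v128)
    local.get $p
    v128.load32_splat)
  ;; Zero-fill form: the loaded 32-bit value lands in lane 0 and the remaining
  ;; lanes are zeroed, which is sufficient when only the low bits are used.
  (func (export "low32_zero") (param $p i32) (result v128)
    local.get $p
    v128.load32_zero))
```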
This change extends an optimization in vector type legalization by widening scalars loaded from memory to 32 bits. The motivation is an i8x4-to-i32x4 widening we have seen in a hot function of Photoshop Web.
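As a rough illustration of the lowering discussed in the inline comments below, an i8x4-to-i32x4 sign-extending widen currently involves a shuffle plus shifts; the shuffle mask, operand order, and shift counts here are my own sketch, not the exact output checked in simd-offset.ll:

```wat
(module
  (memory 1)
  (func (export "sext_v4i8_to_v4i32") (param $p i32) (result v128)
    local.get $p
    v128.load32_zero           ;; the four source bytes land in the low 32 bits
    v128.const i32x4 0 0 0 0   ;; zero vector used as the second shuffle input
    i8x16.shuffle 0 16 16 16 1 16 16 16 2 16 16 16 3 16 16 16
                               ;; byte k moves to the low byte of i32 lane k
    i32.const 24
    i32x4.shl                  ;; move each byte to the top of its lane...
    i32.const 24
    i32x4.shr_s))              ;; ...and shift back arithmetically to sign extend
```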
| llvm/test/CodeGen/WebAssembly/simd-offset.ll | |
|---|---|
| 1196–1199 | It looks like there is some room for improvement here. These shifts aren't necessary, are they? It would be good to at least add a TODO about cleaning them up. |
| 1210 | Do you know why the zero vector is materialized here while the previous function uses a local.get to get an implicit zero vector? |
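As an aside on the 1210 question, both forms mentioned in that comment produce an all-zero vector; a tiny sketch (function names invented) of the two ways it can show up in the output:

```wat
(module
  ;; Explicit zero vector, materialized as a constant.
  (func (export "explicit_zero") (result v128)
    v128.const i32x4 0 0 0 0)
  ;; Implicit zero vector: Wasm locals are zero-initialized, so reading an
  ;; untouched v128 local also yields all-zero lanes.
  (func (export "implicit_zero") (result v128)
    (local $z v128)
    local.get $z))
```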
Thanks Thomas for reviewing this change!
| llvm/test/CodeGen/WebAssembly/simd-offset.ll | |
|---|---|
| 1196–1199 | Yes, a TODO can be added if a further change is needed, but I'm not sure I fully understand how to remove the shifts. Do you mean using two sign extends, e.g. i16x8.extend_low_i8x16_s followed by i32x4.extend_low_i16x8_s? |
| 1210 | The difference seems to be introduced by the vector legalizer, where the sign extend is expanded by VectorLegalizer::ExpandSIGN_EXTEND_VECTOR_INREG followed by VectorLegalizer::ExpandANY_EXTEND_VECTOR_INREG; the latter may introduce an undef node. The zero extend, on the other hand, is expanded by VectorLegalizer::ExpandZERO_EXTEND_VECTOR_INREG, which creates an explicit zero vector. |
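For reference, a sketch of the extend_low sequence asked about above (illustrative only; this is not what the patch currently emits):

```wat
(module
  (memory 1)
  (func (export "sext_v4i8_extend_low") (param $p i32) (result v128)
    local.get $p
    v128.load32_zero            ;; bytes 0..3 in the low 32 bits, upper lanes zero
    i16x8.extend_low_i8x16_s    ;; low 8 bytes -> 8 sign-extended 16-bit lanes
    i32x4.extend_low_i16x8_s))  ;; low 4 halves -> 4 sign-extended 32-bit lanes
```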
Thanks! Do you need me to land this?
| llvm/test/CodeGen/WebAssembly/simd-offset.ll | |
|---|---|
| 1196–1199 | Oh, I see, we need the shifts because they implement the sign extend part. Using the sequence of extend_low instructions is also a good idea. How would the native code from that solution compare? |
Yes, would you please help me land this change? The author name and email can be "Fanchen Kong <fanchen.kong@intel.com>". Thanks!
| llvm/test/CodeGen/WebAssembly/simd-offset.ll | |
|---|---|
| 1196–1199 | On x64, the shuffle + shifts solution would likely lower to a byte shuffle (or a packed zero extend if the zero vector is detectable), a packed shift left, and a packed arithmetic shift right. For the sequence of extend_low instructions, the expected code is a single packed byte-to-dword sign extend. The current solution is not bad if the byte shuffle can be reduced by the VM. The extend_low sequence seems a little better on x64, but I'm not sure that's the case for all platforms. |
| llvm/test/CodeGen/WebAssembly/simd-offset.ll | |
|---|---|
| 1196–1199 | Thanks for the detail. I'll land this as-is because it is an improvement over the status quo, but we should keep that other possibility in mind. |