This is an archive of the discontinued LLVM Phabricator instance.

[WebAssembly] Improve codegen for loading scalars from memory to v128
ClosedPublic

Authored by fanchenkong1 on Sep 19 2022, 11:56 PM.

Details

Summary

Use load32_zero instead of load32_splat to load the low 32 bits from memory to v128. Test cases are added to cover this change.

Diff Detail

Event Timeline

fanchenkong1 created this revision.Sep 19 2022, 11:56 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 19 2022, 11:56 PM
fanchenkong1 requested review of this revision.Sep 19 2022, 11:56 PM
Herald added a project: Restricted Project. · View Herald TranscriptSep 19 2022, 11:56 PM

minor change

This change extends an optimization at Legalize vector types by widening on load scalars to 32 bits. Since we have seen i8x4 to i32x4 widen in a hot function of PhotoShop Web.

tlively added inline comments.Sep 20 2022, 11:18 AM
llvm/test/CodeGen/WebAssembly/simd-offset.ll
1196–1199

It looks like there is some room for improvement here. These shifts aren't necessary, are they? It would be good to at least add a TODO about cleaning them up.

1210

Do you know why the zero vector is materialized here while the previous function uses a local.get to get an implicit zero vector?

Thanks Thomas for reviewing this change!

llvm/test/CodeGen/WebAssembly/simd-offset.ll
1196–1199

Yes, a TODO can be added if further change is needed. But I'm not sure if I fully understand what to do to remove the shifts. Does it mean by using two sign extend? e.g.

i16x8.extend_low_i8x16_s
i32x4.extend_low_i16x8_s
1210

The difference seems to be introduced by the vector legalizer, where the sign extend is expanded by VectorLegalizer::ExpandSIGN_EXTEND_VECTOR_INREG, followed with VectorLegalizer::ExpandANY_EXTEND_VECTOR_INREG. The later one may introduce an undef node. While the zero extend is expanded by VectorLegalizer::ExpandZERO_EXTEND_VECTOR_INREG, where an explicit zero vector is created.

tlively accepted this revision.Sep 21 2022, 7:44 AM

Thanks! Do you need me to land this?

llvm/test/CodeGen/WebAssembly/simd-offset.ll
1196–1199

Oh, I see, we need the shifts because they implement the sign extend part. Using the sequence of extend_low instructions is also a good idea. How would the native code from that solution compare?

This revision is now accepted and ready to land.Sep 21 2022, 7:44 AM

Thanks! Do you need me to land this?

Yes, would you please help me land this change? The author name and mail could be "Fanchen Kong <fanchen.kong@intel.com>". Thanks!

llvm/test/CodeGen/WebAssembly/simd-offset.ll
1196–1199

On x64, for the shuffle + shifts solution, the native code maybe,

shuffle byte (or packed zero extend if zero vector is detectable)
packed shift left
packed shift right arithmetic

For sequence of extend_low, the expected code can be

packed byte to dword sign extend

The current solution seems not bad if shuffle byte can be reduced at VM. The extend_low sequence seems to be a little better on x64, but I'm not sure if thats the case for all platforms.

tlively added inline comments.Sep 21 2022, 8:56 PM
llvm/test/CodeGen/WebAssembly/simd-offset.ll
1196–1199

Thanks for the detail. I'll land this as-is because it is an improvement over the status quo, but we should keep that other possibility in mind.

This revision was landed with ongoing or failed builds.Sep 21 2022, 9:05 PM
This revision was automatically updated to reflect the committed changes.