This is an archive of the discontinued LLVM Phabricator instance.

[WebAssembly] v8x16.swizzle and rewrite BUILD_VECTOR lowering
ClosedPublic

Authored by tlively on Oct 4 2019, 4:49 PM.

Details

Summary

Adds the new v8x16.swizzle SIMD instruction as specified at
https://github.com/WebAssembly/simd/blob/master/proposals/simd/SIMD.md#swizzling-using-variable-indices.
In addition to adding swizzles as a candidate lowering in
LowerBUILD_VECTOR, also rewrites and simplifies the lowering to
minimize the number of replace_lanes necessary rather than trying to
minimize code size. This leads to more uses of v128.const instead of
splats, which is expected to increase performance.

The new code will be easier to tune once V8 implements all the vector
construction operations, and it will also be easier to add new
candidate instructions in the future if necessary.

Event Timeline

tlively created this revision.Oct 4 2019, 4:49 PM
Herald added a project: Restricted Project. · View Herald TranscriptOct 4 2019, 4:49 PM
  • I remember before we had a somewhat complicated logic to calculate the number of bytes of total instructions of each case of the case we use v128.const and vs. when we use splats. Don't we need that anymore? Can we make the decision solely based the number of swizzles / consts / and splats?
  • Is the performance of v128.const better than splats? How is performance of swizzles compared to v128.const?
llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
1350

Would using count_if in place of find_if be simpler?

1384

Nit: Variable names for the same things in GetSwizzleSrcs are SrcVec and IndexVec. Making the variable names same in the two places might make reading easier.

1388

Is using forward_as_tuple any different from using tie again in this case, given that this is not passed as an argument to a function?

1426

It's not in this CL, but is there a case this condition is not satisfied?

tlively updated this revision to Diff 223922.Oct 8 2019, 12:28 PM
tlively marked 3 inline comments as done.
  • Address variable naming comment
  • I remember before we had a somewhat complicated logic to calculate the number of bytes of total instructions of each case of the case we use v128.const and vs. when we use splats. Don't we need that anymore? Can we make the decision solely based the number of swizzles / consts / and splats?

Minimizing code size is not as important for SIMD as maximizing performance, so I dumped that complicated logic. I am led to believe that v128.const will be faster than splats once it is implemented, but we have no way to measure yet. We can make the decision based on whatever heuristic we want, but minimizing number of instructions seems like a good metric for now until we can run experiments to tune the selection algorithm.

  • Is the performance of v128.const better than splats? How is performance of swizzles compared to v128.const?

Yes, I believe v128.const will be faster than splats. I don't know how swizzles and v128.const compare, but I do know that emulating swizzles requires a lot of instructions per lane but emulating a v128.const only requires a single replace_lane and constant per lane. So it makes sense to prefer swizzles over v128.consts for now.

llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
1350

No, I need to get the iterator to the proper entry since I'm using a vector as an associative array here. If I used count_if and it returned 1, I would still need to find the entry to increment the count, so find_if is simpler because it gives me that entry directly.

1388

It turns out you can't nest std::tie. I have no idea why, but I got this solution from https://stackoverflow.com/questions/21298732/can-we-do-deep-tie-with-a-c1y-stdtie-like-function.

1426

Yes, for example when doing a sign extending load of an i8 to an i32 then splatting that i32.

aheejin accepted this revision.EditedOct 8 2019, 10:59 PM
  • I remember before we had a somewhat complicated logic to calculate the number of bytes of total instructions of each case of the case we use v128.const and vs. when we use splats. Don't we need that anymore? Can we make the decision solely based the number of swizzles / consts / and splats?

Minimizing code size is not as important for SIMD as maximizing performance, so I dumped that complicated logic. I am led to believe that v128.const will be faster than splats once it is implemented, but we have no way to measure yet. We can make the decision based on whatever heuristic we want, but minimizing number of instructions seems like a good metric for now until we can run experiments to tune the selection algorithm.

Wouldn't minimizing the number of instruction be the same thing as minimizing the number of bytes, only more inaccurate?

  • Is the performance of v128.const better than splats? How is performance of swizzles compared to v128.const?

Yes, I believe v128.const will be faster than splats. I don't know how swizzles and v128.const compare, but I do know that emulating swizzles requires a lot of instructions per lane but emulating a v128.const only requires a single replace_lane and constant per lane. So it makes sense to prefer swizzles over v128.consts for now.

I don't understand this part well. If swizzles are a lot more complicated that v128.const in execution, doesn't that mean swizzles will likely to take longer to execute in wasm? Why the opposite?

The above are just some passing questions, but I'm not suggesting we restore the byte computation logic or another more complicated logic at this point, given that we don't have measurable any performance data at hand. Optimizing at this point seems too premature anyway. LGTM.

llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
1384

In GetSwizzleSrcs, IndexVec is still IndexVec, while SrcVec was changed to `SwizzleSrc. Was that intentional?

This revision is now accepted and ready to land.Oct 8 2019, 10:59 PM

Wouldn't minimizing the number of instruction be the same thing as minimizing the number of bytes, only more inaccurate?

It's true that minimizing instructions approximates minimizing bytes, but it also stands on its own as a reasonable metric. In this case minimizing instructions makes more sense than minimizing bytes.

If swizzles are a lot more complicated that v128.const in execution, doesn't that mean swizzles will likely to take longer to execute in wasm? Why the opposite?

Swizzles lower directly to hardware instructions so they are fast for engines to execute. But doing the same operation without a swizzle instruction would require a long sequence of other wasm instructions and therefore be slow to execute. Because this difference is large for swizzles it is a good idea to prefer to use them when possible.

tlively marked an inline comment as done.Oct 9 2019, 10:37 AM
tlively added inline comments.
llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
1384

Not intentional! Thanks.

This revision was automatically updated to reflect the committed changes.

If swizzles are a lot more complicated that v128.const in execution, doesn't that mean swizzles will likely to take longer to execute in wasm? Why the opposite?

Swizzles lower directly to hardware instructions so they are fast for engines to execute. But doing the same operation without a swizzle instruction would require a long sequence of other wasm instructions and therefore be slow to execute. Because this difference is large for swizzles it is a good idea to prefer to use them when possible.

We are deciding which one among const/swizzle/splat to use based on the number of lanes hit by the instruction. The rest is the number of replace_lanes, so I don't think swizzles are more expensive to emulate than others, because after a single const/swizzle/splat, all emulation cost is down to the number of replace_lanes...? Anyway, not really related to the CL itself