This is an archive of the discontinued LLVM Phabricator instance.

[WebAssembly] Add wasm-specific vector shuffle builtin and intrinsic
ClosedPublic

Authored by tlively on Aug 29 2019, 5:15 PM.

Details

Summary

Although using __builtin_shufflevector and the shufflevector
instruction works fine, they are not opaque to the optimizer. As a
result, DAGCombine can potentially reduce the number of shuffles and
change the shuffle masks. This is unexpected behavior for users of the
WebAssembly SIMD intrinsics who have crafted their shuffles to
optimize the code generated by engines. This patch solves the problem
by adding a new shuffle intrinsic that is opaque to the optimizers in
line with the decision of the WebAssembly SIMD contributors at
https://github.com/WebAssembly/simd/issues/196#issuecomment-622494748. In
the future we may implement custom DAG combines to properly optimize
shuffles and replace this solution.

Diff Detail

Event Timeline

tlively created this revision.Aug 29 2019, 5:15 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptAug 29 2019, 5:15 PM

Can you say what leads wasm users to maintain such an expectation when, for example, ARM users and x86 users don't?

I think the expectation is that if you use an intrinsics header that has an intrinsic for each machine instruction, that each intrinsic call should result in exactly one machine instruction with the same arguments you passed to it. If the vector operation is created some other way (e.g. autovectorization or even __builtin_shufflevector) then I agree that there doesn't necessarily have to be that expectation.

Oh, interesting I didn't notice that those are implementations of the target-specific intrinsics. I wonder if they do that so they can implement the intrinsics on hardware that doesn't support the corresponding instruction? In a situation other than that I think I'd be surprised if I used a builtin for a particular instruction and got something else.

x86 is aggressive about optimizing shuffles no matter where they came from. FWIW, InstCombine has a general rule that its not supposed to create a shuffle mask that didn't already exist in the IR except for special things like identity masks that would allow the shuffle to be removed. DAG combine is supposed to check with TargetLowering::isShuffleMaskLegal.

Oh, interesting I didn't notice that those are implementations of the target-specific intrinsics. I wonder if they do that so they can implement the intrinsics on hardware that doesn't support the corresponding instruction? In a situation other than that I think I'd be surprised if I used a builtin for a particular instruction and got something else.

One of the reasons is a belief that LLVM's optimizations are desirable, and that if there are cases where LLVM's optimizations make code worse, it's a bug in LLVM which should be fixed. Do you have any such cases?

I should say, I myself am sympathetic to the argument that if you write _mm_shuffle_ps, you might really want SHUFPS, and not __builtin_shufflevector, but I'm not aware of any target in LLVM that works this way, which is something to consider.

In fact, the argument that __mm_shuffle_ps is SHUFPS seems like it ought to be stronger for x86 than wasm, because x86 has about 100 different shuffle instructions each with their own opinion, while wasm just has one shuffle instruction that just does everything.

The context for this CL is https://github.com/emscripten-core/emscripten/issues/9340. The code that does the undesirable optimization is around llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:18776. I think it is a reasonable assumption that what you put in is what you get out, especially if you're trying to trigger particular WebAssembly engine behavior that the backend should not really be reasoning about. Now perhaps users should not be reasoning about it either, but given that they are I think this patch is reasonable.

The code that does the undesirable optimization is around llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:18776.

Now line 18776 is a blank line :) You can take a permalink from the https://github.com/llvm/llvm-project repo.

DAG combine is supposed to check with TargetLowering::isShuffleMaskLegal.

In @tlively's example, it is DAGCombine, and it does check isShuffleMaskLegal. However for wasm, it appears that's not enough -- in wasm, all shuffle masks are legal, because you can do any shuffle in a single wasm instruction. This makes it tricky, because the user here is aiming for an x86-like cost model, but the LLVM wasm backend doesn't have any x86-specific knowledge, so it just tells DAGCombine to form any shuffle it sees fit.

I wonder if it would make sense to introduce a counterpart to isShuffleMaskLegal, which instead of returning a bool returned a cost value. And then, we could teach the wasm backend about certain shuffle patterns which are known to be fast across multiple architectures. Then we could teach DAGComine to check whether the new shuffle it wants to create is actually cheaper than the one it's replacing. Thoughts?

@sunfish That sounds like a useful mechanism to me. I'd be happy to work on that next week. Is the consensus that we should not merge this change and instead pursue that idea?

Yeah I think that can be a more generalized solution too.

tlively abandoned this revision.Sep 3 2019, 11:35 AM

Abandoning in favor of @sunfish's idea for introducing a cost mechanism for shuffle masks in DAGCombiner.

tlively edited the summary of this revision. (Show Details)Oct 17 2019, 10:32 AM
tlively reclaimed this revision.Oct 17 2019, 10:45 AM

I'm re-opening this revision. After discussion on https://github.com/WebAssembly/simd/issues/118, there is clear consensus that we do not want to break WebAssembly's abstraction and consider underlying platforms, so shuffles should not be merged without some sort of modification to the spec to serve as a platform-agnostic indication that the merged shuffle will be better.

aheejin added inline comments.Oct 18 2019, 4:11 PM
llvm/include/llvm/IR/IntrinsicsWebAssembly.td
143

i32 is bigger than ImmLaneIdx32. Should we model this into something smaller, like i8? What happens if we specify an index grater than 31? (I think this question also applies to other intrinsics and builtins. I don't think it matters a lot given than all integers are larger than lane types though.)

llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
1369

This looks rather straightforward... Can't we do this in TableGen?

tlively planned changes to this revision.Dec 13 2019, 11:03 AM

This issue is still not resolved.

tlively edited the summary of this revision. (Show Details)May 7 2020, 5:07 PM
tlively updated this revision to Diff 262795.May 7 2020, 5:44 PM
tlively marked an inline comment as done.
tlively edited the summary of this revision. (Show Details)
  • Rebase and update intrinsics header
tlively updated this revision to Diff 262798.May 7 2020, 6:02 PM
tlively marked an inline comment as done.
  • Rebase, update intrinsics header, and address comment

Ok, this is ready for review for real this time. The WebAssembly SIMD contributors have decided that this is an appropriate direction to go in, and we are leaving the door open for future improvements.

llvm/include/llvm/IR/IntrinsicsWebAssembly.td
143

It turns out that it would have been an ISel failure. I fixed this to replace an out-of-bounds indices with 0.

llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
1369

No, I don't know of a simple way to handle undef lanes in TableGen. I could look into using custom patterns and transform nodes, but in the end this code is probably simpler the way it is.

aheejin accepted this revision.May 8 2020, 6:33 PM
This revision is now accepted and ready to land.May 8 2020, 6:33 PM
This revision was automatically updated to reflect the committed changes.