Page MenuHomePhabricator

[WebAssembly] Add wasm-specific vector shuffle builtin and intrinsic
Needs ReviewPublic

Authored by tlively on Aug 29 2019, 5:15 PM.

Details

Reviewers
aheejin
dschuff
Summary

Although using __builtin_shufflevector and the shufflevector
instruction works fine, they are not opaque to the optimizer. As a
result, DAGCombine can potentially reduce the number of shuffles and
change the shuffle masks. This is unexpected behavior for users of the
WebAssembly SIMD intrinsics, since it is impossible to tell if combining
shuffles will be a performance win without breaking the WebAssembly
abstraction and reasoning about the underlying platforms. This CL solves
the problem by adding a new shuffle intrinsic that is opaque to the
optimizers.

Event Timeline

tlively created this revision.Aug 29 2019, 5:15 PM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptAug 29 2019, 5:15 PM

Can you say what leads wasm users to maintain such an expectation when, for example, ARM users and x86 users don't?

I think the expectation is that if you use an intrinsics header that has an intrinsic for each machine instruction, that each intrinsic call should result in exactly one machine instruction with the same arguments you passed to it. If the vector operation is created some other way (e.g. autovectorization or even __builtin_shufflevector) then I agree that there doesn't necessarily have to be that expectation.

Oh, interesting I didn't notice that those are implementations of the target-specific intrinsics. I wonder if they do that so they can implement the intrinsics on hardware that doesn't support the corresponding instruction? In a situation other than that I think I'd be surprised if I used a builtin for a particular instruction and got something else.

x86 is aggressive about optimizing shuffles no matter where they came from. FWIW, InstCombine has a general rule that its not supposed to create a shuffle mask that didn't already exist in the IR except for special things like identity masks that would allow the shuffle to be removed. DAG combine is supposed to check with TargetLowering::isShuffleMaskLegal.

Oh, interesting I didn't notice that those are implementations of the target-specific intrinsics. I wonder if they do that so they can implement the intrinsics on hardware that doesn't support the corresponding instruction? In a situation other than that I think I'd be surprised if I used a builtin for a particular instruction and got something else.

One of the reasons is a belief that LLVM's optimizations are desirable, and that if there are cases where LLVM's optimizations make code worse, it's a bug in LLVM which should be fixed. Do you have any such cases?

I should say, I myself am sympathetic to the argument that if you write _mm_shuffle_ps, you might really want SHUFPS, and not __builtin_shufflevector, but I'm not aware of any target in LLVM that works this way, which is something to consider.

In fact, the argument that __mm_shuffle_ps is SHUFPS seems like it ought to be stronger for x86 than wasm, because x86 has about 100 different shuffle instructions each with their own opinion, while wasm just has one shuffle instruction that just does everything.

The context for this CL is https://github.com/emscripten-core/emscripten/issues/9340. The code that does the undesirable optimization is around llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:18776. I think it is a reasonable assumption that what you put in is what you get out, especially if you're trying to trigger particular WebAssembly engine behavior that the backend should not really be reasoning about. Now perhaps users should not be reasoning about it either, but given that they are I think this patch is reasonable.

The code that does the undesirable optimization is around llvm/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:18776.

Now line 18776 is a blank line :) You can take a permalink from the https://github.com/llvm/llvm-project repo.

DAG combine is supposed to check with TargetLowering::isShuffleMaskLegal.

In @tlively's example, it is DAGCombine, and it does check isShuffleMaskLegal. However for wasm, it appears that's not enough -- in wasm, all shuffle masks are legal, because you can do any shuffle in a single wasm instruction. This makes it tricky, because the user here is aiming for an x86-like cost model, but the LLVM wasm backend doesn't have any x86-specific knowledge, so it just tells DAGCombine to form any shuffle it sees fit.

I wonder if it would make sense to introduce a counterpart to isShuffleMaskLegal, which instead of returning a bool returned a cost value. And then, we could teach the wasm backend about certain shuffle patterns which are known to be fast across multiple architectures. Then we could teach DAGComine to check whether the new shuffle it wants to create is actually cheaper than the one it's replacing. Thoughts?

@sunfish That sounds like a useful mechanism to me. I'd be happy to work on that next week. Is the consensus that we should not merge this change and instead pursue that idea?

Yeah I think that can be a more generalized solution too.

tlively abandoned this revision.Sep 3 2019, 11:35 AM

Abandoning in favor of @sunfish's idea for introducing a cost mechanism for shuffle masks in DAGCombiner.

tlively edited the summary of this revision. (Show Details)Thu, Oct 17, 10:32 AM
tlively reclaimed this revision.Thu, Oct 17, 10:45 AM

I'm re-opening this revision. After discussion on https://github.com/WebAssembly/simd/issues/118, there is clear consensus that we do not want to break WebAssembly's abstraction and consider underlying platforms, so shuffles should not be merged without some sort of modification to the spec to serve as a platform-agnostic indication that the merged shuffle will be better.

aheejin added inline comments.Fri, Oct 18, 4:11 PM
llvm/include/llvm/IR/IntrinsicsWebAssembly.td
117

i32 is bigger than ImmLaneIdx32. Should we model this into something smaller, like i8? What happens if we specify an index grater than 31? (I think this question also applies to other intrinsics and builtins. I don't think it matters a lot given than all integers are larger than lane types though.)

llvm/lib/Target/WebAssembly/WebAssemblyISelLowering.cpp
1251

This looks rather straightforward... Can't we do this in TableGen?