We seem to be inheriting the cost from SSE4.1. But if we have 256-bit registers, we should be able to do this with just one extract to split the v16i16 and two v8i16->v8i32 extends, so our cost should be 3, not 4.
I guess the cost should be 3 if we count the extract we need to split the v16i16 input. New patch coming in a moment.
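For reference, the count of 3 can be sketched with a small scalar model. This is not LLVM code: `extractHigh`, `lowHalf`, `extend8`, and `extend16` are hypothetical helpers, each standing in for one vector instruction. The model assumes the low 128 bits of a 256-bit register can be read without an extra instruction, and it uses sign extension purely for illustration, since the thread does not say which extension is being costed.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of the v16i16 -> v16i32 lowering discussed above.
// Each helper that models a real instruction increments `ops`.
struct ExtendResult {
  std::array<int32_t, 8> lo; // first v8i32 result register
  std::array<int32_t, 8> hi; // second v8i32 result register
  int ops;                   // number of modelled instructions
};

// Models the single 128-bit extract of the upper half of the v16i16 input.
inline std::array<int16_t, 8> extractHigh(const std::array<int16_t, 16> &v,
                                          int &ops) {
  std::array<int16_t, 8> out{};
  for (std::size_t i = 0; i < 8; ++i)
    out[i] = v[i + 8];
  ++ops;
  return out;
}

// Models reading the lower half. Assumption: this is a subregister read
// and therefore does not count as an instruction.
inline std::array<int16_t, 8> lowHalf(const std::array<int16_t, 16> &v) {
  std::array<int16_t, 8> out{};
  for (std::size_t i = 0; i < 8; ++i)
    out[i] = v[i];
  return out;
}

// Models one v8i16 -> v8i32 extend (sign extension chosen for illustration).
inline std::array<int32_t, 8> extend8(const std::array<int16_t, 8> &v,
                                      int &ops) {
  std::array<int32_t, 8> out{};
  for (std::size_t i = 0; i < 8; ++i)
    out[i] = static_cast<int32_t>(v[i]);
  ++ops;
  return out;
}

// One extract plus two extends: three modelled instructions in total.
inline ExtendResult extend16(const std::array<int16_t, 16> &v) {
  int ops = 0;
  std::array<int16_t, 8> high = extractHigh(v, ops);
  std::array<int32_t, 8> lo = extend8(lowHalf(v), ops);
  std::array<int32_t, 8> hi = extend8(high, ops);
  return ExtendResult{lo, hi, ops};
}
```

Calling `extend16` on any 16-element input leaves `ops` at 3 in this model: one for the extract and one for each of the two extends.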