This is an alternative to D59669 which more aggressively extracts i1 elements from vXi1 bool vectors using a MOVMSK.
|55 ↗||(On Diff #196851)|
Annoyingly, this doesn't fall through the code and skip the zero-element gathers; instead it repeats the tests+branches.
|393 ↗||(On Diff #196851)|
Repeated PACKSS+MOVMSK instructions - I assume due to it having an undef argument in the xmm0 slot.
|1024 ↗||(On Diff #196851)|
We should be able to replace both shifts and the cmp with a single BT $32, %RAX?
|51 ↗||(On Diff #196851)|
More repeated PACKSS+MOVMSK
|4376 ↗||(On Diff #196851)|
We should be able to reduce this to a TEST
|4655 ↗||(On Diff #196851)|
|10 ↗||(On Diff #196851)|
I think this is effectively a NOT that we should be able to handle somehow.
Updated version, which actually looks pretty similar to D59669.
I'm performing all the extractions at the same time and only combining if (a) the source vector is used only by extractions and (b) there is more than one extract (I'd like to remove this limitation in the future).
The big difference is that D59669 tries to limit the combine to setcc uses only, which with our new SimplifyDemandedBits support is probably unnecessary.
This looks like a good refinement of the earlier patch, so I'm happy to abandon D59669 and move forward here.
|34919 ↗||(On Diff #197084)|
Matter of taste, but don't need to explicitly use "llvm::" here.
|34925–34926 ↗||(On Diff #197084)|
Could use a formula comment of the transform around here such as:
// extelt vXi1 X, MaskIdx --> ((movmsk X) & Mask) == Mask
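To make the suggested formula comment concrete, here is a minimal scalar sketch of the transform. It is not the patch's code: `modelMovmsk` and `extractBool` are hypothetical names, and the vector is modelled as an array of `int8_t` lanes whose sign bits stand in for the i1 elements.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of MOVMSK (an assumption for illustration, not an intrinsic):
// bit I of the result is the sign bit of lane I.
template <std::size_t N>
uint32_t modelMovmsk(const std::array<int8_t, N> &Lanes) {
  static_assert(N <= 32, "result must fit in 32 bits");
  uint32_t Bits = 0;
  for (std::size_t I = 0; I != N; ++I)
    if (Lanes[I] < 0)
      Bits |= uint32_t(1) << I;
  return Bits;
}

// extelt vXi1 X, Idx --> ((movmsk X) & Mask) == Mask, with Mask = 1 << Idx.
template <std::size_t N>
bool extractBool(const std::array<int8_t, N> &Lanes, std::size_t Idx) {
  assert(Idx < N && "element index out of range");
  const uint32_t Mask = uint32_t(1) << Idx;
  return (modelMovmsk(Lanes) & Mask) == Mask;
}
```

The point of the combine is that `modelMovmsk` is computed once and every extract from the same source becomes a cheap scalar mask test on that result.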
|34931 ↗||(On Diff #197084)|
Did framing this as:
((x & Mask) == Mask)
rather than:
((x & Mask) != 0)
make a difference in the output? If so, add a TODO comment about trying to avoid that problem.
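For reference, the two framings compute the same value whenever Mask has exactly one bit set, so any difference in the output would come from how each form is lowered, not from semantics. A small sketch of that equivalence follows; the helper names `allBitsSet`, `anyBitSet`, and `formsAgreeOnSingleBits` are hypothetical and not taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Form 1: every bit of Mask is set in X.
constexpr bool allBitsSet(uint32_t X, uint32_t Mask) {
  return (X & Mask) == Mask;
}

// Form 2: at least one bit of Mask is set in X.
constexpr bool anyBitSet(uint32_t X, uint32_t Mask) {
  return (X & Mask) != 0;
}

// Returns true if both forms agree on X for all 32 single-bit masks.
inline bool formsAgreeOnSingleBits(uint32_t X) {
  for (unsigned I = 0; I != 32; ++I) {
    const uint32_t Mask = uint32_t(1) << I;
    if (allBitsSet(X, Mask) != anyBitSet(X, Mask))
      return false;
  }
  return true;
}

// With a multi-bit mask the forms can differ: X = 0b01, Mask = 0b11.
static_assert(!allBitsSet(0x1u, 0x3u), "not all mask bits are set");
static_assert(anyBitSet(0x1u, 0x3u), "one mask bit is set");
```

If the combine is later extended to test several elements at once, the choice between the two forms would start to matter semantically as well.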