Consider the following code:
struct S { int32_t a; int32_t b; int64_t c; int32_t d; };

S PartialCopy(const S& s) {
  S result;
  result.a = s.a;
  result.b = s.b;
  return result;
}
The two load/store pairs are not vectorized:
mov eax, dword ptr [rsi]
mov dword ptr [rdi], eax
mov eax, dword ptr [rsi + 4]
mov dword ptr [rdi + 4], eax
mov rax, rdi
ret
This is because the SLP vectorizer only considers 4xi32=i128 as a candidate,
since a 128-bit vector register exists for that width. It never considers
2xi32=i64, because the only 64-bit register available for it is a GPR.
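For reference, the IR the SLP vectorizer sees here is roughly the following (a simplified sketch: the value names, the sret convention, and the alignments are illustrative, not copied from actual compiler output):

%struct.S = type { i32, i32, i64, i32 }

define void @PartialCopy(ptr sret(%struct.S) %agg.result, ptr %s) {
  ; copy s.a: 4-byte load + store at offset 0
  %a = load i32, ptr %s, align 8
  store i32 %a, ptr %agg.result, align 8
  ; copy s.b: 4-byte load + store at offset 4
  %b.src = getelementptr inbounds i8, ptr %s, i64 4
  %b = load i32, ptr %b.src, align 4
  %b.dst = getelementptr inbounds i8, ptr %agg.result, i64 4
  store i32 %b, ptr %b.dst, align 4
  ret void
}

The two loads (and likewise the two stores) touch adjacent 4-byte slots of an 8-byte-aligned struct, so together they cover a single aligned 64-bit chunk; the vectorizer simply has no register it is willing to use for a 2-wide group.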
However, all operations that only manipulate values as arrays of
bits (e.g. Load, Store, Bitcast, and potentially Xor/And/Or) do not
strictly require vector registers. Let's call these bit-parallel
operations.
This change lets the SLP vectorizer vectorize trees composed of only bit-parallel operations using the native GPR size.
The example above will vectorize to:
mov rax, qword ptr [rsi]
mov qword ptr [rdi], rax
mov rax, rdi
ret
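In IR terms, the intent is that the two scalar load/store pairs are merged into one 64-bit-wide bit-parallel access. Reusing the illustrative names from the sketch above, this corresponds to something like the following (again a sketch; the exact IR the pass emits may differ, e.g. it could equally be expressed as an i64 load/store):

  ; one 64-bit-wide copy covering both s.a and s.b
  %v = load <2 x i32>, ptr %s, align 8
  store <2 x i32> %v, ptr %agg.result, align 8

Since the value is only loaded and stored, never operated on lane-wise, nothing forces it into a vector register, and it lowers to the single qword GPR move pair shown above.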
For now this only handles the most trivial bit-parallel instructions (Load, Store, Bitcast) and only homogeneous types (it will not vectorize a mixed group such as <4xi8, 1xi32>), but support for more can be added later.