For loads, pre-SSE4.1 targets can't perform NT loads at all, and from SSE4.1 onwards we can only perform vector-aligned NT loads, so if the alignment is less than that of an XMM register (16 bytes) we'll just end up using the regular unaligned vector loads anyway.
This is the first step towards fixing PR42026; the next step for stores will be to use SSE4A movntsd where possible and to avoid the stack spill on SSE2 targets.
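To make the load rule above concrete, here is a minimal standalone sketch of the decision as described in the text. It is not the code under review: `LoadKind`, `pickVectorLoad`, and its parameters are hypothetical names chosen for illustration, and it only models the 128-bit (XMM) case.

```cpp
// Hypothetical model of the load rule described above; not the actual patch.
enum class LoadKind { Regular, NonTemporal };

// isNonTemporal: the load carries a non-temporal hint.
// hasSSE41:      the target supports SSE4.1 (needed for MOVNTDQA).
// alignBytes:    known alignment of the pointer, in bytes.
LoadKind pickVectorLoad(bool isNonTemporal, bool hasSSE41, unsigned alignBytes) {
  constexpr unsigned XmmAlignBytes = 16; // 128-bit XMM register

  // An NT vector load is only usable when the target has SSE4.1 and the
  // pointer is at least XMM-aligned.
  if (isNonTemporal && hasSSE41 && alignBytes >= XmmAlignBytes)
    return LoadKind::NonTemporal;

  // Otherwise fall back to a regular (possibly unaligned) vector load.
  return LoadKind::Regular;
}
```

Under this reading, an under-aligned NT load is harmless to allow, because the NT hint cannot be honoured in that case anyway.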
So this says that if this is a non-temporal load of a vector,
either from a pointer that is so under-aligned that we can't even do a 128-bit aligned load from it,
or we don't even have SSE4.1 (so no aligned non-temporal vector loads at all),
then the under-aligned load is allowed, correct?
This kinda looks backwards to me?
I expected something like