PR32743 - Avoid folding non-temporal aligned loads (when the CPU supports them) into other instructions, which would turn them into temporal vector loads and pollute the caches.
PR32744 - Split 256-bit vector non-temporal aligned loads on AVX1 targets to keep them non-temporal.
These can be committed separately, but they are so interrelated that I thought it better to get them reviewed together.
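As a rough illustration of the intended behaviour (not the code in this patch), the sketch below models how an aligned non-temporal vector load might be broken up per feature level. It assumes 128-bit NT loads need SSE4.1 (MOVNTDQA) and 256-bit NT loads need AVX2. The names `X86Level` and `ntLoadParts` are hypothetical and exist only for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical feature levels, ordered so each one implies those before it.
enum class X86Level { SSE2, SSE41, AVX1, AVX2 };

// Returns the widths (in bits) of the non-temporal loads used to implement one
// aligned NT vector load of `bits` bits (128 or 256). An empty result means no
// NT load instruction is available, so an ordinary (temporal) load is used.
std::vector<int> ntLoadParts(int bits, X86Level level) {
  // Assumption: the NT load instruction (MOVNTDQA) first appears in SSE4.1.
  if (level < X86Level::SSE41)
    return {};

  // Assumption: 256-bit NT loads require AVX2; before that the limit is 128.
  const int maxNTBits = (level >= X86Level::AVX2) ? 256 : 128;

  // The load fits in a single NT instruction.
  if (bits <= maxNTBits)
    return {bits};

  // Otherwise split into equal NT-sized pieces so every piece stays
  // non-temporal (e.g. 256-bit on AVX1 becomes two 128-bit loads).
  return std::vector<int>(static_cast<std::size_t>(bits / maxNTBits), maxNTBits);
}
```

Under these assumptions, `ntLoadParts(256, X86Level::AVX1)` yields two 128-bit pieces, while the same load on `X86Level::AVX2` stays a single 256-bit load.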
"pre-AVX2" (or "targets without AVX2"), no? I'd expect this to also happen on SSE4.1, which also has 128-bit NT loads.