PR32743 - Avoid folding non-temporal aligned loads (when the CPU supports them) into other instructions, since a folded load becomes an ordinary temporal vector load and pollutes the caches.
PR32744 - Split 256-bit non-temporal aligned vector loads into 128-bit halves on AVX1 targets to keep them non-temporal; the 256-bit form of VMOVNTDQA requires AVX2.
These can be committed separately, but they are so interrelated that I thought it better to have them reviewed together.
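
For reviewers who want a source-level picture of the PR32744 split, here is a rough sketch using intrinsics rather than the actual lowering code. The helper names (split_nt_load_256, copy32_via_split_nt_load, split_nt_load_roundtrips) are hypothetical and not part of the patch; the sketch assumes GCC or Clang on x86 and a 32-byte aligned source pointer.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper, not from the patch: performs a 256-bit load as two
// 128-bit non-temporal loads, which is all an AVX1-only target can do.
__attribute__((target("avx")))
static __m256i split_nt_load_256(const void *p) {
    // Assumption of this sketch: the original 256-bit load was 32-byte aligned.
    assert(reinterpret_cast<std::uintptr_t>(p) % 32 == 0);
    __m128i *src = const_cast<__m128i *>(static_cast<const __m128i *>(p));
    __m128i lo = _mm_stream_load_si128(src);      // MOVNTDQA, low 16 bytes
    __m128i hi = _mm_stream_load_si128(src + 1);  // MOVNTDQA, high 16 bytes
    // Recombine the halves into one 256-bit value (AVX1 insert).
    return _mm256_insertf128_si256(_mm256_castsi128_si256(lo), hi, 1);
}

// Wrapper that keeps __m256i values inside AVX-enabled functions, so callers
// built without -mavx never pass or return a 256-bit vector.
__attribute__((target("avx")))
void copy32_via_split_nt_load(const void *src, void *dst) {
    _mm256_storeu_si256(static_cast<__m256i *>(dst), split_nt_load_256(src));
}

// Checks that the split load reads the same 32 bytes a single load would.
bool split_nt_load_roundtrips() {
    if (!__builtin_cpu_supports("avx")) {
        return true;  // nothing to check on a CPU without AVX
    }
    alignas(32) std::uint8_t in[32];
    alignas(32) std::uint8_t out[32] = {};
    for (int i = 0; i < 32; ++i) {
        in[i] = static_cast<std::uint8_t>(i * 3 + 1);
    }
    copy32_via_split_nt_load(in, out);
    return std::memcmp(in, out, sizeof(in)) == 0;
}
```

The point of the sketch is only that both halves go through the non-temporal load instruction, so neither is turned into a regular cached 256-bit load.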