I recently discovered a performance regression in
test-suite/Multisource/Benchmarks/ptrdist/ks/ks.test solely due to
changing alignment. If you add 16 nops to the top of
FindMaxGpAndSwap, performance goes down by about 10%. If you then add
an alignment directive of 32 bytes at the top of the first loop of the
function, the performance comes back.
Verified on both Haswell and Sandybridge.
The settings just above differ depending on whether OptSize is set. If there is an impact on code size, it makes sense to change the way PreLoopAlignment is handled as well.
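
To make the nop experiment concrete, here is a minimal sketch of the alignment arithmetic involved. The helper names (paddingToAlign, isAligned) and the offsets used in the note below are hypothetical and chosen only for illustration. They are not taken from the ks binary or from LLVM, and the sketch assumes each nop is one byte.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of padding bytes needed to move `offset` up to
// the next multiple of `alignment`. `alignment` must be a power of two.
inline std::uint64_t paddingToAlign(std::uint64_t offset,
                                    std::uint64_t alignment) {
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0 &&
         "alignment must be a power of two");
  // (offset & (alignment - 1)) is the distance past the previous boundary;
  // the outer mask turns a full `alignment` into 0 for aligned offsets.
  return (alignment - (offset & (alignment - 1))) & (alignment - 1);
}

// Hypothetical helper: true if `offset` already sits on an `alignment` boundary.
inline bool isAligned(std::uint64_t offset, std::uint64_t alignment) {
  return paddingToAlign(offset, alignment) == 0;
}
```

If the loop header happened to sit at offset 0x40, it would be 32-byte aligned. Sixteen one-byte nops before it would move it to 0x50, which is still 16-byte aligned but no longer 32-byte aligned, and a 32-byte alignment directive would then insert paddingToAlign(0x50, 32) == 16 bytes of padding to restore the boundary.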