This is a continuation of the fix from:
http://reviews.llvm.org/D10662
and discussion in:
http://reviews.llvm.org/D12154
Here, we distinguish slow unaligned SSE (128-bit) accesses from slow unaligned scalar (64-bit and under) accesses. Other lowering (e.g., getOptimalMemOpType) assumes that unaligned scalar accesses are always OK, so this changes allowsMisalignedMemoryAccesses() to match that behavior.
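
A minimal standalone sketch of the intended logic (not the actual X86TargetLowering code; the SubtargetInfo struct and the IsUnalignedMem16Slow flag name are illustrative stand-ins for the real subtarget feature):

  #include <cstdio>

  struct SubtargetInfo {
    bool IsUnalignedMem16Slow; // set for cores where unaligned 16-byte
                               // SSE accesses are slow
  };

  // Returns true if a misaligned access of the given width is legal, and
  // reports through *Fast whether it is expected to be fast.
  bool allowsMisalignedMemoryAccesses(const SubtargetInfo &ST,
                                      unsigned SizeInBits, bool *Fast) {
    if (Fast) {
      if (SizeInBits <= 64)
        // Unaligned scalar accesses (64-bit and under) are always treated
        // as fast, matching the assumption in getOptimalMemOpType.
        *Fast = true;
      else if (SizeInBits == 128)
        // Unaligned 128-bit SSE accesses are fast only if the subtarget
        // says so.
        *Fast = !ST.IsUnalignedMem16Slow;
    }
    // x86 supports misaligned accesses of any type.
    return true;
  }

  int main() {
    SubtargetInfo SlowSSE{true};
    bool Fast = false;
    allowsMisalignedMemoryAccesses(SlowSSE, 64, &Fast);
    std::printf("unaligned i64 fast: %d\n", Fast);    // 1
    allowsMisalignedMemoryAccesses(SlowSSE, 128, &Fast);
    std::printf("unaligned v4i32 fast: %d\n", Fast);  // 0
  }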
The test case changes show that we now use an unaligned 8-byte load/store on a 64-bit CPU where we previously settled for unaligned 4-byte ops.
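
For illustration, this is the kind of source pattern those tests correspond to (copy8 and its parameter names are made up for the example): an 8-byte copy with no alignment guarantee, which can now lower to a single unaligned 8-byte op (movq) instead of two 4-byte ops (movl):

  #include <cstring>

  void copy8(char *Dst, const char *Src) {
    // No alignment known for Dst/Src; lowering may now pick one
    // unaligned 8-byte load/store on a 64-bit CPU.
    std::memcpy(Dst, Src, 8);
  }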
The overlapping accesses may be a separate bug.