These callbacks are used for SSE vector accesses.
In some computational programs these accesses dominate.
Currently we do 2 uninlined 8-byte accesses to handle them.
Inline and optimize them similarly to unaligned accesses.
This reduces the vector access benchmark time from 8 to 3 seconds.
Depends on D112603.
This code mostly duplicates the above. What if you wrote it as a 2-iteration for-loop? Will it generate worse or better code?