This dramatically improves memset for aligned buffers (no changes for
unaligned buffers).
For example, on Haswell throughput roughly doubles and comes close to the
maximum bandwidth: 30 B/cycle, up from 15 B/cycle before this change, against
a maximum of 32 B/cycle (about 94% of peak).
See the graph here:
https://docs.google.com/spreadsheets/d/1bbT5Oqj3e5SFNh_5oKpwghEQuLazHI95E0-htGrADZ4/pubchart?oid=1858075526&format=interactive
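
For readers without the diff at hand, here is a minimal, portable sketch of the general idea behind an aligned fast path. It is not the code from this change: the name `memset_sketch`, the `BLOCK_SIZE` constant, and the head/body/tail split are assumptions chosen for illustration, with 32 bytes picked to match the 32 B/cycle figure quoted above. Whether a compiler lowers each fixed-size `memcpy` to a single 32-byte store depends on the target and flags.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical block size: 32 bytes, matching the width quoted above. */
#define BLOCK_SIZE 32

/* Illustrative only: fills count bytes at dst with value. */
static void *memset_sketch(void *dst, int value, size_t count) {
    unsigned char *p = (unsigned char *)dst;
    const unsigned char byte = (unsigned char)value;

    /* Pre-filled pattern block that is copied into every full block. */
    unsigned char block[BLOCK_SIZE];
    for (size_t i = 0; i < BLOCK_SIZE; ++i) {
        block[i] = byte;
    }

    /* Head: byte stores until p reaches a BLOCK_SIZE boundary.
       For an already aligned buffer this loop does not run. */
    while (count > 0 && ((uintptr_t)p % BLOCK_SIZE) != 0) {
        *p++ = byte;
        --count;
    }

    /* Body: whole blocks at aligned addresses. A fixed-size memcpy
       gives the compiler the chance to emit one wide store per block. */
    while (count >= BLOCK_SIZE) {
        memcpy(p, block, BLOCK_SIZE);
        p += BLOCK_SIZE;
        count -= BLOCK_SIZE;
    }

    /* Tail: remaining bytes that do not fill a whole block. */
    while (count > 0) {
        *p++ = byte;
        --count;
    }
    return dst;
}
```

With a buffer that starts on a 32-byte boundary and whose size is a multiple of 32, only the body loop executes, which is the case the numbers above describe.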
Can this be done in a parent commit?