Based on experiments on gfx906, gfx908, gfx90a, and gfx1030, a wider global load / store is more performant than multiple narrower ones, independent of alignment. This is especially true when combining 8-bit loads / stores, where the speedup was usually 2x across all alignments.
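
To illustrate the kind of combining meant here, a minimal host-side C++ sketch follows. It is not the code in this change: `load_narrow` and `load_wide` are hypothetical names, and the sketch only shows the shape of the transformation (four 8-bit accesses versus one 32-bit access at an arbitrary alignment), not the measured speedup.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: four separate 8-bit loads, assembled into a
// 32-bit value with p[0] in the least significant byte.
uint32_t load_narrow(const uint8_t *p) {
  return uint32_t(p[0]) | (uint32_t(p[1]) << 8) |
         (uint32_t(p[2]) << 16) | (uint32_t(p[3]) << 24);
}

// Hypothetical helper: one 32-bit load covering the same four bytes.
// memcpy makes no alignment assumption about p, so this is valid
// for any address.
uint32_t load_wide(const uint8_t *p) {
  uint32_t v;
  std::memcpy(&v, p, sizeof v);
  return v;
}
```

On a little-endian target both helpers return the same value for the same pointer, including pointers that are not 4-byte aligned.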
Change-Id: I1713c6edfc189052b8a71dc1135f9a436c1042e0
I do not like the name here. It suggests it is global address space only.