Load extra bits if suitably aligned. This allows using widened
3-vector loads on SI, and fixes legalization for <9 x s32> (which LSV
apparently forms frequently on lowered kernel argument lists).
Fix incorrectly treating these as legal on SI. This should emit a
64-bit store and a 32-bit store.
I think all of the load and store rules are just about complete, but
due for a rewrite.
As long as the alignment and size are both at least 32, I believe we can always support it. E.g., a dwordx4 load from a pointer that's 4-byte aligned is okay.
Though... admittedly in that case you may end up loading an additional cache line which you otherwise wouldn't load, because you cross a cacheline boundary. So maybe the right decision is to keep the test as is?
Either way, the decision ought to be documented as a comment.