Currently we fail to lower non-termporal stores for 256+ bit vectors
to STNPQ, because type legalization will split them up to 128 bit stores
and because there are no single non-temporal stores, creating STPNQ
in the Load/Store optimizer would be quite tricky.
This patch adds custom lowering for 256 bit non-temporal vector stores
to improve the generated code.
Should this be simm7s16? Same for am_indexed7s128