We need to use it to handle <16 x double> indirect indexes
in the AMDGPU BE.
The only visible change from adding it is in ARM cost model.
To me it looks reasonable. With doubling a vector size it
quadruples the cost up to the size 8 and then it did only
double it. Now it also quadruples, which seems a logical
progression to me.
Actual AMDGPU code is to follow, this is a common part, plus
load/store legalization in the AMDGPU BE not to break what