The general approach is to rely on unmerge and the existing combines, together with the build_vector combines from the patch stack starting with D109240. Unmerge to vector elements; avoid unmerging to subvectors (getLCMTy) and avoid the use of INSERT or EXTRACT.
Vector element manipulation that we expect to be combined is done by merging/unmerging individual elements (no subvectors).
The key difference is that an unmerge to vector elements can always be combined.
In the favorable cases where subvector unmerges can be combined, e.g. <4 x s16> -> <2 x s16>, a subvector unmerge currently gives better code than an unmerge to each element. These cases are covered by the build_vector combines (see the stack of build_vector patches starting with D109240, test file add.vNi16.ll.mir).
As for the cases where subvector unmerges cannot be combined:
For example, AMDGPU wants to apply fewer_vector_elements to go from <3 x s16> to <2 x s16>, but it first has to apply more_vector_elements to reach a multiple of <2 x s16> (<4 x s16>).
The getLCMTy approach is not combiner-friendly, since it produces:
%LCMTy(<12 x s16>) = G_CONCAT_VECTORS %a(<3 x s16>), %undef0(<3 x s16>), %undef1(<3 x s16>), %undef2(<3 x s16>)
%b(<4 x s16>), %(<4 x s16>), %undef1(<4 x s16>) = G_UNMERGE_VALUES %LCMTy(<12 x s16>)
Here %b takes some elements from %a and some from %undef0, but the combiner has no way to reference those elements (they are not 'named' by a VReg). Its best option is to use extract_vector_elt or unmerge on %a and %undef0 and then use build_vector for %b, but this creates more artifacts that cannot be combined and may not be legal, which may result in infinite loops. It is also an extra step compared to the proposal in this patch.
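The naming problem can be illustrated with a toy model. The sketch below is plain Python, not LLVM code: registers are modeled as lists of element labels, and `expressible_from` is a hypothetical helper that stands in for "the combiner can rewrite this value using only existing VRegs as whole operands". The register names and labels are made up for illustration.

```python
def expressible_from(piece, registers):
    """Return True if `piece` is a concatenation of whole registers.

    `piece` is a list of element labels; `registers` maps a register name to
    the list of element labels it holds. This is only a toy stand-in for what
    a combiner can reference, not an LLVM API.
    """
    i = 0
    while i < len(piece):
        for elts in registers.values():
            if piece[i:i + len(elts)] == elts:
                i += len(elts)
                break
        else:
            return False  # no register starts exactly at element i
    return True


# LCM-type path: only whole <3 x s16> registers exist.
vector_regs = {
    "a": ["a.0", "a.1", "a.2"],
    "undef0": ["u0.0", "u0.1", "u0.2"],
    "undef1": ["u1.0", "u1.1", "u1.2"],
    "undef2": ["u2.0", "u2.1", "u2.2"],
}
lcm_vector = [e for elts in vector_regs.values() for e in elts]  # <12 x s16>
b_lcm = lcm_vector[0:4]  # first <4 x s16> piece of the subvector unmerge

# Per-element path: %a is unmerged to scalars, so every element has a name.
scalar_regs = {"a0": ["a.0"], "a1": ["a.1"], "a2": ["a.2"], "undef": ["undef"]}
b_elementwise = ["a.0", "a.1", "a.2", "undef"]

print(b_lcm)
print(expressible_from(b_lcm, vector_regs))
print(expressible_from(b_elementwise, scalar_regs))
```

In this model the first piece of the LCM-type unmerge ends in the middle of %undef0, so it cannot be written in terms of existing registers, while the per-element version is simply a list of named scalars.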
INSERT and EXTRACT combines have the same problem, since they only work for specific (compatible) types.
List of changes:
moreElementsVectorDst, moreElementsVectorSrc: both first unmerge the input into individual elements; Dst then builds the vector by leaving out a few trailing elements, while Src builds a vector padded with a few undef elements.
CallLowering: argument lowering uses getCoverTy (cover v5 with the smallest number of v2 pieces, which gives v6 = 3 x v2) instead of the LCMTy style (v2 and v5 give v10, their least common multiple). This works better with the way arguments are passed in registers (no need to pad the input in physical registers with undef).
LegalizerHelper/AMDGPULegalizerInfo: Lower G_EXTRACT_VECTOR_ELT, G_INSERT_VECTOR_ELT and G_SHUFFLE_VECTOR using unmerge/build_vector. The lowering of vector loads into a power-of-2 load plus a remainder (repeated until we get a legal size, a 1-byte load/store at worst) no longer uses more/fewer elements.
AMDGPURegisterBankInfo: Use unmerge in load lowering.
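The element counts quoted for CallLowering above can be checked with a short calculation. This is a minimal Python sketch that models only element counts for vectors sharing one element type; `cover_num_elts` and `lcm_num_elts` are hypothetical helper names and do not reproduce the actual getCoverTy or LCM-type code.

```python
from math import gcd


def cover_num_elts(orig_elts, target_elts):
    """Smallest multiple of target_elts that is >= orig_elts (cover style)."""
    pieces = -(-orig_elts // target_elts)  # ceiling division
    return pieces * target_elts


def lcm_num_elts(orig_elts, target_elts):
    """Least common multiple of the two element counts (LCM style)."""
    return orig_elts * target_elts // gcd(orig_elts, target_elts)


orig, target = 5, 2  # cover a v5 argument with v2 pieces
cover = cover_num_elts(orig, target)
lcm = lcm_num_elts(orig, target)

# Total element count, then the number of undef elements needed as padding.
print(cover, cover - orig)
print(lcm, lcm - orig)
```

For v5 and v2 the cover style needs one padding element (v6), whereas the LCM style needs five (v10), which is the extra undef padding the cover approach avoids.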
Tested on an internal test suite; this fixes vector-related legalizer failures. See vector-legalizer.ll for some examples.
MIR tests require changes since they contain the old argument lowering approach. There is a hack in the legalizer for this that detects infinite loops (I did not use it during testing).