This is an archive of the discontinued LLVM Phabricator instance.

GlobalISel[RFC]: Avoid use of G_INSERT and G_EXTRACT in Legalizer
Needs ReviewPublic

Authored by Petar.Avramovic on Aug 6 2021, 8:13 AM.

Details

Summary

General approach is to use power of unmerge and existing combines together with build_vector combines starting with D109240. Unmerge to vector elements, avoid unmerge to subvectors (getLCMTy) and use of INSERT or EXTRACT.
Vector element manipulation that we expect to combine will be done using merge/unmerge of each element (no subvectors).
The key difference is Unmerge to vector elements which can always be combined.
About the nice cases where subvector unmerges could be combined: <4 x s16> -> <2 x s16> would give better code that unmerge to each element atm. These are covered in build_vector combines (see stack of build_vector patches starting with D109240, test file add.vNi16.ll.mir).
About the cases where we can't combine subvector unmerges:
For example amdgpu wants to fewer_vector_elements <3 x s16> to <2 x s16>, but first it has to do more_vector_elements to multiple of <2 x s16> (<4 x s16>)
getLCMTy approach is not combiner friendly since it does:

%LCMTy(<12 x s16>) = G_CONCAT_VECTOR %a(<3 x s16>), %undef0(<3 x s16>), %undef1(<3 x s16>), %undef2(<3 x s16>)
%b(<4 x s16>), %(<4 x s16>), %undef1(<4 x s16>) = G_UNMERGE_VALUES %LCMTy(<12 x s16>)

Here %b takes some elements from %a and some from %undef0 but combiner has no way to reference those elements (they are not 'named' using VReg), its best chance is to extract_vector_elt or unmerge %a and %undef0 and use build_vector for %b but this creates more artifacts that won't be able to combine and maybe are not legal which may results in infinite loops. It is also extra step compared to the proposal in this patch.
INSERT or EXTRACT combines also have same problems since they only work for specific (compatible) types.

List of changes:
moreElementsVectorDst moreElementsVectorSrc: will first unmerge input to each element; Dst than builds vector by leaving out a few trailing elements, Src builds vector padded with a few undef elements.
CallLowering: argument lowering uses getCoverTy (cover v5 with least amount of v2 this is v6(3xv2)) instead of LCMTy style (v2 and v5 give v10 a least common multiple). This works better with the way arguments are passed in registers (no need to pad input in physical registers with undef).
LegalizerHelper/AMDGPULegalizerInfo: Lower G_EXTRACT_VECTOR_ELT, G_INSERT_VECTOR_ELT and G_SHUFFLE_VECTOR using unmerge/build_vector. Lower for vector loads into power-of-2 load and remainder (until we get legal size - 1 byte load/store at worst) no longer uses more/fewer elts.
AMDGPURegisterBankInfo: Use Unmerge in load lowering

Tested on internal test suite, fixes vector related legalizer fails, see: vector-legalizer.ll for some examples.
MIR tests require changes since they contain old argument lowering approach. There is a hack in legalizer for this that detects infinite loops (I did not use it during testing).

Diff Detail

Event Timeline

Petar.Avramovic created this revision.Aug 6 2021, 8:13 AM
Petar.Avramovic requested review of this revision.Aug 6 2021, 8:13 AM
Herald added a project: Restricted Project. · View Herald TranscriptAug 6 2021, 8:13 AM

My general feeling is new opcodes should not be necessary, and that you're just trying to hide deficiencies in how we happen to lower some types now. I believe it is possible to fix legalization for every type with the existing set of opcodes (excluding cases that require an intermediate type that exceeds the range of LLT)

Use unmerge/build_vector instead of new artifacts.

This requires more work during combine since we need to check if unmerged elements are not merged in shuffled order for a simple copy.

Ping. Also see build_vector combines starting with D109240.

Petar.Avramovic edited the summary of this revision. (Show Details)Sep 16 2021, 5:59 AM

Haven't finished grokking this yet, some comments so far.

llvm/include/llvm/CodeGen/GlobalISel/MachineIRBuilder.h
501
514

ditto

llvm/lib/CodeGen/GlobalISel/Legalizer.cpp
229–234

?

llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
2876

Why is this code needed? Can this be a follow up change?

Haven't finished grokking this yet, some comments so far.

Do you have some test cases you are interested in? High level summary would be that by doing unmerge to each element before changing number of elements in a vector we are able to reference each element using vreg. This should be enough since we don't deal with anything smaller. This fixes main problem I see in combines "referencing an element in subvector" required by "merge subvector with subvector(or single element) of another subvector".

llvm/lib/CodeGen/GlobalISel/Legalizer.cpp
229–234

This stops infinite loops that happen on many mir tests that I did not update yet. I would like to get some feedback first and update tests according to strategy we are going to implement (see irtranslator-function-args.ll for example I need to change all g_concats like that into unmerge in all mir tests).
Infinite loop is caused by legalization of artifact that cannot be combined. Basically lower into same thing that did not combine in the first place (eventually runs out of memory).

llvm/lib/CodeGen/GlobalISel/LegalizerHelper.cpp
2876

This was easiest way to get all loads/stores to work for amdgpu (complicated fewerElementsIf is used at the moment).

Can this be a follow up change?

I don't thnik so. We need all places that deal with vectors to use new approach (including reduceLoadStoreWidth which is on todo list). I expect to be able to deal with vectors of any numberOfElements in artifact combiner after this patch. Here is a brief explanation of the code bellow :

  • load lowering will load smaller pieces of legal sizes, unmerge each piece to vector_elts and create large merge
  • store lowering unmerges large store to vector_elts, them merges elements into pieces of legal store size
llvm/test/CodeGen/AMDGPU/GlobalISel/inst-select-fract.f64.mir