Page MenuHomePhabricator

[AMDGPU][GISel] Combine V2S16 G_EXTRACT/INSERT_VECTOR_ELT
AbandonedPublic

Authored by Pierre-vh on Sep 29 2022, 3:41 AM.

Details

Reviewers
arsenm
foad
Summary
  • Add a pre-legalizer combine to rewrite G_INSERT_VECTOR_ELT and G_EXTRACT_VECTOR_ELT for V2S16 into another form using G_UNMERGE_VALUE and G_BUILD_VECTOR.

Depends on D134953

Diff Detail

Event Timeline

Pierre-vh created this revision.Sep 29 2022, 3:41 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 29 2022, 3:41 AM
Pierre-vh requested review of this revision.Sep 29 2022, 3:41 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 29 2022, 3:41 AM
Pierre-vh edited the summary of this revision. (Show Details)Sep 29 2022, 3:44 AM
Pierre-vh edited the summary of this revision. (Show Details)

The trunc combine LGTM. The vector combine I'm not sure about

llvm/lib/Target/AMDGPU/AMDGPUPostLegalizerCombiner.cpp
309–320 ↗(On Diff #463830)

This part LGTM

322–326 ↗(On Diff #463830)

Can use replaceSingleDefInstWithReg

llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160

This combine is too targeted (and I'm pretty sure it's redundant). The generic combiner already turns G_INSERT_VECTOR_ELT with constant chains into G_BUILD_VECTOR.

209

We don't really want to have 16-bit unmerges. I'm not sure how this combine helps you with the current problem

Pierre-vh added inline comments.Sep 29 2022, 7:38 AM
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160

This combine is the most important of the two though
I see that there's indeed a matchCombineInsertVecElts/matchExtractVecEltBuildVec but it didn't work at all in the mad_mix cases, I think it's because the source isn't a BUILD_VECTOR in those cases but instead it's operations such as G_FPTRUNC <2 x 16> so the combine isn't redundant as far as I can see, it's just more focused on V2S16s

209

It seems that in all of those cases, the unmerge ends up removed because the source value gets scalarized.
e.g. we have G_UNMERGE_VALUE (v2s16 (G_FPTRUNC ...)) but the trunc gets split into two scalar operations, removing the unmerge.
Why can't we have 16 bit unmerges? Won't the legalizer deal with them if they shouldn't be there?

Pierre-vh added inline comments.Sep 29 2022, 8:03 AM
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160

I'd really like to keep this combine because it makes D134354 work just as well without needing to change the extract/insert vector element legalization rules

Perhaps I can add some code to check if the operation that sources the vector is illegal, and only do it when the source operation isn't legal (with the vector type)?
I could maybe even make the combine less targeted, not 2x16 specific but let it work on any operation as long as its source is illegal (and probably going to be scalarized) ? Would that be better ?

arsenm added inline comments.Sep 29 2022, 2:50 PM
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160

It's more important to make progress on the vector handling strategy than trying to get mad_mix to work immediately. Most of these should work for general types, not just this one case.

matchCombineInsertVecElts is primarily looking for G_BUILD_VECTOR sources that it created itself, so the source not being a build_vector should not be an issue. It also looks weirdly limited; for example the loop seems to not accept undef elements.

I've also been thinking it's likely a mistake to be doing vector combines in terms of the merge/unmerge artifacts. Vector operation combines should probably aim to use other vector operations, and leave the merge/unmerge for the legalizer

Can you compare to what happens in the DAG? It looks to me like <2 x i16> insertelements are turned into shuffles by combineInsertEltToShuffle, which later lower into a build_vector of the extracts of the individual elements

209

Because we don't really have 16-bit registers. An extract of 16-bit values implies a cast and shift of some kind. As I mentioned above, I also think using unmerge outside of the legalizer should probably be avoided

Pierre-vh planned changes to this revision.Sep 30 2022, 3:44 AM
Pierre-vh added inline comments.Sep 30 2022, 3:57 AM
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160

In the DAG: insert_vector_elt -> vector_shuffle<2,1> -> BUILD_VECTOR # D:1 t69, t44
So what we need in the end is a BUILD_VECTOR for the matching to be done.

What would be a good way forward then if we remove this combine?

  • Leave the regressions in for now
  • Add some combine to match G_BITCAST (G_OR (G_AND ..., ...), ...) into G_BUILD_VECTOR? (aka undoing the legalizer's transform, not especially a fan of that)
  • Move this combine to the custom legalizer: customIf V2S16+ add a code path to handle V2S16 manually.
Pierre-vh requested review of this revision.Sep 30 2022, 3:57 AM
Pierre-vh retitled this revision from [AMDGPU][GISel] Better support for V2S16 G_EXTRACT/INSERT_VECTOR_ELT to [AMDGPU][GISel] Combine V2S16 G_EXTRACT/INSERT_VECTOR_ELT.Sep 30 2022, 5:30 AM
Pierre-vh edited the summary of this revision. (Show Details)
Pierre-vh edited the summary of this revision. (Show Details)
Pierre-vh updated this revision to Diff 464228.Sep 30 2022, 5:30 AM

Rebase, pull the trunc combine into D134953 instead

arsenm added inline comments.Sep 30 2022, 6:38 AM
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160

Without a compelling reason to do something else, I'm inclined to start by following what the DAG is doing. That would mean:

  1. Adding a combine for insert_vector_elt -> shufflevector
  2. Teaching the default lowering for insert_vector_elt to use a shuffle for a constant index before trying to go to the stack
arsenm added inline comments.Sep 30 2022, 6:41 AM
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160

Probably also worth porting all the low hanging fruit in visitINSERT_VECTOR_ELT/visitEXTRACT_VECTOR_ELT

Pierre-vh added inline comments.Oct 3 2022, 12:27 AM
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160

The current DAG combine does it only if the inserted element is a bitcast of a subvector, no?

// insert_vector_elt V, (bitcast X from vector type), IdxC -->
// bitcast(shuffle (bitcast V), (extended X), Mask)

In one of the regressions (caused by reverting this patch) we get the following gMIR:

# *** IR Dump Before Legalizer (legalizer) ***:
# Machine code for function v_mad_mix_v2f32_clamp_postcvt_lo: IsSSA, TracksLiveness

bb.1 (%ir-block.0):
  liveins: $vgpr0, $vgpr1, $vgpr2
  %0:_(<2 x s16>) = COPY $vgpr0
  %1:_(<2 x s16>) = COPY $vgpr1
  %2:_(<2 x s16>) = COPY $vgpr2
  %3:_(<2 x s32>) = G_FPEXT %0:_(<2 x s16>)
  %4:_(<2 x s32>) = G_FPEXT %1:_(<2 x s16>)
  %5:_(<2 x s32>) = G_FPEXT %2:_(<2 x s16>)
  %6:_(<2 x s32>) = G_FMA %3:_, %4:_, %5:_
  %7:_(<2 x s16>) = G_FPTRUNC %6:_(<2 x s32>)
  %9:_(s32) = G_CONSTANT i32 0
  %8:_(s16) = G_EXTRACT_VECTOR_ELT %7:_(<2 x s16>), %9:_(s32)
  %10:_(s16) = G_FCONSTANT half 0xH0000
  %11:_(s16) = G_FMAXNUM %8:_, %10:_
  %12:_(s16) = G_FCONSTANT half 0xH3C00
  %13:_(s16) = G_FMINNUM %11:_, %12:_
  %14:_(<2 x s16>) = G_INSERT_VECTOR_ELT %7:_, %13:_(s16), %9:_(s32)
  $vgpr0 = COPY %14:_(<2 x s16>)
  SI_RETURN implicit $vgpr0

Ideally we need to get rid of both insert/extract vector elt and/or match them to BUILD_VECTOR.
The first extract should be easy enough, maybe worst case I'll need to add some new lowering logic to use bitcast + trunc for elt 0.

For the second case, how would lowering/combining look like? Do I do a BUILD_VECTOR from %13 & a undef, and then shuffle the result?
Do you think this is more fitting for a new custom lowering for INSERT_VECTOR_ELT (constrained to V2S16?) or for a pre-legal combine?

Pierre-vh added inline comments.Oct 3 2022, 1:08 AM
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160

Did a "quick and dirty" implementation in our legalizeInsertVectorElt (+ customFor({V2S16, S16}), and it seems to work like a charm for mad-mix tests (other tests have crashes because it's hacked-in for this specific case, ofc) - no regressions, only improvements (without this patch and on top of D134967).

const auto Undef = MRI.createGenericVirtualRegister(EltTy);
B.buildUndef(Undef);

const auto OtherVec = MRI.createGenericVirtualRegister(VecTy);
B.buildBuildVector(OtherVec, { Ins, Undef });

// 2 == Ins in OtherVec
// 0/1 = Original Elts
if(IdxVal == 0) {
  B.buildShuffleVector(Dst, Vec, OtherVec, {2, 1});
} else {
  assert(IdxVal == 1);
  B.buildShuffleVector(Dst, Vec, OtherVec, {0, 2});
}

Would it be fire to just implement it as custom lowering? Or should it be a pre-legalizer combine?

Also, what should the heuristics be for this transformation to apply (on top of the index being constant). Should we only do it for small vectors (<2 elt?), V2S16 only? If we do it all the time it'll just replace the current custom legalization (not sure it's what we want)

Pierre-vh added inline comments.Oct 3 2022, 5:14 AM
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160

For insert, after a bit more looking, I think it doesn't fit in the legalizer because we can only do this in some cases, and when we can't, we need to follow the normal legalizer flow (bitcast/clamp/lower), so a pre-legalizer combine would probably make more sense here.

For extract, it seems like we also need to change something. To get strictly better codegen in the mad-mix tests, I need to add:

getActionDefinitionsBuilder(G_EXTRACT_VECTOR_ELT)
  .customFor({{V2S16, S16}, {S16, V2S16}});

to the legalizer, otherwise it uses a series of shifts to lower the extract (even when it's index zero). Some improvements in the lowering logic and/or more combines may be needed I think

arsenm added inline comments.Oct 3 2022, 10:11 AM
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160

Yes, this is an optimization combine that doesn't belong in the legalizer. The only cases the legalizer should really be "optimizing" is in the rare case where we can totally change the lowering strategy based on the inputs.

Plus eventually we'll want to move towards a canonical vector form pre-legalizer for the benefit of other vector combines.

arsenm added inline comments.Oct 3 2022, 10:13 AM
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160

I'd expect extract_vector_element X:v2i16, 0 to fold into a truncate of bitcast. This could arguably be both a lowering and a combine

Pierre-vh abandoned this revision.Oct 4 2022, 12:01 AM