Diff Detail
Repository: rG LLVM Github Monorepo

Event Timeline
The trunc combine LGTM. The vector combine I'm not sure about
llvm/lib/Target/AMDGPU/AMDGPUPostLegalizerCombiner.cpp
309–320 ↗ (On Diff #463830): This part LGTM
322–326 ↗ (On Diff #463830): Can use replaceSingleDefInstWithReg
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160: This combine is too targeted (and I'm pretty sure it's redundant). The generic combiner already turns chains of G_INSERT_VECTOR_ELT with constant indices into G_BUILD_VECTOR.
209: We don't really want to have 16-bit unmerges. I'm not sure how this combine helps you with the current problem.
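The generic combine referred to above can be sketched with a toy model (plain C++, not the LLVM API; the function name and the newest-first chain ordering are illustrative choices): walking a chain of constant-index inserts from the last one backwards, the latest write to each lane wins, and a fully covered chain collapses to a single build_vector of those values.

```cpp
#include <array>
#include <cstdint>
#include <optional>
#include <vector>

// One G_INSERT_VECTOR_ELT in the chain: value written and its constant index.
struct Insert { uint16_t Val; unsigned Idx; };

// Collapse a chain of inserts into a <2 x s16> vector, given newest-first.
// A lane keeps the first value seen (i.e. the latest insert); lanes that
// stay empty would come from the chain's base vector.
std::array<std::optional<uint16_t>, 2>
collapseChain(const std::vector<Insert> &ChainNewestFirst) {
  std::array<std::optional<uint16_t>, 2> Lanes;
  for (const Insert &I : ChainNewestFirst)
    if (!Lanes[I.Idx]) // an older insert into this lane is shadowed
      Lanes[I.Idx] = I.Val;
  return Lanes;
}
```

For `insert(insert(insert(base, 3, 0), 9, 1), 7, 0)` both lanes are defined by the chain, so the whole thing is just `build_vector(7, 9)`.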
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160: This combine is the more important of the two, though.
209: It seems that in all of those cases, the unmerge ends up removed because the source value gets scalarized.
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160: I'd really like to keep this combine because it makes D134354 work just as well without needing to change the extract/insert vector element legalization rules. Perhaps I can add some code to check whether the operation that sources the vector is legal, and only do the combine when that source operation isn't legal (with the vector type)?
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160: It's more important to make progress on the vector handling strategy than trying to get mad_mix to work immediately. Most of these should work for general types, not just this one case.

matchCombineInsertVecElts is primarily looking for G_BUILD_VECTOR sources that it created itself, so the source not being a build_vector should not be an issue. It also looks weirdly limited; for example, the loop seems to not accept undef elements.

I've also been thinking it's likely a mistake to be doing vector combines in terms of the merge/unmerge artifacts. Vector operation combines should probably aim to use other vector operations, and leave the merge/unmerge for the legalizer.

Can you compare to what happens in the DAG? It looks to me like <2 x i16> insertelements are turned into shuffles by combineInsertEltToShuffle, which later lower into a build_vector of the extracts of the individual elements.

209: Because we don't really have 16-bit registers, an extract of 16-bit values implies a cast and shift of some kind. As I mentioned above, I also think using unmerge outside of the legalizer should probably be avoided.
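As a plain C++ illustration (not LLVM code) of why a 16-bit extract implies a cast and shift: a <2 x s16> value occupies a single 32-bit register, and this sketch assumes element 0 lives in the low 16 bits, which matches AMDGPU's packed layout.

```cpp
#include <cstdint>

// Extract one s16 element from a <2 x s16> held in a 32-bit register:
// element 1 needs a logical shift right by 16; both need a truncate.
uint16_t extractV2S16(uint32_t Packed, unsigned Idx) {
  uint32_t Shifted = Idx == 0 ? Packed : Packed >> 16; // shift for lane 1
  return static_cast<uint16_t>(Shifted);               // truncate to s16
}
```

For index 0 the shift disappears and only the truncate of the bitcast remains, which is the cheap case discussed later in the thread.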
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160: In the DAG:

insert_vector_elt -> vector_shuffle<2,1> -> BUILD_VECTOR # D:1 t69, t44

What would be a good way forward then if we remove this combine?
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160: Without a compelling reason to do something else, I'm inclined to start by following what the DAG is doing. That would mean:
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160: Probably also worth porting all the low-hanging fruit in visitINSERT_VECTOR_ELT/visitEXTRACT_VECTOR_ELT.
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160: The current DAG combine does it only if the inserted element is a bitcast of a subvector, no?

// insert_vector_elt V, (bitcast X from vector type), IdxC -->
// bitcast(shuffle (bitcast V), (extended X), Mask)

In one of the regressions (caused by reverting this patch) we get the following gMIR:

# *** IR Dump Before Legalizer (legalizer) ***:
# Machine code for function v_mad_mix_v2f32_clamp_postcvt_lo: IsSSA, TracksLiveness

bb.1 (%ir-block.0):
  liveins: $vgpr0, $vgpr1, $vgpr2
  %0:_(<2 x s16>) = COPY $vgpr0
  %1:_(<2 x s16>) = COPY $vgpr1
  %2:_(<2 x s16>) = COPY $vgpr2
  %3:_(<2 x s32>) = G_FPEXT %0:_(<2 x s16>)
  %4:_(<2 x s32>) = G_FPEXT %1:_(<2 x s16>)
  %5:_(<2 x s32>) = G_FPEXT %2:_(<2 x s16>)
  %6:_(<2 x s32>) = G_FMA %3:_, %4:_, %5:_
  %7:_(<2 x s16>) = G_FPTRUNC %6:_(<2 x s32>)
  %9:_(s32) = G_CONSTANT i32 0
  %8:_(s16) = G_EXTRACT_VECTOR_ELT %7:_(<2 x s16>), %9:_(s32)
  %10:_(s16) = G_FCONSTANT half 0xH0000
  %11:_(s16) = G_FMAXNUM %8:_, %10:_
  %12:_(s16) = G_FCONSTANT half 0xH3C00
  %13:_(s16) = G_FMINNUM %11:_, %12:_
  %14:_(<2 x s16>) = G_INSERT_VECTOR_ELT %7:_, %13:_(s16), %9:_(s32)
  $vgpr0 = COPY %14:_(<2 x s16>)
  SI_RETURN implicit $vgpr0

Ideally we need to get rid of both insert/extract vector elt and/or match them to BUILD_VECTOR. For the second case, how would the lowering/combining look? Do I do a BUILD_VECTOR from %13 and an undef, and then shuffle the result?
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160: Did a "quick and dirty" implementation in our legalizeInsertVectorElt (+ customFor({V2S16, S16})), and it seems to work like a charm for the mad-mix tests (other tests have crashes because it's hacked-in for this specific case, of course) - no regressions, only improvements (without this patch and on top of D134967).

const auto Undef = MRI.createGenericVirtualRegister(EltTy);
B.buildUndef(Undef);
const auto OtherVec = MRI.createGenericVirtualRegister(VecTy);
B.buildBuildVector(OtherVec, {Ins, Undef});
// 2 == Ins in OtherVec
// 0/1 == original elements
if (IdxVal == 0) {
  B.buildShuffleVector(Dst, Vec, OtherVec, {2, 1});
} else {
  assert(IdxVal == 1);
  B.buildShuffleVector(Dst, Vec, OtherVec, {0, 2});
}

Would it be fine to just implement it as custom lowering? Or should it be a pre-legalizer combine? Also, what should the heuristics be for this transformation to apply (on top of the index being constant)? Should we only do it for small vectors (<2 elt?), or V2S16 only? If we do it all the time it'll just replace the current custom legalization (not sure that's what we want).
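As a sanity check on the shuffle masks used in that snippet, here is a toy C++ model (not the LLVM API) of a two-source <2 x s16> shuffle: mask values 0-1 select from the first operand, 2-3 from the second. With the second operand being {Ins, undef}, mask value 2 selects the inserted element.

```cpp
#include <array>
#include <cstdint>

using V2 = std::array<uint16_t, 2>;

// Toy model of G_SHUFFLE_VECTOR over two <2 x s16> operands.
// Result lane i takes A[Mask[i]] if Mask[i] < 2, else B[Mask[i] - 2].
V2 shuffle(V2 A, V2 B, std::array<int, 2> Mask) {
  V2 R{};
  for (int i = 0; i < 2; ++i)
    R[i] = Mask[i] < 2 ? A[Mask[i]] : B[Mask[i] - 2];
  return R;
}
```

With B = {Ins, undef}, mask {2, 1} yields {Ins, A[1]} (insert at index 0) and mask {0, 2} yields {A[0], Ins} (insert at index 1), matching the two branches in the snippet above.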
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160: For insert, after a bit more looking, I think it doesn't fit in the legalizer, because we can only do this in some cases, and when we can't, we need to follow the normal legalizer flow (bitcast/clamp/lower), so a pre-legalizer combine would probably make more sense here.

For extract, it seems like we also need to change something. To get strictly better codegen in the mad-mix tests, I need to add:

getActionDefinitionsBuilder(G_EXTRACT_VECTOR_ELT)
    .customFor({{V2S16, S16}, {S16, V2S16}});

to the legalizer; otherwise it uses a series of shifts to lower the extract (even when the index is zero). Some improvements in the lowering logic and/or more combines may be needed, I think.
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160: Yes, this is an optimization combine that doesn't belong in the legalizer. The only case where the legalizer should really be "optimizing" is the rare one where we can totally change the lowering strategy based on the inputs. Plus, eventually we'll want to move towards a canonical vector form pre-legalizer for the benefit of other vector combines.
llvm/lib/Target/AMDGPU/AMDGPUPreLegalizerCombiner.cpp
160: I'd expect extract_vector_element X:v2i16, 0 to fold into a truncate of a bitcast. This could arguably be both a lowering and a combine.