Page MenuHomePhabricator

[AMDGPU][GISel] Legalize V2S16 G_BUILD_VECTOR
ClosedPublic

Authored by Pierre-vh on Sep 22 2022, 5:45 AM.

Details

Summary

Preparation patch for D134354 to make V2S16 G_BUILD_VECTOR legal.
Also removes RegBankInfo's scalarization of small BUILD_VECTORs,
replacing it with InstructionSelector logic instead.

This allows for V2S16 BUILD_VECTOR instructions to survive
all the way to ISel so we can select FMA/MAD_MIX instructions
in D134354.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
Pierre-vh requested review of this revision.Sep 22 2022, 5:45 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 22 2022, 5:45 AM
foad added a comment.Sep 22 2022, 6:20 AM

Looks like there are some code quality regressions to address?

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
1534–1535

Isn't this redundant with line 1529?

llvm/test/CodeGen/AMDGPU/GlobalISel/andn2.ll
521–522 ↗(On Diff #462152)

Regression here.

llvm/test/CodeGen/AMDGPU/GlobalISel/clamp-minmax-const-combine.ll
49–51 ↗(On Diff #462152)

Regression here.

Pierre-vh updated this revision to Diff 462165.Sep 22 2022, 6:37 AM
Pierre-vh marked 3 inline comments as done.

Thank you for the very quick review, indeed the patch isn't perfect yet (hence why I didn't add reviewers yet :) )
I fixed the regressions you pointed out, there's probably a lot more since the patch is
pretty large so I'll do a few more passes over the diff before I add reviewers.

arsenm added inline comments.Sep 23 2022, 5:48 AM
llvm/lib/Target/AMDGPU/AMDGPUCombine.td
91–95 ↗(On Diff #462152)

This combine can be a separate patch

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
729

This shouldn't change

1476–1493

This mostly looks like unwelcome clang-format changes. I think the only thing that really changed here was the S32->S16?

llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
341–342 ↗(On Diff #462152)

In the future we may want to handle <2 x i32> and <4 x i16>, at least for gfx940

368–369 ↗(On Diff #462152)

Probably should have an asserting getRegBank to go along with getRegClass instead of littering asserts like this

373 ↗(On Diff #462152)

You can do auto Const = B.buildConstant(S32,...); MRI.setRegBank(Const.getReg(0))

Pierre-vh updated this revision to Diff 462832.Sep 26 2022, 1:03 AM
Pierre-vh marked 6 inline comments as done.

Update, address comments

arsenm added inline comments.Sep 26 2022, 7:21 AM
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
635–638

We don't want to die on AGPR vectors. Should return false if not handling them for now

642–644

We could technically do this in tablegen. Not sure why this was manual in the DAG path

687

D134463 switches to using v_perm_b32 here. Most everything in this function should be handled by tablegen though

Pierre-vh marked an inline comment as done.Sep 26 2022, 7:44 AM

For this to land, are there things I must migrate to TableGen or is the diff in good shape and just needs a bit of polishing ?

llvm/lib/Target/AMDGPU/AMDGPUCombine.td
91–95 ↗(On Diff #462152)

Actually it was a leftover from a previous attempt. I totally removed the combine now and refactored the InstructionSelection logic to fix the testcases the combine intended to fix. Seems more stable/cleaner.

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
642–644

I think there are some annoying cases like a copy in-between the constant and the build_vector that make it annoying to handle in TableGen

Is it fine to leave this in Cpp or do I need to migrate this to TableGen for this to land? The diff is already quite large and I wanted to avoid moving things between Cpp/TableGen to avoid adding more complexity to it

687

Do you mean that D134463 will make this code path obsolete/dead? Do I need to rebase on top of it?
My understanding is that this won't interfere with D134463 as that diff adds tablegen patterns (which are matched above) and this code path only triggers when tablegen doesn't match

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
729

This shouldn't change

Without this, we get the following IR post-legalization:

`
  %86:_(<2 x s16>) = G_FCANONICALIZE %126:_
  %87:_(<2 x s16>) = G_FCANONICALIZE %124:_
  %82:_(<2 x s16>) = G_FMAXNUM_IEEE %86:_, %87:_
  %84:_(<2 x s16>) = G_FCANONICALIZE %127:_
  %85:_(<2 x s16>) = G_FCANONICALIZE %125:_
  %83:_(<2 x s16>) = G_FMAXNUM_IEEE %84:_, %85:_
  %151:_(s32) = G_BITCAST %82:_(<2 x s16>)
  %74:_(s16) = G_TRUNC %151:_(s32)
  %152:_(s32) = G_LSHR %151:_, %135:_(s32)
  %75:_(s16) = G_TRUNC %152:_(s32)
  %153:_(s32) = G_BITCAST %83:_(<2 x s16>)
  %76:_(s16) = G_TRUNC %153:_(s32)
  %130:_(<2 x s16>) = G_BUILD_VECTOR %74:_(s16), %75:_(s16)
  %131:_(<2 x s16>) = G_BUILD_VECTOR %76:_(s16), %40:_(s16)
  %128:_(<2 x s16>) = G_BUILD_VECTOR %33:_(s16), %33:_(s16)
  %129:_(<2 x s16>) = G_BUILD_VECTOR %33:_(s16), %40:_(s16)
  %63:_(<2 x s16>) = G_FCANONICALIZE %130:_
  %64:_(<2 x s16>) = G_FCANONICALIZE %128:_
  %59:_(<2 x s16>) = G_FMINNUM_IEEE %63:_, %64:_
  %61:_(<2 x s16>) = G_FCANONICALIZE %131:_
  %62:_(<2 x s16>) = G_FCANONICALIZE %129:_
  %60:_(<2 x s16>) = G_FMINNUM_IEEE %61:_, %62:_

in the following test:

; GCN-LABEL: {{^}}v_mad_mix_v3f32_clamp_postcvt:
; GCN: s_waitcnt
; GFX900-DAG: v_mad_mixlo_f16 v{{[0-9]+}}, v0, v2, v4 op_sel_hi:[1,1,1] clamp
; GFX900-DAG: v_mad_mixhi_f16 v{{[0-9]+}}, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
; GFX900-DAG: v_mad_mixlo_f16 v{{[0-9]+}}, v1, v3, v5 op_sel_hi:[1,1,1] clamp

; GFX906-DAG: v_fma_mixlo_f16 v{{[0-9]+}}, v0, v2, v4 op_sel_hi:[1,1,1] clamp
; GFX906-DAG: v_fma_mixhi_f16 v{{[0-9]+}}, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
; GFX906-DAG: v_fma_mixlo_f16 v{{[0-9]+}}, v1, v3, v5 op_sel_hi:[1,1,1] clamp

; GFX9: v_mov_b32_e32 v0, v{{[0-9]+}}
; GFX9-NEXT: s_setpc_b64
define <3 x half> @v_mad_mix_v3f32_clamp_postcvt(<3 x half> %src0, <3 x half> %src1, <3 x half> %src2) #0 {
  %src0.ext = fpext <3 x half> %src0 to <3 x float>
  %src1.ext = fpext <3 x half> %src1 to <3 x float>
  %src2.ext = fpext <3 x half> %src2 to <3 x float>
  %result = tail call <3 x float> @llvm.fmuladd.v3f32(<3 x float> %src0.ext, <3 x float> %src1.ext, <3 x float> %src2.ext)
  %cvt.result = fptrunc <3 x float> %result to <3 x half>
  %max = call <3 x half> @llvm.maxnum.v3f16(<3 x half> %cvt.result, <3 x half> zeroinitializer)
  %clamp = call <3 x half> @llvm.minnum.v3f16(<3 x half> %max, <3 x half> <half 1.0, half 1.0, half 1.0>)
  ret <3 x half> %clamp
}

I can see three possibilities why:

  • The legalizer rule should go (easiest, it's why I did that for now)
  • The FPMinMadToClamp combine should be pre-legalizer instead of post-regbankcombiner
  • This test is wrong/shouldn't fold

What do you prefer?

Pierre-vh updated this revision to Diff 462904.Sep 26 2022, 7:46 AM

(Some) comments

foad added inline comments.Sep 27 2022, 3:32 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i16.ll
9 ↗(On Diff #463112)

Using buffer_store/load instead of shifts for non-constant indices is a big regression.

Pierre-vh planned changes to this revision.Sep 27 2022, 6:57 AM
Pierre-vh added inline comments.Sep 27 2022, 7:48 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i16.ll
9 ↗(On Diff #463112)

This seems to be coming from the change in the legalizer of G_EXTRACT_VECTOR_ELT at line 1445. Reverting it fixes it.
It's because with that change, it gets lowered to the stack, so a G_STORE is emitted and it causes what we're seeing here.

However if I revert the change, it becomes impossible to match mad_mix in v_mad_mix_v2f32_clamp_postcvt_lo, its G_INSERT_VECTOR_ELT/G_EXTRACT_VECTOR_ELT get lowered to the point that we can't even match the CLAMP.

Perhaps I can add a condition to not bitcast if it's a 2x16 vector (vectorWiderThan(VecTypeIdx, 32)) ? It works, but seems like a bandaid fix though, no?
@arsenm thoughts?

foad added inline comments.Sep 27 2022, 8:19 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i16.ll
9 ↗(On Diff #463112)

I would hope that it only gets lowered to memory accesses if the index is non-constant, and for matching mad_mix you are only interested in constant indices. Does that give you a way to fix both cases?

Pierre-vh added inline comments.Sep 27 2022, 10:02 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i16.ll
9 ↗(On Diff #463112)

LegalityQuery does not seem to contain any information about whether the indice is constant unfortunately, so it doesn't look like I can make that distinction in the legalizer unless I rewrite some of the legalization logic in a custom handler. I'd like to get @arsenm 's input first before implementing that though

I would hope that it only gets lowered to memory accesses if the index is non-constant

That seems a bit orthogonal to the current issue though, the legalizer currently doesn't make a distinction based on the const-ness of the indice (I think) so maybe it's best to not start taking it into account here, as it may have other implications. I feel like the vectorWiderThan(VecTypeIdx, 32) solution would be fine for now even if it's a bit hacky
I'll take another look at the situation tomorrow and see if I can come up with another fix

Pierre-vh updated this revision to Diff 463519.Sep 28 2022, 6:09 AM

I reverted the extract_element legalizer changes because it's too sensitive to changes. I tried to do custom legalization or just applying the new rule to v2s16, but it still leaves huge regressions.
It messes up one test case in D134354 (v_mad_mix_v2f32_clamp_postcvt_lo). The changes are pretty bad and I'm not sure it's doable to pattern-match the MIR to get the same result as the DAG (I'll keep looking into it if I have time though)

arsenm added inline comments.Sep 28 2022, 7:05 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i16.ll
9 ↗(On Diff #463112)

Legality cannot be context dependent. You can't have some cases be legal or not based on regular value operands.

I think treating the legalization different isn't the right strategy. In the DAG path we have combines to turn 16-bit vector operations into 32-bit vector operations with some shifts to extract of the elements. With optimizations, you wouldn't really ever use the 16->32 vector path

Pierre-vh added inline comments.Sep 28 2022, 7:12 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i16.ll
9 ↗(On Diff #463112)

So the latest changes I made to the legalizer are good, and the remaining changes need to be in a pre-legalizer combine for V2S16 INSERT/EXTRACT_VECTOR_ELT ?

arsenm added inline comments.Sep 28 2022, 7:40 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i16.ll
9 ↗(On Diff #463112)

So, the combine I was thinking of seems to be already done as the bitcast lowering for packed vectors (as is done in the old code). You should continue to use the bitcast lowering for vectors. You only want G_BUILD_VECTOR to be legal specifically for <2 x s16>

Pierre-vh added inline comments.Sep 28 2022, 10:15 AM
llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i16.ll
9 ↗(On Diff #463112)

G_BUILD_VECTOR legalization isn't a pain point for now, the current rules seem to be working but I can surely restrict it to 2x16 only.
The current problem is with G_EXTRACT_VECTOR_ELT/G_INSERT_VECTOR_ELT, for matching the mad_mix stuff they need to be lowered (.lowerFor({V2S16, S16}) does the trick), but it introduces huge regressions in insertelement.i16 where some bitwise ops become stack operations with load/stores/etc. What can be done there?

arsenm added inline comments.Sep 28 2022, 12:50 PM
llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i16.ll
9 ↗(On Diff #463112)

Yes, lower for these uses the stack (We could also add a compare and select path). The 16-bit vectors should continue to use the bitcast lowering (I'm not actually sure why you are touching these)

Petar.Avramovic added inline comments.
llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i16.ll
9 ↗(On Diff #463112)

There are too many changes in this patch. I am stuck on legalizer changes.
Maybe something from this stack of patches D109242 would be helpful.
Can you add test with G_EXTRACT_VECTOR_ELT/G_INSERT_VECTOR_ELT you mentioned and how you would like that to look like?

@Petar.Avramovic In the latest diffs, I don't change legalization rules from G_EXTRACT/INSERT_VECTOR_ELEMENT anymore, so it doesn't affect this patch
Instead I started D134870 which adds combines to achieve the fma/mad_mix selection in D134354

You don't want (<2 x s16>) but only s16 G_FMAXNUM_IEEEwhen lowering larger vectors, also you remove G_BUILD_VECTOR_TRUNC but use G_BUILD_VECTOR instead.
Can you point me to the test case that requires those changes? (You didn't change legalization rules for other <2 x s16> instructions).
To be more precise what D134354 requires from Legalizer (and pre/post legalizer combiner). Maybe there is simpler way to achieve it.

arsenm added inline comments.Sep 29 2022, 9:13 AM
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
626–630

Theoretically the type checks should be unnecessary. If the legality rules are correct the verification should reject the illegally typed operations

674–675

Should look into dropping this next

687

All of this code should really go through the same tablegen patterns. A future change should try to get rid of this custom code. The only custom selected case in the DAG path is for constants (and even that could be moved to tablegen)

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
729

The goal is to produce the packed minnum/maxnum. The mess around it is for other combines

You don't want (<2 x s16>) but only s16 G_FMAXNUM_IEEEwhen lowering larger vectors, also you remove G_BUILD_VECTOR_TRUNC but use G_BUILD_VECTOR instead.
Can you point me to the test case that requires those changes? (You didn't change legalization rules for other <2 x s16> instructions).
To be more precise what D134354 requires from Legalizer (and pre/post legalizer combiner). Maybe there is simpler way to achieve it.

D134354 requires G_BUILD_VECTOR V2S16 to be legal, we tried other options first that didn't involve legalizer changes but we discussed with @arsenm and concluded that legalizer changes were needed to support G_BUILD_VECTOR V2S16, which is what this diff is doing.
The G_FMINNUM change seems to be needed else the following:

; GISEL-GFX900-LABEL: v_mad_mix_v3f32_clamp_postcvt:
; GISEL-GFX900:       ; %bb.0:
; GISEL-GFX900-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GISEL-GFX900-NEXT:    v_mad_mixlo_f16 v6, v0, v2, v4 op_sel_hi:[1,1,1] clamp
; GISEL-GFX900-NEXT:    v_mad_mixhi_f16 v6, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
; GISEL-GFX900-NEXT:    v_mad_mixlo_f16 v1, v1, v3, v5 op_sel_hi:[1,1,1] clamp
; GISEL-GFX900-NEXT:    v_mov_b32_e32 v0, v6
; GISEL-GFX900-NEXT:    s_setpc_b64 s[30:31]

becomes

; GISEL-GFX900-LABEL: v_mad_mix_v3f32_clamp_postcvt:
; GISEL-GFX900:       ; %bb.0:
; GISEL-GFX900-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
; GISEL-GFX900-NEXT:    v_mad_mixlo_f16 v6, v0, v2, v4 op_sel_hi:[1,1,1]
; GISEL-GFX900-NEXT:    v_mad_mixhi_f16 v6, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1]
; GISEL-GFX900-NEXT:    v_pk_max_f16 v0, v6, 0
; GISEL-GFX900-NEXT:    v_lshrrev_b32_e32 v2, 16, v0
; GISEL-GFX900-NEXT:    v_and_b32_e32 v0, 0xffff, v0
; GISEL-GFX900-NEXT:    v_mad_mixlo_f16 v1, v1, v3, v5 op_sel_hi:[1,1,1]
; GISEL-GFX900-NEXT:    v_lshl_or_b32 v0, v2, 16, v0
; GISEL-GFX900-NEXT:    v_pk_max_f16 v1, v1, v1
; GISEL-GFX900-NEXT:    v_pk_max_f16 v0, v0, v0
; GISEL-GFX900-NEXT:    v_pk_max_f16 v1, v1, 0
; GISEL-GFX900-NEXT:    v_pk_min_f16 v0, v0, 1.0 op_sel_hi:[1,0]
; GISEL-GFX900-NEXT:    v_pk_max_f16 v1, v1, v1
; GISEL-GFX900-NEXT:    v_lshrrev_b32_e32 v2, 16, v0
; GISEL-GFX900-NEXT:    v_and_b32_e32 v0, 0xffff, v0
; GISEL-GFX900-NEXT:    v_pk_min_f16 v1, v1, 1.0
; GISEL-GFX900-NEXT:    v_lshl_or_b32 v0, v2, 16, v0
; GISEL-GFX900-NEXT:    s_setpc_b64 s[30:31]
`

If it's really a blocker ( @arsenm ?) I can take another look at it but it seems like leaving it in causes a lot of trouble

Pierre-vh added inline comments.Sep 29 2022, 9:19 AM
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
674–675

Do you mean moving it into a combine? In this patch or a future patch?

687

Is it fine to keep as-is for now, or do I need to move it all to tablegen for this to land?
I don't mind going back to it later but currently there's a lot of diffs open for mad_mix and it's becoming more and more difficult to differentiate essential changes from "nice improvements". If the current version is acceptable I'd rather keep it for now, is that ok?

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
729

I'm not sure I understand, is the legalizer change wrong and I need to look into adding more combines to remove the extra instructions?
What kind of combine can be done there? I don't see anything obvious

arsenm added inline comments.Sep 29 2022, 9:31 AM
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
687

Yes, moving to tablegen is a separate change for later

llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
729

<2 x s16> minnum/maxnum are legal with VOP3P and the legalizer rules should express this and try to legalize wider vectors to <2 x s16> pieces. The MIR here does correctly produce the <2 x s16> operations. The additional context in this particular test isn't relevant to what the legalizer rules here should be (I'm not seeing what the problem is here, other than the legalized MIR has some vector conversion mess)

G_BUILD_VECTOR (TRUNC (BITCAST x)), (TRUNC (LSHR (BITCAST x), 16)) is an identity that can fold out

Does D134354 work for non vector cases?
My best guess is that it does not work for vectors of two elements where it would need to create two instructions (missing some build vector combine/legalizer rule). You would expect that all those build vectors in between are gone after legalizer?

Just to point out most notable regression:

define amdgpu_ps <3 x half> @min3(<3 x half> %src0, <3 x half> %src1) {
  %min3 = call <3 x half> @llvm.minnum.v3f16(<3 x half> %src0, <3 x half> %src1)
  ret <3 x half> %min3
}

declare <3 x half> @llvm.minnum.v3f16(<3 x half>, <3 x half>)

goes from

	v_pk_min_f16 v0, v0, v2
	v_pk_min_f16 v1, v1, v3

to

	v_min_f16_e32 v4, v0, v2
	v_min_f16_sdwa v0, v0, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1 src1_sel:WORD_1
	v_min_f16_e32 v1, v1, v3
	v_lshl_or_b32 v0, v0, 16, v4

D134354 does mad_mix combine for straightforward test cases same as sdag. The test you mentioned here v_mad_mix_v3f32_clamp_postcvt
does the mad_mix selection in the same way in both cases, the difference you pointed out is clamp that failed to combine, i.e you don't need this patch in order to select mad_mix.

to get same result as sdag you will need:
combine for clamp for v2f16 where one element is undef
maybe to not lower build_vector_trunc in regbankselect so that it is easier to look through, it is easy to select.

Overall it is probably best to go with D134354 first and then optimize code.

Pierre-vh planned changes to this revision.Sep 30 2022, 3:44 AM
Pierre-vh updated this revision to Diff 464224.Sep 30 2022, 5:21 AM

Remove G_FMIN/MAXNUM legalizer changes.
This now only touches build_vector. I renamed the diff to reflect that.

Pierre-vh retitled this revision from [AMDGPU][GISel] Enable Matching of V2S16 G_BUILD_VECTOR to [AMDGPU][GISel] Legalize V2S16 G_BUILD_VECTOR.Sep 30 2022, 5:21 AM
Pierre-vh edited the summary of this revision. (Show Details)
arsenm accepted this revision.Sep 30 2022, 6:10 AM

LGTM

This revision is now accepted and ready to land.Sep 30 2022, 6:10 AM
This revision was landed with ongoing or failed builds.Sep 30 2022, 7:05 AM
This revision was automatically updated to reflect the committed changes.