This is an archive of the discontinued LLVM Phabricator instance.

llvm/test/CodeGen/AMDGPU/GlobalISel/mad-mix.ll
2–5	I'd work to eliminate those differences. If it turns out to be difficult, I would probably switch to generated checks and have them share the same file

Pierre-vh mentioned this in D134433: [AMDGPU][GISel] Legalize V2S16 G_BUILD_VECTOR.Sep 22 2022, 5:45 AM

Pierre-vh added a parent revision: D134433: [AMDGPU][GISel] Legalize V2S16 G_BUILD_VECTOR.Sep 22 2022, 5:45 AM

Rebase on D134433, address some comments
I still have some work left to do but I think the majority can already be reviewed

Harbormaster completed remote builds in B188171: Diff 462170.Sep 22 2022, 6:59 AM

Pierre-vh added inline comments.Sep 22 2022, 7:01 AM

llvm/test/CodeGen/AMDGPU/GlobalISel/mad-mix.ll
2–5	The remaining differences fall into the following categories:
- isCanonicalized behaves differently between GISel and the DAG, so there's v_mac instead of v_mad in a few places since SIFoldOperand doesn't fold. There's also v_madak_f32 that's no longer present; I think it's the same cause but I haven't looked into it yet.
- op_sel is on the second v_mad_mix instead of the first, which seems like a harmless difference due to how DAG/GISel work?
- A G_LSHR + G_SHUFFLEVECTOR (1,0) pair isn't folded out. I think those operations negate each other, perhaps a combine should be added for that?
- Some unfinished things like -|v0| not being picked up yet.
The last 2 definitely have to be fixed, I'll look into them ASAP, but are the first 2 important as well? I'm not sure what to do with `isCanonicalized`: is there a place where I can find the list of operations that should go in there?
llvm/utils/TableGen/GlobalISelEmitter.cpp
2526–2530	Actually it isn't, it looks like the FMA/MAD patterns expose a bug in GISel. Without that there's a crash (segfault) in `executeMatchTable` because the number of renderer fns is incorrectly reported and it doesn't allocate enough entries in the vector that holds them. It seems like we rarely go above 2 renderers but here there's 4 IIRC

Pierre-vh added inline comments.Sep 23 2022, 12:13 AM

llvm/test/CodeGen/AMDGPU/GlobalISel/mad-mix.ll

2–5

For -|v0|, the gMIR looks like this:

bb.1 (%ir-block.0):
  liveins: $vgpr0, $vgpr1, $vgpr2
  %0:_(s32) = COPY $vgpr0
  %3:_(s32) = COPY $vgpr1
  %1:_(s16) = G_TRUNC %3:_(s32)
  %4:_(s32) = COPY $vgpr2
  %2:_(s16) = G_TRUNC %4:_(s32)
  %8:_(s16) = G_FCONSTANT half 0xH8000
  %7:_(<2 x s16>) = G_BUILD_VECTOR %8:_(s16), %8:_(s16)
  %5:_(<2 x s16>) = G_BITCAST %0:_(s32)
  %6:_(<2 x s16>) = G_FABS %5:_
  %18:_(<2 x s16>) = G_FNEG %6:_
  %9:_(<2 x s16>) = G_FADD %7:_, %18:_
  %19:_(s32) = G_BITCAST %9:_(<2 x s16>)
  %20:_(s32) = G_CONSTANT i32 16
  %21:_(s32) = G_LSHR %19:_, %20:_(s32)
  %17:_(s16) = G_TRUNC %21:_(s32)
  %12:_(s32) = G_FPEXT %17:_(s16)
  %13:_(s32) = G_FPEXT %1:_(s16)
  %14:_(s32) = G_FPEXT %2:_(s16)
  %15:_(s32) = G_FMA %12:_, %13:_, %14:_
  $vgpr0 = COPY %15:_(s32)
  SI_RETURN implicit $vgpr0

Could we add a combine to fold G_FADD (+-)0.0, x into just x?
If we add that and another one to fold G_LSHR + G_SHUFFLEVECTOR (1,0), it should address most of the remaining differences.
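As a rough illustration of why a (1,0) shuffle followed by a shift might cancel out, the sketch below models a <2 x s16> value packed into 32 bits, with element 0 in the low half. This is only one reading of the G_LSHR + G_SHUFFLEVECTOR pair mentioned above; the helper names are hypothetical and are not LLVM APIs.

```cpp
#include <cassert>
#include <cstdint>

// A <2 x s16> vector modelled as a packed 32-bit value (element 0 = low half).
// All names here are hypothetical helpers for illustration.

// Shuffle with mask (1,0): swap the two 16-bit elements.
constexpr uint32_t shuffle10(uint32_t V) { return (V << 16) | (V >> 16); }

// G_LSHR by 16 followed by G_TRUNC to s16: extract the high element.
constexpr uint16_t highHalf(uint32_t V) { return static_cast<uint16_t>(V >> 16); }

// Plain G_TRUNC to s16: extract the low element.
constexpr uint16_t lowHalf(uint32_t V) { return static_cast<uint16_t>(V); }

// Swapping and then taking the high element is the same as taking the
// low element of the original value, so the pair can be folded away.
static_assert(highHalf(shuffle10(0x12345678u)) == lowHalf(0x12345678u),
              "shuffle (1,0) + lshr 16 reads the original low element");
```

Under this reading, a combine would replace the shuffle and shift with a direct truncate of the original value.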

arsenm added inline comments.Sep 23 2022, 6:05 AM

llvm/test/CodeGen/AMDGPU/GlobalISel/mad-mix.ll
2–5	fadd 0 can only be folded out if you don't care about signed zeros and don't care about canonicalizing (I think the existing DAG combine fails to consider this second point). Which test is this in? I don't see how this would have ever been able to fold into an fmax_mix operand. We should have the shift + shuffle combine. isCanonicalized is essentially a list of opcodes with floating point semantics.
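The signed-zero half of that point can be checked on the host with a small sketch. It assumes IEEE-754 floats and the default round-to-nearest mode; `bitsOf` and `foldChangesResult` are hypothetical helper names. It does not model the canonicalization concern (quieting signaling NaNs, flushing denormals), which is a separate reason the fold may be invalid.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Bit pattern of a float, so that +0.0 and -0.0 can be told apart.
inline uint32_t bitsOf(float F) {
  uint32_t U;
  std::memcpy(&U, &F, sizeof(U));
  return U;
}

// Returns true when replacing `Zero + X` with `X` would change the result.
inline bool foldChangesResult(float Zero, float X) {
  // volatile keeps the compiler from folding the addition itself.
  volatile float A = Zero;
  volatile float B = X;
  float Sum = A + B;
  return bitsOf(Sum) != bitsOf(X);
}
```

With these helpers, `foldChangesResult(0.0f, -0.0f)` is true because (+0.0) + (-0.0) is +0.0, whereas adding -0.0 (the 0xH8000 constant in the gMIR above) leaves both zeros unchanged.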

Pierre-vh added inline comments.Sep 26 2022, 2:27 AM

llvm/test/CodeGen/AMDGPU/GlobalISel/mad-mix.ll

2–5

the gMIR was from v_mad_mix_f32_preextractfabsfneg_f16hi_f16lo_f16lo. I'll remove the comment saying it should be -|v0| then.

For isCanonicalized, do I just need to add opcodes like in the DAG version? e.g.:

switch (Opcode) {
case AMDGPU::G_FADD:
case AMDGPU::G_FSUB:
case AMDGPU::G_FMUL:
case AMDGPU::G_FMA:
case AMDGPU::G_FMAD:
case AMDGPU::G_FDIV:
case AMDGPU::G_FREM:
case AMDGPU::G_FPOW:
case AMDGPU::G_FPEXT:
case AMDGPU::G_FPTRUNC:
  return true;
case AMDGPU::G_FNEG:
case AMDGPU::G_FMINNUM_IEEE:
case AMDGPU::G_FMAXNUM_IEEE:

If yes, I get this result in one of the tests:

v_mad_mixlo_f16_f16lo_f16lo_f32_clamp_pre_cvt: ; @v_mad_mixlo_f16_f16lo_f16lo_f32_clamp_pre_cvt
; %bb.0:
	s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
	v_cvt_f32_f16_e32 v0, v0
	v_cvt_f32_f16_e32 v1, v1
	v_mac_f32_e32 v2, v0, v1
	v_med3_f32 v0, v2, 0, 1.0
	v_cvt_f16_f32_e32 v0, v0
	s_setpc_b64 s[30:31]

Compared to this for the DAG:

; v_mad_f32 v0, v0, v1, v2 clamp
; v_cvt_f16_f32_e32 v0, v0
; v_cvt_f32_f16_e32 v0, v0

Which is a really big difference.

Pierre-vh edited parent revisions, added: D134635: [AMDGPU][GlobalISel] Add Shift/Shufflevector Combine; removed: D134433: [AMDGPU][GISel] Legalize V2S16 G_BUILD_VECTOR.Sep 26 2022, 3:55 AM

Rebase, update tests/uncomment broken test

Harbormaster completed remote builds in B188681: Diff 462872.Sep 26 2022, 6:46 AM

arsenm added inline comments.Sep 26 2022, 7:40 AM

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
580	eraseFromParent
3735–3747	This should definitely be a combine and should not be in the selector. This should only directly interpret fabs and fneg
llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp
248	Don't see what happened here?
260	Leftover debugging

Comments

Harbormaster completed remote builds in B188705: Diff 462903.Sep 26 2022, 8:50 AM

Address comments
Fix failing tablegen test
Change tests to use fneg instead of fsub, as discussed.
- v_mad_mix_f32_preextractfabsfneg_f16hi_f16lo_f16lo now uses -|v0| because of it, some other tests also use -v0 instead now.
Update Value Tracking functions to fix some additional test cases

foad added inline comments.Sep 27 2022, 3:48 AM

llvm/lib/CodeGen/GlobalISel/Utils.cpp
665–667 ↗	(On Diff #463165)	This isn't true. You can get a NaN result even if none of the inputs are NaNs, e.g. from +inf + -inf.
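A minimal host-side sketch of that counterexample, assuming IEEE-754 arithmetic without fast-math flags. The helper names are hypothetical; the point is only that "no NaN inputs" does not imply "no NaN result" for add or fma.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

const float Inf = std::numeric_limits<float>::infinity();

// True when A + B is NaN although neither input is NaN.
inline bool addMakesNaNFromNonNaN(float A, float B) {
  return !std::isnan(A) && !std::isnan(B) && std::isnan(A + B);
}

// True when fma(A, B, C) is NaN although none of the inputs is NaN.
inline bool fmaMakesNaNFromNonNaN(float A, float B, float C) {
  return !std::isnan(A) && !std::isnan(B) && !std::isnan(C) &&
         std::isnan(std::fma(A, B, C));
}
```

Here `addMakesNaNFromNonNaN(Inf, -Inf)` and `fmaMakesNaNFromNonNaN(Inf, 0.0f, 1.0f)` are both true, so checking only the operands is not sufficient for these opcodes.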

Pierre-vh added inline comments.Sep 27 2022, 3:52 AM

llvm/lib/CodeGen/GlobalISel/Utils.cpp

665–667 ↗

(On Diff #463165)

I just copied the SDag implementation so that one would be wrong too then

case ISD::FMA:
case ISD::FMAD: {
  if (SNaN)
    return true;
  return isKnownNeverNaN(Op.getOperand(0), SNaN, Depth + 1) &&
         isKnownNeverNaN(Op.getOperand(1), SNaN, Depth + 1) &&
         isKnownNeverNaN(Op.getOperand(2), SNaN, Depth + 1);
}

What is the alternative? Can this function not handle FMA/FMAD at all then?

foad added inline comments.Sep 27 2022, 3:55 AM

llvm/lib/CodeGen/GlobalISel/Utils.cpp
665–667 ↗	(On Diff #463165)	D50804 fixed the sdag implementation for fadd/fmul etc but apparently not for fma/fmad :(

Harbormaster completed remote builds in B188904: Diff 463165.Sep 27 2022, 4:12 AM

Fix tests affected by isCanonicalized/isKnownNeverNaN changes
I minimized the addition to those functions to minimize the amount of tests impacted.
They both need more love + the FADD to FNEG combine would be nice too, but not in this patch.

In D134354#3817591, @Pierre-vh wrote:

Change tests to use fneg instead of fsub, as discussed.

This should be done as a separate precommit

In D134354#3818053, @arsenm wrote:

In D134354#3817591, @Pierre-vh wrote:

Change tests to use fneg instead of fsub, as discussed.

This should be done as a separate precommit

I didn't change existing tests, I mean in the tests I added (in the GISel folder)

Harbormaster completed remote builds in B188934: Diff 463207.Sep 27 2022, 6:59 AM

In D134354#3818056, @Pierre-vh wrote:

I didn't change existing tests, I mean in the tests I added (in the GISel folder)

But it's still a copy pasted copy of an existing test, so they're diverging. The goal is still to move towards a shared test

Pierre-vh mentioned this in D134793: [AMDGPU] Update `mad-mix*` CodeGen tests.Sep 27 2022, 11:50 PM

Pierre-vh added a parent revision: D134793: [AMDGPU] Update `mad-mix*` CodeGen tests.Sep 27 2022, 11:50 PM

Pierre-vh removed a parent revision: D134635: [AMDGPU][GlobalISel] Add Shift/Shufflevector Combine.

Switch to common, generated tests for DAG/GISel (Using D134793)

Note: I had to reduce the number of check prefixes in the tests because they seemed to confuse update_llc_test_checks; it kept generating non-working tests when too many check prefixes overlapped across the run lines.

Harbormaster completed remote builds in B189106: Diff 463453.Sep 28 2022, 1:47 AM

Rebase, there's a few regressions due to legalizer changes now

Harbormaster completed remote builds in B189154: Diff 463523.Sep 28 2022, 6:59 AM

arsenm added inline comments.Sep 28 2022, 8:29 AM

llvm/lib/CodeGen/GlobalISel/Utils.cpp
665–667 ↗	(On Diff #463165)	Can you fix the DAG version? Plus this should be its own patch with its own testing

Pierre-vh mentioned this in rG682c7c77f59a: [AMDGPU] Update `mad-mix*` CodeGen tests.Sep 29 2022, 12:11 AM

Pierre-vh edited parent revisions, added: D134635: [AMDGPU][GlobalISel] Add Shift/Shufflevector Combine; removed: D134793: [AMDGPU] Update `mad-mix*` CodeGen tests.Sep 29 2022, 12:13 AM

Pierre-vh edited parent revisions, added: D134857: [GISel] Add more cases to isKnownNeverNaN; removed: D134635: [AMDGPU][GlobalISel] Add Shift/Shufflevector Combine.Sep 29 2022, 1:12 AM

Pierre-vh added a parent revision: D134635: [AMDGPU][GlobalISel] Add Shift/Shufflevector Combine.

Rebase on top of D134861
Splitting up the patch quite a bit so I can commit the straightforward changes sooner

Pierre-vh added a parent revision: D134861: [TableGen] Add `countRendererFns` to `InstructionOperandMatcher`.Sep 29 2022, 1:18 AM

Harbormaster completed remote builds in B189337: Diff 463790.Sep 29 2022, 1:18 AM

Rebase on top of D134862
Another set of changes removed on the diff, now it only focuses on mad_mix

Pierre-vh edited parent revisions, added: D134862: [AMDGPU][GISel] Update `isCanonicalized`; removed: D134857: [GISel] Add more cases to isKnownNeverNaN.Sep 29 2022, 1:35 AM

Harbormaster completed remote builds in B189340: Diff 463795.Sep 29 2022, 1:36 AM

Pierre-vh mentioned this in D134870: [AMDGPU][GISel] Combine V2S16 G_EXTRACT/INSERT_VECTOR_ELT.Sep 29 2022, 3:41 AM

Pierre-vh added a parent revision: D134870: [AMDGPU][GISel] Combine V2S16 G_EXTRACT/INSERT_VECTOR_ELT.Sep 29 2022, 3:42 AM

Pierre-vh removed a parent revision: D134870: [AMDGPU][GISel] Combine V2S16 G_EXTRACT/INSERT_VECTOR_ELT.Sep 29 2022, 3:46 AM

Pierre-vh removed a parent revision: D134635: [AMDGPU][GlobalISel] Add Shift/Shufflevector Combine.Sep 29 2022, 3:51 AM

Pierre-vh edited the summary of this revision. (Show Details)

Rebase on tree that includes D134870.

All big regressions have been addressed, there's still a few minor differences between the DAG/GISel variants but I think they're not critical (I'll still look at them but please review in the meantime)
-> v_mad_mixhi_f16_f16lo_f16lo_f16lo_undeflo_clamp_precvt

-> weird use of shl but it's just one extra inst

-> v_mad_mixhi_f16_f16lo_f16lo_f16lo_intpack

-> constant rematerialization issue

Harbormaster completed remote builds in B189371: Diff 463841.Sep 29 2022, 4:02 AM

Fix v_mad_mixhi_f16_f16lo_f16lo_f16lo_undeflo_clamp_precvt

Harbormaster completed remote builds in B189382: Diff 463856.Sep 29 2022, 5:46 AM

Pierre-vh mentioned this in rG6886f094e8af: [TableGen] Add `countRendererFns` to `InstructionOperandMatcher`.Sep 30 2022, 12:26 AM

arsenm accepted this revision.Sep 30 2022, 5:23 AM

arsenm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
5101–5108	I'd expect this to be dead code. 16-bit vector extracts are lowered to 32-bit bit ops
5164–5165	Asserting on s16 is adequate

This revision is now accepted and ready to land.Sep 30 2022, 5:23 AM

Rebase, there's a few very small regressions due to the change in FMIN/MAXNUM legalization rules

Pierre-vh edited the summary of this revision. (Show Details)Sep 30 2022, 5:35 AM

Harbormaster completed remote builds in B189658: Diff 464232.Sep 30 2022, 5:36 AM

Comments

Needs re-review of the latest 2 changes because there's a few codegen changes due to the rebase

Harbormaster completed remote builds in B189661: Diff 464235.Sep 30 2022, 5:40 AM

LGTM

Pierre-vh mentioned this in rG9a67a6b72af1: [AMDGPU][GISel] Legalize V2S16 G_BUILD_VECTOR.Sep 30 2022, 7:05 AM

Pierre-vh added a child revision: D134967: [AMDGPU] Always lower SHUFFLE_VECTOR.Sep 30 2022, 7:56 AM

arsenm mentioned this in D117765: [AMDGPU][GlobalISel] Select source modifiers for VOP3Opsel.Sep 30 2022, 8:34 AM

Pierre-vh removed a child revision: D134967: [AMDGPU] Always lower SHUFFLE_VECTOR.Oct 4 2022, 2:19 AM

Pierre-vh mentioned this in D135147: [GISel] Handle G_TRUNC in `matchExtractVecEltBuildVec`.Oct 4 2022, 4:56 AM

Pierre-vh mentioned this in D135148: [GISel] Add Trunc/Lshr/BuildVector Folding.

Pierre-vh added parent revisions: D135148: [GISel] Add Trunc/Lshr/BuildVector Folding, D135147: [GISel] Handle G_TRUNC in `matchExtractVecEltBuildVec`, D135146: [GISel] Add redundant bitcast folding combine, D135145: [GISel] Combine G_INSERT_VECTOR_ELT to G_SHUFFLE_VECTOR.Oct 4 2022, 4:58 AM

Rebase

Harbormaster completed remote builds in B190177: Diff 464968.Oct 4 2022, 5:00 AM

Needs another quick re-review because of the shuffle/shift combine removal - there's some small regressions I was unable to fix for now but they only affect VI/CI

Pierre-vh mentioned this in rGa34977c4d010: [GISel] Handle G_TRUNC in `matchExtractVecEltBuildVec`.Oct 7 2022, 1:37 AM

Pierre-vh mentioned this in rG36c3833783f0: [GISel] Add Trunc/Lshr/BuildVector Folding.Oct 7 2022, 1:44 AM

rebase

Harbormaster completed remote builds in B191675: Diff 467050.Oct 12 2022, 1:14 AM

arsenm accepted this revision.Oct 12 2022, 8:43 AM

This revision is now accepted and ready to land.Oct 12 2022, 8:43 AM

foad added inline comments.Oct 17 2022, 3:59 AM

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
534–535	Personally I prefer to write this as `(IsFMA ? Subtarget->hasMadMixInsts() : Subtarget->hasFmaMixInsts())`. Also I don't understand what the `(!Subtarget->hasMadMixInsts() && !Subtarget->hasFmaMixInsts())` part is for. Why can't the whole thing just be: `(IsFMA ? !Subtarget->hasFmaMixInsts() : !Subtarget->hasMadMixInsts())`?
753	Don't need the braces.
759	Don't need the braces.
llvm/lib/Target/AMDGPU/SIInstructions.td
2752 ↗	(On Diff #467050)	This looks weird. Why would we ever want to generate V_LSHLREV_B32 with an sgpr operand?

Comments

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
534–535	That condition was imported from selectFMAD_FMA in the DAGISel and yeah, after looking at it more carefully it's indeed redundant, I rewrote it.
llvm/lib/Target/AMDGPU/SIInstructions.td
2752 ↗	(On Diff #467050)	I just changed the pattern matching to also allow VGPR operands (which, IIRC, will be copied to the right regbank in any case). Not sure why the pattern uses a SGPR output though. I did this change because it helped codegen in one test go from:

v_mad_mix_f32 v0, v0, v1, v2 op_sel_hi:[1,1,1] clamp
v_cvt_f16_f32_e32 v0, v0
v_and_b32_e32 v1, 0xffff, v0
v_lshl_or_b32 v0, v0, 16, v1

to

v_mad_mix_f32 v0, v0, v1, v2 op_sel_hi:[1,1,1] clamp
v_cvt_f16_f32_sdwa v0, v0 dst_sel:WORD_1 dst_unused:UNUSED_PAD src0_sel:DWORD

Perhaps it'd be more adequate to duplicate the pattern and have another one that uses VReg input/output?
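For the condition discussed at lines 534–535, the following sketch enumerates every input to compare the bail-out check as written in the diff with the suggested simplification. It leaves out the result-type check and assumes that a subtarget never reports both mad_mix and fma_mix support; that assumption, and the function names, are illustrative rather than taken from the patch.

```cpp
#include <cassert>

// Bail-out condition as written in the diff (result-type check omitted).
constexpr bool bailOriginal(bool IsFMA, bool HasMadMix, bool HasFmaMix) {
  return (!HasMadMix && !HasFmaMix) || (IsFMA && HasMadMix) ||
         (!IsFMA && HasFmaMix);
}

// Simplified form suggested in review.
constexpr bool bailSimplified(bool IsFMA, bool HasMadMix, bool HasFmaMix) {
  return IsFMA ? !HasFmaMix : !HasMadMix;
}

// Counts the inputs on which the two forms disagree. When AllowBoth is
// false, combinations with both features enabled are skipped.
inline int countMismatches(bool AllowBoth) {
  int Mismatches = 0;
  for (int M = 0; M < 8; ++M) {
    bool IsFMA = M & 1, Mad = M & 2, Fma = M & 4;
    if (!AllowBoth && Mad && Fma)
      continue;
    if (bailOriginal(IsFMA, Mad, Fma) != bailSimplified(IsFMA, Mad, Fma))
      ++Mismatches;
  }
  return Mismatches;
}
```

Under the stated assumption `countMismatches(false)` is 0, so the two forms agree; they differ only in the two cases where both features are set at once.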

Harbormaster completed remote builds in B192921: Diff 468798.Oct 19 2022, 12:29 AM

foad added inline comments.Oct 19 2022, 1:44 AM

llvm/lib/Target/AMDGPU/AMDGPUGISel.td
156	Is this required? I am confused about whether you are now matching VOP3PMadMixMods in TableGen patterns, or only doing it from the C++ code in selectG_FMA.
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
525	Call it selectG_FMA_FMAD.
534	The usual convention seems to be to return false (with no braces) here, and have the caller do:

if (selectG_FMA(I))
  return true;
return selectImpl(I, *CoverageInfo);

Would that work? Or is there something special about the return false on line 575?
558	Return false? No braces.
574	No braces.
751	Is this related to the current patch? Could it be split out?
llvm/lib/Target/AMDGPU/SIInstructions.td
2752 ↗	(On Diff #467050)	Maybe split this into a separate patch and we can discuss it there? It doesn't seem to be directly related to selecting the mad_mix instruction.
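The calling convention suggested for line 534 can be sketched in isolation as follows. `Instr`, `Selector` and `pathFor` are hypothetical miniatures, not LLVM types; only the control flow (manual selector declines by returning false, caller falls back to the generated matcher) mirrors the comment.

```cpp
#include <cassert>
#include <string>

// Hypothetical miniature of an instruction; not an LLVM type.
struct Instr {
  std::string Opcode;
  bool HasF16Source; // whether any source is an extension from f16
};

struct Selector {
  std::string Chosen; // records which path selected the instruction

  // Manual selection: declines by returning false without side effects.
  bool selectG_FMA_FMAD(const Instr &I) {
    if (!I.HasF16Source)
      return false;
    Chosen = "manual";
    return true;
  }

  // Stand-in for the TableGen-generated matcher.
  bool selectImpl(const Instr &I) {
    (void)I;
    Chosen = "imported-pattern";
    return true;
  }

  // The caller owns the fallback, so the manual selector never calls
  // selectImpl itself.
  bool select(const Instr &I) {
    if (I.Opcode == "G_FMA" || I.Opcode == "G_FMAD") {
      if (selectG_FMA_FMAD(I))
        return true;
    }
    return selectImpl(I);
  }
};

// Convenience wrapper: which path handled the instruction?
inline std::string pathFor(const Instr &I) {
  Selector S;
  S.select(I);
  return S.Chosen;
}
```

Keeping the fallback in the caller means every early exit in the manual selector is a plain `return false`.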

Pierre-vh mentioned this in D136235: [AMDGPU][GISel] Constrain selected operands in selectG_BUILD_VECTOR.Oct 19 2022, 2:01 AM

Comments, splitting up stuff in other diffs

Pierre-vh added a parent revision: D136235: [AMDGPU][GISel] Constrain selected operands in selectG_BUILD_VECTOR.Oct 19 2022, 2:11 AM

Pierre-vh added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUGISel.td
156	I'm doing it exclusively in C++. This is needed to tell the GlobalISelEmitter that, when it sees "VOP3PMadMixMods" it needs to use the "selectVOP3PMadMixMods" function from AMDGPUInstructionSelector. Without that, it wouldn't import patterns that use it.
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
751	It's required. D136235
llvm/lib/Target/AMDGPU/SIInstructions.td
2752 ↗	(On Diff #467050)	v_mad_mixhi_f16_f16lo_f16lo_f16lo_undeflo_clamp_precvt regressed due to splitting the change into D136236

Harbormaster completed remote builds in B192939: Diff 468822.Oct 19 2022, 2:11 AM

foad added inline comments.Oct 19 2022, 2:18 AM

llvm/lib/Target/AMDGPU/AMDGPUGISel.td
156	I'm still confused. If we are now importing patterns that use VOP3PMadMixMods, why do you need to do the selection in C++ code?

Pierre-vh marked an inline comment as done.Oct 19 2022, 2:21 AM

Pierre-vh added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUGISel.td
156	It's a ComplexPattern:

def VOP3PMadMixMods : ComplexPattern<untyped, 2, "SelectVOP3PMadMixMods">;

Those functions work with DAG nodes and must be rewritten for GISel so the patterns can be imported: https://llvm.org/docs/GlobalISel/InstructionSelect.html#complexpatterns
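To make the shape of such a GISel rewrite concrete, here is a self-contained sketch of the "renderer functions" idea: the matcher either declines or returns one callback per operand of the ComplexPattern (two here, matching the `2` in the definition above). `Builder`, `Operand`, the modifier bit values and the function names are hypothetical stand-ins, not the LLVM types or the actual implementation in this patch.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the instruction builder and a source operand.
struct Builder {
  std::vector<int64_t> Operands;
  void addReg(unsigned R) { Operands.push_back(R); }
  void addImm(int64_t V) { Operands.push_back(V); }
};

struct Operand {
  unsigned Reg;
  bool IsF16Ext; // pretend the value comes from an f16 -> f32 extension
  bool Negated;
};

using RendererFn = std::function<void(Builder &)>;
using ComplexRendererFns = std::optional<std::vector<RendererFn>>;

constexpr int64_t ModNeg = 1;   // illustrative bit values only
constexpr int64_t ModOpSel = 4;

// Shared helper: always yields (register, modifier bits) and reports via
// Matched whether an f16 conversion was folded into the modifiers.
inline std::pair<unsigned, int64_t> selectModsImpl(const Operand &Root,
                                                   bool &Matched) {
  int64_t Mods = Root.Negated ? ModNeg : 0;
  Matched = Root.IsF16Ext;
  if (Matched)
    Mods |= ModOpSel;
  return {Root.Reg, Mods};
}

// Pattern-facing entry point: declines with nullopt, otherwise returns one
// renderer per output operand.
inline ComplexRendererFns selectMods(const Operand &Root) {
  bool Matched = false;
  auto [Src, Mods] = selectModsImpl(Root, Matched);
  if (!Matched)
    return std::nullopt;
  return std::vector<RendererFn>{
      [Src = Src](Builder &B) { B.addReg(Src); },
      [Mods = Mods](Builder &B) { B.addImm(Mods); }};
}

// Applies the renderers, as the generated matcher would after a match.
inline std::vector<int64_t> render(const ComplexRendererFns &Fns) {
  Builder B;
  if (Fns)
    for (const RendererFn &Fn : *Fns)
      Fn(B);
  return B.Operands;
}
```

The split into an `Impl` helper and a renderer-returning wrapper follows the two declarations added to AMDGPUInstructionSelector.h in this revision, so that manual selection code can reuse the helper directly.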

foad added inline comments.Oct 19 2022, 2:26 AM

llvm/lib/Target/AMDGPU/AMDGPUGISel.td
156	Right, but why do you need to write code in selectG_FMA_FMAD to manually select MAD_MIX/FMA_MIX? Why can't the imported patterns do this automatically?

Pierre-vh marked 2 inline comments as done.Oct 19 2022, 2:31 AM

Pierre-vh added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUGISel.td
156	Because there are no patterns for MIX, only for MIXHI/MIXLO. MIX selection is still manual in the DAG as well (see SelectFMAD_FMA)

foad added inline comments.Oct 19 2022, 2:38 AM

llvm/lib/Target/AMDGPU/AMDGPUGISel.td
156	OK, thanks for explaining. Out of curiosity, is there a good reason why mix selection cannot be done with patterns?
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp
3729	I've also done this as part of D136238.

foad added inline comments.Oct 19 2022, 3:15 AM

llvm/lib/Target/AMDGPU/AMDGPUGISel.td
157	Why is this "s64"?

Rebase

Harbormaster completed remote builds in B192955: Diff 468842.Oct 19 2022, 3:16 AM

Pierre-vh added inline comments.Oct 19 2022, 3:18 AM

llvm/lib/Target/AMDGPU/AMDGPUGISel.td
157	I'm not sure about this actually, I just followed the examples above. It's supposed to be "the expected type at the root of the match" but IIRC it doesn't look like it matters much, in this case at least

arsenm accepted this revision.Oct 19 2022, 8:27 AM

arsenm added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUGISel.td
157	I'm surprised this doesn't do anything but should probably be untyped

Pierre-vh marked an inline comment as done.Oct 20 2022, 3:01 AM

Pierre-vh added inline comments.

llvm/lib/Target/AMDGPU/AMDGPUGISel.td

157

untyped doesn't work

[build] /home/pvanhout/work/trunk/llvm-project/llvm/lib/Target/AMDGPU/AMDGPUGISel.td:157:5: error: Value specified for template argument 'GIComplexOperandMatcher:type' (#0) is of type ValueType; expected type LLT: untyped
[build]     GIComplexOperandMatcher<untyped, "selectVOP3PMadMixMods">,

There's a TODO for it

  // The expected type of the root of the match.
  //
  // TODO: We should probably support, any-type, any-scalar, and multiple types
  //       in the future.
  LLT Type = type;

Pierre-vh mentioned this in rG1809414fe19a: [AMDGPU][GISel] Constrain selected operands in selectG_BUILD_VECTOR.Oct 20 2022, 11:50 PM

Rebase

Harbormaster completed remote builds in B193430: Diff 469480.Oct 21 2022, 12:34 AM

DAG -> SDAG for FileCheck prefix
Still blocked by parent diff

Harbormaster completed remote builds in B193929: Diff 470142.Oct 24 2022, 8:10 AM

Pierre-vh mentioned this in D135145: [GISel] Combine G_INSERT_VECTOR_ELT to G_SHUFFLE_VECTOR.Oct 25 2022, 4:30 AM

Pierre-vh added a parent revision: D136922: [AMDGPU][GISel] Widen s16 SHUFFLE_VECTOR where there are no scalar pack insts.Oct 28 2022, 1:18 AM

Rebase

There are some codegen changes (due to D136922) but it looks like a net positive everywhere

Harbormaster completed remote builds in B194852: Diff 471432.Oct 28 2022, 1:24 AM

LGTM

Rebase without D136922, some regressions are back but it's relatively small.

Harbormaster completed remote builds in B196100: Diff 473175.Nov 4 2022, 2:29 AM

Abandoning D135145 as we can't really decide whether it's a good or bad thing.
There's a couple of regressions then but nothing blocking landing afaik.
@arsenm can you confirm this is good to land? For the remaining cases I'll add them to my notes and try to come up with something when time allows

Pierre-vh requested review of this revision.Nov 6 2022, 11:48 PM

Harbormaster completed remote builds in B196419: Diff 473579.Nov 6 2022, 11:49 PM

Pierre-vh mentioned this in D136236: [AMDGPU][GISel] Allow VReg srcs in (build_vector undef, i16) pattern.Nov 6 2022, 11:53 PM

Should work to fix these regressions vs. the DAG but they certainly aren't directly related to this

This revision is now accepted and ready to land.Nov 7 2022, 9:33 AM

This revision was landed with ongoing or failed builds.Nov 8 2022, 12:02 AM

Closed by commit rG767999fca848: [AMDGPU][GlobalISel] Support mad/fma_mix selection (authored by Pierre-vh). · Explain Why

This revision was automatically updated to reflect the committed changes.

Pierre-vh added a commit: rG767999fca848: [AMDGPU][GlobalISel] Support mad/fma_mix selection.

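Revision Contents

Path	Size
llvm/lib/Target/AMDGPU/AMDGPUGISel.td	4 lines
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h	5 lines
llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp	244 lines
llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp	3 lines
llvm/lib/Target/AMDGPU/VOP3PInstructions.td	4 lines
llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-ext-fma.ll	214 lines
llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-ext-mul.ll	87 lines
llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-sub-ext-mul.ll	50 lines
llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-sub-ext-neg-mul.ll	114 lines
(path not shown)	88 lines
(path not shown)	157 lines
(path not shown)	375 lines
(path not shown)	581 lines
llvm/utils/TableGen/GlobalISelEmitter.cpp	6 lines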

Diff 462903

llvm/lib/Target/AMDGPU/AMDGPUGISel.td

[lines elided]
 def gi_smrd_buffer_imm32 :
     GIComplexOperandMatcher<s64, "selectSMRDBufferImm32">,
     GIComplexPatternEquiv<SMRDBufferImm32>;

 def gi_smrd_buffer_sgpr_imm :
     GIComplexOperandMatcher<s64, "selectSMRDBufferSgprImm">,
     GIComplexPatternEquiv<SMRDBufferSgprImm>;

+def gi_vop3_mad_mix_mods :
+    GIComplexOperandMatcher<s64, "selectVOP3PMadMixMods">,
+    GIComplexPatternEquiv<VOP3PMadMixMods>;

 // Separate load nodes are defined to glue m0 initialization in
 // SelectionDAG. The GISel selector can just insert m0 initialization
 // directly before selecting a glue-less load, so hide this
 // distinction.

 def : GINodeEquiv<G_LOAD, AMDGPUld_glue> {
   let CheckMMOIsNonAtomic = 1;
   let IfSignExtend = G_SEXTLOAD;
[lines elided]

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h

[lines elided; enclosing context: `private:`]
   bool selectG_CONSTANT(MachineInstr &I) const;
   bool selectG_FNEG(MachineInstr &I) const;
   bool selectG_FABS(MachineInstr &I) const;
   bool selectG_AND_OR_XOR(MachineInstr &I) const;
   bool selectG_ADD_SUB(MachineInstr &I) const;
   bool selectG_UADDO_USUBO_UADDE_USUBE(MachineInstr &I) const;
   bool selectG_AMDGPU_MAD_64_32(MachineInstr &I) const;
   bool selectG_EXTRACT(MachineInstr &I) const;
+  bool selectG_FMA(MachineInstr &I) const;
   bool selectG_MERGE_VALUES(MachineInstr &I) const;
   bool selectG_UNMERGE_VALUES(MachineInstr &I) const;
   bool selectG_BUILD_VECTOR(MachineInstr &I) const;
   bool selectG_PTR_ADD(MachineInstr &I) const;
   bool selectG_IMPLICIT_DEF(MachineInstr &I) const;
   bool selectG_INSERT(MachineInstr &I) const;
   bool selectG_SBFX_UBFX(MachineInstr &I) const;

[lines elided; enclosing context: `private:`]

   InstructionSelector::ComplexRendererFns
   selectMUBUFAddr64Atomic(MachineOperand &Root) const;

   ComplexRendererFns selectSMRDBufferImm(MachineOperand &Root) const;
   ComplexRendererFns selectSMRDBufferImm32(MachineOperand &Root) const;
   ComplexRendererFns selectSMRDBufferSgprImm(MachineOperand &Root) const;

+  std::pair<Register, unsigned> selectVOP3PMadMixModsImpl(MachineOperand &Root,
+                                                          bool &Matched) const;
+  ComplexRendererFns selectVOP3PMadMixMods(MachineOperand &Root) const;

   void renderTruncImm32(MachineInstrBuilder &MIB, const MachineInstr &MI,
                         int OpIdx = -1) const;

   void renderTruncTImm(MachineInstrBuilder &MIB, const MachineInstr &MI,
                        int OpIdx) const;

   void renderNegateImm(MachineInstrBuilder &MIB, const MachineInstr &MI,
                        int OpIdx) const;
[lines elided]

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp

Show First 20 Lines • Show All 516 Lines • ▼ Show 20 Lines	bool AMDGPUInstructionSelector::selectG_EXTRACT(MachineInstr &I) const {
const DebugLoc &DL = I.getDebugLoc();		const DebugLoc &DL = I.getDebugLoc();
BuildMI(*BB, &I, DL, TII.get(TargetOpcode::COPY), DstReg)		BuildMI(*BB, &I, DL, TII.get(TargetOpcode::COPY), DstReg)
.addReg(SrcReg, 0, SubReg);		.addReg(SrcReg, 0, SubReg);

I.eraseFromParent();		I.eraseFromParent();
return true;		return true;
}		}

		bool AMDGPUInstructionSelector::selectG_FMA(MachineInstr &I) const {
		foadUnsubmitted Done Reply Inline Actions Call it selectG_FMA_FMAD. foad: Call it selectG_FMA_FMAD.
		assert(I.getOpcode() == AMDGPU::G_FMA \|\| I.getOpcode() == AMDGPU::G_FMAD);

		// Try to manually select MAD_MIX/FMA_MIX.
		Register Dst = I.getOperand(0).getReg();
		LLT ResultTy = MRI->getType(Dst);
		bool IsFMA = I.getOpcode() == AMDGPU::G_FMA;
		if (ResultTy != LLT::scalar(32) ||
		(!Subtarget->hasMadMixInsts() && !Subtarget->hasFmaMixInsts()) ||
		((IsFMA && Subtarget->hasMadMixInsts()) ||
		foad: The usual convention seems to be to return false (with no braces) here, and have the caller do: if (selectG_FMA(I)) return true; return selectImpl(I, *CoverageInfo); Would that work? Or is there something special about the return false on line 575?
		(!IsFMA && Subtarget->hasFmaMixInsts()))) {
		foad: Personally I prefer to write this as `(IsFMA ? Subtarget->hasMadMixInsts() : Subtarget->hasFmaMixInsts())`. Also I don't understand what the `(!Subtarget->hasMadMixInsts() && !Subtarget->hasFmaMixInsts())` part is for. Why can't the whole thing just be: `(IsFMA ? !Subtarget->hasFmaMixInsts() : !Subtarget->hasMadMixInsts())`?
		Pierre-vh (author): That condition was imported from selectFMAD_FMA in the DAGISel and yeah, after looking at it more carefully it's indeed redundant, I rewrote it.
		return selectImpl(I, *CoverageInfo);
		}

		// Avoid using v_mad_mix_f32/v_fma_mix_f32 unless there is actually an operand
		// using the conversion from f16.
		bool MatchedSrc0, MatchedSrc1, MatchedSrc2;
		auto [Src0, Src0Mods] =
		selectVOP3PMadMixModsImpl(I.getOperand(1), MatchedSrc0);
		auto [Src1, Src1Mods] =
		selectVOP3PMadMixModsImpl(I.getOperand(2), MatchedSrc1);
		auto [Src2, Src2Mods] =
		selectVOP3PMadMixModsImpl(I.getOperand(3), MatchedSrc2);

		#ifndef NDEBUG
		const SIMachineFunctionInfo *MFI =
		I.getMF()->getInfo<SIMachineFunctionInfo>();
		AMDGPU::SIModeRegisterDefaults Mode = MFI->getMode();
		assert((IsFMA \|\| !Mode.allFP32Denormals()) &&
		"fmad selected with denormals enabled");
		#endif

		// TODO: We can select this with f32 denormals enabled if all the sources are
		// converted from f16 (in which case fmad isn't legal).
		foad: Return false? No braces.
		if (!MatchedSrc0 && !MatchedSrc1 && !MatchedSrc2) {
		return selectImpl(I, *CoverageInfo);
		}

		const unsigned OpC = IsFMA ? AMDGPU::V_FMA_MIX_F32 : AMDGPU::V_MAD_MIX_F32;
		MachineInstr *MixInst =
		BuildMI(*I.getParent(), I, I.getDebugLoc(), TII.get(OpC), Dst)
		.addImm(Src0Mods)
		.addReg(Src0)
		.addImm(Src1Mods)
		.addReg(Src1)
		.addImm(Src2Mods)
		.addReg(Src2)
		.addImm(0)
		.addImm(0)
		.addImm(0);
		foad: No braces.

		if (!constrainSelectedInstRegOperands(*MixInst, TII, TRI, RBI)) {
		return false;
		}

		I.removeFromParent();
		arsenm: eraseFromParent
		return true;
		}
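
The bail-out condition discussed in the inline comments on this function can be checked by brute force. The following is a minimal sketch, not code from the patch: `MixSupport`, `bailOriginal`, `bailSimplified` and `formsAgreeWhenExclusive` are hypothetical names that stand in for the `Subtarget->hasMadMixInsts()` / `Subtarget->hasFmaMixInsts()` queries. It assumes that no subtarget reports both features at once; under that assumption the original three-part condition and the ternary form suggested in review appear to agree.

```cpp
#include <cassert>

// Hypothetical stand-in for the two subtarget feature queries.
struct MixSupport {
  bool HasMadMix; // models Subtarget->hasMadMixInsts()
  bool HasFmaMix; // models Subtarget->hasFmaMixInsts()
};

// Bail-out condition as written in the diff (ResultTy check omitted).
bool bailOriginal(bool IsFMA, MixSupport S) {
  return (!S.HasMadMix && !S.HasFmaMix) ||
         (IsFMA && S.HasMadMix) ||
         (!IsFMA && S.HasFmaMix);
}

// Simplified form suggested in review.
bool bailSimplified(bool IsFMA, MixSupport S) {
  return IsFMA ? !S.HasFmaMix : !S.HasMadMix;
}

// Compares both forms for every input where at most one feature is set.
// The "at most one" restriction is an assumption of this sketch.
bool formsAgreeWhenExclusive() {
  for (int IsFMA = 0; IsFMA < 2; ++IsFMA)
    for (int Mad = 0; Mad < 2; ++Mad)
      for (int Fma = 0; Fma < 2; ++Fma) {
        if (Mad && Fma)
          continue; // excluded by the exclusivity assumption
        MixSupport S{Mad != 0, Fma != 0};
        if (bailOriginal(IsFMA != 0, S) != bailSimplified(IsFMA != 0, S))
          return false;
      }
  return true;
}
```

If a subtarget ever had both features, the two forms would differ (the original bails out, the ternary does not), so the simplification relies on the features being mutually exclusive.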

bool AMDGPUInstructionSelector::selectG_MERGE_VALUES(MachineInstr &MI) const {		bool AMDGPUInstructionSelector::selectG_MERGE_VALUES(MachineInstr &MI) const {
MachineBasicBlock *BB = MI.getParent();		MachineBasicBlock *BB = MI.getParent();
Register DstReg = MI.getOperand(0).getReg();		Register DstReg = MI.getOperand(0).getReg();
LLT DstTy = MRI->getType(DstReg);		LLT DstTy = MRI->getType(DstReg);
LLT SrcTy = MRI->getType(MI.getOperand(1).getReg());		LLT SrcTy = MRI->getType(MI.getOperand(1).getReg());

const unsigned SrcSize = SrcTy.getSizeInBits();		const unsigned SrcSize = SrcTy.getSizeInBits();
if (SrcSize < 32)		if (SrcSize < 32)
return selectImpl(MI, *CoverageInfo);		return selectImpl(MI, *CoverageInfo);

		arsenm: Weird ternary usage. Should use some returns? Also should just consolidate the G_BUILD_VECTOR handling above
const DebugLoc &DL = MI.getDebugLoc();		const DebugLoc &DL = MI.getDebugLoc();
const RegisterBank *DstBank = RBI.getRegBank(DstReg, *MRI, TRI);		const RegisterBank *DstBank = RBI.getRegBank(DstReg, *MRI, TRI);
const unsigned DstSize = DstTy.getSizeInBits();		const unsigned DstSize = DstTy.getSizeInBits();
const TargetRegisterClass *DstRC =		const TargetRegisterClass *DstRC =
TRI.getRegClassForSizeOnBank(DstSize, *DstBank);		TRI.getRegClassForSizeOnBank(DstSize, *DstBank);
if (!DstRC)		if (!DstRC)
return false;		return false;

bool AMDGPUInstructionSelector::selectG_BUILD_VECTOR(MachineInstr &MI) const {
// For G_BUILD_VECTOR_TRUNC, additionally check that the operands are s32.		// For G_BUILD_VECTOR_TRUNC, additionally check that the operands are s32.
Register Dst = MI.getOperand(0).getReg();		Register Dst = MI.getOperand(0).getReg();
if (MRI->getType(Dst) != LLT::fixed_vector(2, 16) ||		if (MRI->getType(Dst) != LLT::fixed_vector(2, 16) ||
(MI.getOpcode() == AMDGPU::G_BUILD_VECTOR_TRUNC &&		(MI.getOpcode() == AMDGPU::G_BUILD_VECTOR_TRUNC &&
SrcTy != LLT::scalar(32)))		SrcTy != LLT::scalar(32)))
return selectImpl(MI, *CoverageInfo);		return selectImpl(MI, *CoverageInfo);

const RegisterBank *DstBank = RBI.getRegBank(Dst, *MRI, TRI);		const RegisterBank *DstBank = RBI.getRegBank(Dst, *MRI, TRI);
		if (DstBank->getID() == AMDGPU::AGPRRegBankID)
		return false;

assert(DstBank->getID() == AMDGPU::SGPRRegBankID ||		assert(DstBank->getID() == AMDGPU::SGPRRegBankID ||
DstBank->getID() == AMDGPU::VGPRRegBankID);		DstBank->getID() == AMDGPU::VGPRRegBankID);
const bool IsVector = DstBank->getID() == AMDGPU::VGPRRegBankID;		const bool IsVector = DstBank->getID() == AMDGPU::VGPRRegBankID;

const DebugLoc &DL = MI.getDebugLoc();		const DebugLoc &DL = MI.getDebugLoc();
MachineBasicBlock *BB = MI.getParent();		MachineBasicBlock *BB = MI.getParent();

// First, before trying TableGen patterns, check if both sources are		// First, before trying TableGen patterns, check if both sources are
const auto &RC =
IsVector ? AMDGPU::VGPR_32RegClass : AMDGPU::SReg_32RegClass;		IsVector ? AMDGPU::VGPR_32RegClass : AMDGPU::SReg_32RegClass;
return RBI.constrainGenericRegister(Dst, RC, *MRI) &&		return RBI.constrainGenericRegister(Dst, RC, *MRI) &&
RBI.constrainGenericRegister(Src0, RC, *MRI);		RBI.constrainGenericRegister(Src0, RC, *MRI);
}		}

// TODO: Can be improved?		// TODO: Can be improved?
if (IsVector) {		if (IsVector) {
Register TmpReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);		Register TmpReg = MRI->createVirtualRegister(&AMDGPU::VGPR_32RegClass);
BuildMI(*BB, MI, DL, TII.get(AMDGPU::V_AND_B32_e32), TmpReg)		auto MIB = BuildMI(*BB, MI, DL, TII.get(AMDGPU::V_AND_B32_e32), TmpReg)
.addImm(0xFFFF)		.addImm(0xFFFF)
.addReg(Src0);		.addReg(Src0);
BuildMI(*BB, MI, DL, TII.get(AMDGPU::V_LSHL_OR_B32_e64), Dst)		if (!constrainSelectedInstRegOperands(*MIB, TII, TRI, RBI)) {
		foad: Is this related to the current patch? Could it be split out?
		Pierre-vh (author): It's required. D136235
		return false;
		}
		foad: Don't need the braces.

		MIB = BuildMI(*BB, MI, DL, TII.get(AMDGPU::V_LSHL_OR_B32_e64), Dst)
.addReg(Src1)		.addReg(Src1)
.addImm(16)		.addImm(16)
.addReg(TmpReg);		.addReg(TmpReg);
		if (!constrainSelectedInstRegOperands(*MIB, TII, TRI, RBI)) {
		foad: Don't need the braces.
		return false;
		}

MI.eraseFromParent();		MI.eraseFromParent();
return true;		return true;
}		}

Register ShiftSrc0;		Register ShiftSrc0;
Register ShiftSrc1;		Register ShiftSrc1;

// With multiple uses of the shift, this will duplicate the shift and		// With multiple uses of the shift, this will duplicate the shift and
if (selectImpl(I, *CoverageInfo))
return true;		return true;
return selectG_FNEG(I);		return selectG_FNEG(I);
case TargetOpcode::G_FABS:		case TargetOpcode::G_FABS:
if (selectImpl(I, *CoverageInfo))		if (selectImpl(I, *CoverageInfo))
return true;		return true;
return selectG_FABS(I);		return selectG_FABS(I);
case TargetOpcode::G_EXTRACT:		case TargetOpcode::G_EXTRACT:
return selectG_EXTRACT(I);		return selectG_EXTRACT(I);
		case TargetOpcode::G_FMA:
		case TargetOpcode::G_FMAD:
		return selectG_FMA(I);
case TargetOpcode::G_MERGE_VALUES:		case TargetOpcode::G_MERGE_VALUES:
case TargetOpcode::G_CONCAT_VECTORS:		case TargetOpcode::G_CONCAT_VECTORS:
return selectG_MERGE_VALUES(I);		return selectG_MERGE_VALUES(I);
case TargetOpcode::G_UNMERGE_VALUES:		case TargetOpcode::G_UNMERGE_VALUES:
return selectG_UNMERGE_VALUES(I);		return selectG_UNMERGE_VALUES(I);
case TargetOpcode::G_BUILD_VECTOR:		case TargetOpcode::G_BUILD_VECTOR:
case TargetOpcode::G_BUILD_VECTOR_TRUNC:		case TargetOpcode::G_BUILD_VECTOR_TRUNC:
return selectG_BUILD_VECTOR(I);		return selectG_BUILD_VECTOR(I);

std::pair<Register, unsigned> AMDGPUInstructionSelector::selectVOP3ModsImpl(		std::pair<Register, unsigned> AMDGPUInstructionSelector::selectVOP3ModsImpl(
MachineOperand &Root, bool AllowAbs, bool OpSel, bool ForceVGPR) const {		MachineOperand &Root, bool AllowAbs, bool OpSel, bool ForceVGPR) const {
Register Src = Root.getReg();		Register Src = Root.getReg();
Register OrigSrc = Src;		Register OrigSrc = Src;
unsigned Mods = 0;		unsigned Mods = 0;
MachineInstr *MI = getDefIgnoringCopies(Src, *MRI);		MachineInstr *MI = getDefIgnoringCopies(Src, *MRI);

if (MI && MI->getOpcode() == AMDGPU::G_FNEG) {		if (MI->getOpcode() == AMDGPU::G_FNEG) {
		foad: I've also done this as part of D136238.
Src = MI->getOperand(1).getReg();		Src = MI->getOperand(1).getReg();
Mods |= SISrcMods::NEG;		Mods |= SISrcMods::NEG;
MI = getDefIgnoringCopies(Src, *MRI);		MI = getDefIgnoringCopies(Src, *MRI);
}		}

if (AllowAbs && MI && MI->getOpcode() == AMDGPU::G_FABS) {		// TODO: Should be a combine instead
		if (MI->getOpcode() == AMDGPU::G_FSUB) {
		arsenm: MI cannot be null. We should remove the null checks elsewhere here
		MachineInstr *LHS = getDefIgnoringCopies(MI->getOperand(1).getReg(), *MRI);

		if (LHS->getOpcode() == AMDGPU::G_FCONSTANT &&
		LHS->getOperand(1).getFPImm()->isZeroValue()) {
		Src = MI->getOperand(2).getReg();
		Mods |= SISrcMods::NEG;
		MI = getDefIgnoringCopies(Src, *MRI);
		}
		}

		if (AllowAbs && MI->getOpcode() == AMDGPU::G_FABS) {
		arsenm: This should definitely be a combine and should not be in the selector. This should only directly interpret fabs and fneg
Src = MI->getOperand(1).getReg();		Src = MI->getOperand(1).getReg();
Mods |= SISrcMods::ABS;		Mods |= SISrcMods::ABS;
}		}

if (OpSel)		if (OpSel)
Mods |= SISrcMods::OP_SEL_0;		Mods |= SISrcMods::OP_SEL_0;

if ((Mods != 0 || ForceVGPR) &&		if ((Mods != 0 || ForceVGPR) &&
AMDGPUInstructionSelector::selectSMRDBufferSgprImm(MachineOperand &Root) const {
if (!EncodedOffset)		if (!EncodedOffset)
return None;		return None;

assert(MRI->getType(SOffset) == LLT::scalar(32));		assert(MRI->getType(SOffset) == LLT::scalar(32));
return {{[=](MachineInstrBuilder &MIB) { MIB.addReg(SOffset); },		return {{[=](MachineInstrBuilder &MIB) { MIB.addReg(SOffset); },
[=](MachineInstrBuilder &MIB) { MIB.addImm(*EncodedOffset); }}};		[=](MachineInstrBuilder &MIB) { MIB.addImm(*EncodedOffset); }}};
}		}

		// Variant of stripBitCast that returns the instruction instead of a
		// MachineOperand.
		static MachineInstr *stripBitCast(MachineInstr *MI, MachineRegisterInfo &MRI) {
		if (MI->getOpcode() == AMDGPU::G_BITCAST)
		return getDefIgnoringCopies(MI->getOperand(1).getReg(), MRI);
		return MI;
		}

		// Figure out if this is really an extract of the high 16-bits of a dword,
		// returns nullptr if it isn't.
		static MachineInstr *isExtractHiElt(MachineInstr *Inst,
		MachineRegisterInfo &MRI) {
		Inst = stripBitCast(Inst, MRI);

		if (Inst->getOpcode() == AMDGPU::G_EXTRACT_VECTOR_ELT) {
		MachineOperand &InOp = Inst->getOperand(2);
		if (InOp.isImm()) {
		if (InOp.getImm() != 1)
		return nullptr;
		return getDefIgnoringCopies(Inst->getOperand(1).getReg(), MRI);
		}
		}
		arsenm: I'd expect this to be dead code. 16-bit vector extracts are lowered to 32-bit bit ops

		if (Inst->getOpcode() != AMDGPU::G_TRUNC)
		return nullptr;
		arsenm: FPTRUNC would be ineligible

		MachineInstr *TruncOp =
		getDefIgnoringCopies(Inst->getOperand(1).getReg(), MRI);
		TruncOp = stripBitCast(TruncOp, MRI);

		// G_LSHR x, (G_CONSTANT i32 16)
		if (TruncOp->getOpcode() == AMDGPU::G_LSHR) {
		auto SrlAmount = getIConstantVRegValWithLookThrough(
		TruncOp->getOperand(2).getReg(), MRI);
		if (SrlAmount && SrlAmount->Value.getZExtValue() == 16) {
		MachineInstr *SrlOp =
		getDefIgnoringCopies(TruncOp->getOperand(1).getReg(), MRI);
		arsenm: Should use the constant matcher here. There shouldn't be any copies of the constant (although we're still missing a regbank constant optimization)
		return stripBitCast(SrlOp, MRI);
		}
		}

		// G_SHUFFLE_VECTOR x, y, shufflemask(1, 1|0)
		// 1, 0 swaps the low/high 16 bits.
		// 1, 1 sets the high 16 bits to be the same as the low 16.
		// in any case, it selects the high elts.
		if (TruncOp->getOpcode() == AMDGPU::G_SHUFFLE_VECTOR) {
		assert(MRI.getType(TruncOp->getOperand(0).getReg()) ==
		LLT::fixed_vector(2, 16));

		ArrayRef<int> Mask = TruncOp->getOperand(3).getShuffleMask();
		assert(Mask.size() == 2);

		if (Mask[0] == 1 && Mask[1] <= 1) {
		MachineInstr *LHS =
		getDefIgnoringCopies(TruncOp->getOperand(1).getReg(), MRI);
		return stripBitCast(LHS, MRI);
		}
		}

		return nullptr;
		}
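
To illustrate why `isExtractHiElt` treats both matched forms as "high 16 bits of a dword", here is a small model in plain C++, not LLVM code. It assumes a `<2 x s16>` value is packed into one 32-bit register with element 0 in bits [15:0] and element 1 in bits [31:16], and that the shuffle mask only selects from the first operand. The helper names `hiViaShift`, `shuffleV2S16` and `truncLow` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Models G_TRUNC (G_LSHR x, 16): shift the dword right and keep 16 bits.
uint16_t hiViaShift(uint32_t Packed) {
  return static_cast<uint16_t>(Packed >> 16);
}

// Models a two-element shuffle of a packed <2 x s16> value. M0 and M1 are
// element indices (0 or 1) into the single source; this sketch does not
// model undef indices or a second source operand.
uint32_t shuffleV2S16(uint32_t Packed, int M0, int M1) {
  const uint16_t Elt[2] = {static_cast<uint16_t>(Packed & 0xFFFFu),
                           static_cast<uint16_t>(Packed >> 16)};
  return static_cast<uint32_t>(Elt[M0]) |
         (static_cast<uint32_t>(Elt[M1]) << 16);
}

// Models G_TRUNC of the (bitcast) shuffle result down to 16 bits.
uint16_t truncLow(uint32_t V) { return static_cast<uint16_t>(V); }
```

With masks (1, 0) and (1, 1), element 1 of the source ends up in the low half, so truncating the shuffle result yields the same value as shifting right by 16 and truncating. That is the property the matcher relies on when it sets op_sel for the source.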

		std::pair<Register, unsigned>
		AMDGPUInstructionSelector::selectVOP3PMadMixModsImpl(MachineOperand &Root,
		bool &Matched) const {
		Matched = false;

		Register Src;
		unsigned Mods;
		std::tie(Src, Mods) = selectVOP3ModsImpl(Root);

		MachineInstr *MI = getDefIgnoringCopies(Src, *MRI);
		if (MI->getOpcode() == AMDGPU::G_FPEXT) {
		MachineOperand *MO = &MI->getOperand(1);
		Src = MO->getReg();
		MI = getDefIgnoringCopies(Src, *MRI);

		// FIXME: add assert back?
		// assert(MO->getValueType() == MVT::f16);
		arsenm: Asserting on s16 is adequate

		// See through bitcasts.
		// FIXME: Would be nice to use stripBitCast here.
		if (MI->getOpcode() == AMDGPU::G_BITCAST) {
		MO = &MI->getOperand(1);
		Src = MO->getReg();
		MI = getDefIgnoringCopies(Src, *MRI);
		}

		const auto CheckAbsNeg = [&]() {
		// Be careful about folding modifiers if we already have an abs. fneg is
		// applied last, so we don't want to apply an earlier fneg.
		if ((Mods & SISrcMods::ABS) == 0) {
		unsigned ModsTmp;
		std::tie(Src, ModsTmp) = selectVOP3ModsImpl(*MO);
		MI = getDefIgnoringCopies(Src, *MRI);

		if ((ModsTmp & SISrcMods::NEG) != 0)
		Mods ^= SISrcMods::NEG;

		if ((ModsTmp & SISrcMods::ABS) != 0)
		Mods |= SISrcMods::ABS;
		}
		};

		CheckAbsNeg();

		// op_sel/op_sel_hi decide the source type and source.
		// If the source's op_sel_hi is set, it indicates to do a conversion from
		// fp16. If the source's op_sel is set, it picks the high half of the
		// source register.

		Mods |= SISrcMods::OP_SEL_1;

		if (MachineInstr *ExtractHiEltMI = isExtractHiElt(MI, *MRI)) {
		Mods |= SISrcMods::OP_SEL_0;
		MI = ExtractHiEltMI;
		MO = &MI->getOperand(0);
		Src = MO->getReg();

		CheckAbsNeg();
		}

		Matched = true;
		}

		return {Src, Mods};
		}

		InstructionSelector::ComplexRendererFns
		AMDGPUInstructionSelector::selectVOP3PMadMixMods(MachineOperand &Root) const {
		Register Src;
		unsigned Mods;
		bool Matched;
		std::tie(Src, Mods) = selectVOP3PMadMixModsImpl(Root, Matched);

		return {{
		[=](MachineInstrBuilder &MIB) { MIB.addReg(Src); },
		[=](MachineInstrBuilder &MIB) { MIB.addImm(Mods); } // src_mods
		}};
		}

void AMDGPUInstructionSelector::renderTruncImm32(MachineInstrBuilder &MIB,		void AMDGPUInstructionSelector::renderTruncImm32(MachineInstrBuilder &MIB,
const MachineInstr &MI,		const MachineInstr &MI,
int OpIdx) const {		int OpIdx) const {
assert(MI.getOpcode() == TargetOpcode::G_CONSTANT && OpIdx == -1 &&		assert(MI.getOpcode() == TargetOpcode::G_CONSTANT && OpIdx == -1 &&
"Expected G_CONSTANT");		"Expected G_CONSTANT");
MIB.addImm(MI.getOperand(1).getCImm()->getSExtValue());		MIB.addImm(MI.getOperand(1).getCImm()->getSExtValue());
}		}


llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp

	bool AMDGPURegBankCombinerHelper::matchFPMinMaxToClamp(MachineInstr &MI,			bool AMDGPURegBankCombinerHelper::matchFPMinMaxToClamp(MachineInstr &MI,
	Register &Reg) {			Register &Reg) {
	// Clamp is available on all types after regbankselect (f16, f32, f64, v2f16).			// Clamp is available on all types after regbankselect (f16, f32, f64, v2f16).
	auto OpcodeTriple = getMinMaxPair(MI.getOpcode());			auto OpcodeTriple = getMinMaxPair(MI.getOpcode());
	Register Val;			Register Val;
	Optional<FPValueAndVReg> K0, K1;			Optional<FPValueAndVReg> K0, K1;
	// Match min(max(Val, K0), K1) or max(min(Val, K1), K0).			// Match min(max(Val, K0), K1) or max(min(Val, K1), K0).
	if (!matchMed<GFCstOrSplatGFCstMatch>(MI, MRI, OpcodeTriple, Val, K0, K1))			if (!matchMed<GFCstOrSplatGFCstMatch>(MI, MRI, OpcodeTriple, Val, K0, K1))
	return false;

				return false;
				arsenm: Don't see what happened here?
	if (!K0->Value.isExactlyValue(0.0) || !K1->Value.isExactlyValue(1.0))			if (!K0->Value.isExactlyValue(0.0) || !K1->Value.isExactlyValue(1.0))
	return false;			return false;

	// For IEEE=false perform combine only when it's safe to assume that there are			// For IEEE=false perform combine only when it's safe to assume that there are
	// no NaN inputs. Most often MI is marked with nnan fast math flag.			// no NaN inputs. Most often MI is marked with nnan fast math flag.
	// For IEEE=true consider NaN inputs. Only min(max(QNaN, 0.0), 1.0) evaluates			// For IEEE=true consider NaN inputs. Only min(max(QNaN, 0.0), 1.0) evaluates
	// to 0.0 requires dx10_clamp = true.			// to 0.0 requires dx10_clamp = true.
	if ((getIEEE() && getDX10Clamp() && isFminnumIeee(MI) &&			if ((getIEEE() && getDX10Clamp() && isFminnumIeee(MI) &&
	isKnownNeverSNaN(Val, MRI)) ||			isKnownNeverSNaN(Val, MRI)) ||
	isKnownNeverNaN(MI.getOperand(0).getReg(), MRI)) {			isKnownNeverNaN(MI.getOperand(0).getReg(), MRI)) {
	Reg = Val;			Reg = Val;
				dbgs() << " yes:\n";
				arsenm: Leftover debugging
	return true;			return true;
	}			}

	return false;			return false;
	}			}

	// Replacing fmed3(NaN, 0.0, 1.0) with clamp. Requires dx10_clamp = true.			// Replacing fmed3(NaN, 0.0, 1.0) with clamp. Requires dx10_clamp = true.
	// Val = SNaN only for ieee = true. It is important which operand is NaN.			// Val = SNaN only for ieee = true. It is important which operand is NaN.

llvm/lib/Target/AMDGPU/VOP3PInstructions.td

multiclass MadFmaMixPats<SDPatternOperator fma_like,
def : GCNPat <		def : GCNPat <
(build_vector f16:$elt0, (fpround (fma_like (f32 (VOP3PMadMixMods f16:$src0, i32:$src0_modifiers)),		(build_vector f16:$elt0, (fpround (fma_like (f32 (VOP3PMadMixMods f16:$src0, i32:$src0_modifiers)),
(f32 (VOP3PMadMixMods f16:$src1, i32:$src1_modifiers)),		(f32 (VOP3PMadMixMods f16:$src1, i32:$src1_modifiers)),
(f32 (VOP3PMadMixMods f16:$src2, i32:$src2_modifiers))))),		(f32 (VOP3PMadMixMods f16:$src2, i32:$src2_modifiers))))),
(v2f16 (mixhi_inst $src0_modifiers, $src0,		(v2f16 (mixhi_inst $src0_modifiers, $src0,
$src1_modifiers, $src1,		$src1_modifiers, $src1,
$src2_modifiers, $src2,		$src2_modifiers, $src2,
DSTCLAMP.NONE,		DSTCLAMP.NONE,
$elt0))		VGPR_32:$elt0))
>;		>;

def : GCNPat <		def : GCNPat <
(build_vector		(build_vector
f16:$elt0,		f16:$elt0,
(AMDGPUclamp (fpround (fma_like (f32 (VOP3PMadMixMods f16:$src0, i32:$src0_modifiers)),		(AMDGPUclamp (fpround (fma_like (f32 (VOP3PMadMixMods f16:$src0, i32:$src0_modifiers)),
(f32 (VOP3PMadMixMods f16:$src1, i32:$src1_modifiers)),		(f32 (VOP3PMadMixMods f16:$src1, i32:$src1_modifiers)),
(f32 (VOP3PMadMixMods f16:$src2, i32:$src2_modifiers)))))),		(f32 (VOP3PMadMixMods f16:$src2, i32:$src2_modifiers)))))),
(v2f16 (mixhi_inst $src0_modifiers, $src0,		(v2f16 (mixhi_inst $src0_modifiers, $src0,
$src1_modifiers, $src1,		$src1_modifiers, $src1,
$src2_modifiers, $src2,		$src2_modifiers, $src2,
DSTCLAMP.ENABLE,		DSTCLAMP.ENABLE,
$elt0))		VGPR_32:$elt0))
>;		>;

def : GCNPat <		def : GCNPat <
(AMDGPUclamp (build_vector		(AMDGPUclamp (build_vector
(fpround (fma_like (f32 (VOP3PMadMixMods f16:$lo_src0, i32:$lo_src0_modifiers)),		(fpround (fma_like (f32 (VOP3PMadMixMods f16:$lo_src0, i32:$lo_src0_modifiers)),
(f32 (VOP3PMadMixMods f16:$lo_src1, i32:$lo_src1_modifiers)),		(f32 (VOP3PMadMixMods f16:$lo_src1, i32:$lo_src1_modifiers)),
(f32 (VOP3PMadMixMods f16:$lo_src2, i32:$lo_src2_modifiers)))),		(f32 (VOP3PMadMixMods f16:$lo_src2, i32:$lo_src2_modifiers)))),
(fpround (fma_like (f32 (VOP3PMadMixMods f16:$hi_src0, i32:$hi_src0_modifiers)),		(fpround (fma_like (f32 (VOP3PMadMixMods f16:$hi_src0, i32:$hi_src0_modifiers)),

llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-ext-fma.ll

; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -global-isel -march=amdgcn -mcpu=gfx900 --denormal-fp-math=preserve-sign < %s | FileCheck -check-prefix=GFX9-DENORM %s		; RUN: llc -global-isel -march=amdgcn -mcpu=gfx900 --denormal-fp-math=preserve-sign < %s | FileCheck -check-prefix=GFX9-DENORM %s
; RUN: llc -global-isel -march=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s		; RUN: llc -global-isel -march=amdgcn -mcpu=gfx1010 < %s | FileCheck -check-prefix=GFX10 %s
; RUN: llc -global-isel -march=amdgcn -mcpu=gfx1010 -fp-contract=fast < %s | FileCheck -check-prefix=GFX10-CONTRACT %s		; RUN: llc -global-isel -march=amdgcn -mcpu=gfx1010 -fp-contract=fast < %s | FileCheck -check-prefix=GFX10-CONTRACT %s
; RUN: llc -global-isel -march=amdgcn -mcpu=gfx1010 --denormal-fp-math=preserve-sign < %s | FileCheck -check-prefix=GFX10-DENORM %s		; RUN: llc -global-isel -march=amdgcn -mcpu=gfx1010 --denormal-fp-math=preserve-sign < %s | FileCheck -check-prefix=GFX10-DENORM %s

; fold (fadd (fma x, y, (fpext (fmul u, v))), z) -> (fma x, y, (fma (fpext u), (fpext v), z))		; fold (fadd (fma x, y, (fpext (fmul u, v))), z) -> (fma x, y, (fma (fpext u), (fpext v), z))
define amdgpu_vs float @test_f16_f32_add_fma_ext_mul(float %x, float %y, float %z, half %u, half %v) {		define amdgpu_vs float @test_f16_f32_add_fma_ext_mul(float %x, float %y, float %z, half %u, half %v) {
; GFX9-DENORM-LABEL: test_f16_f32_add_fma_ext_mul:		; GFX9-DENORM-LABEL: test_f16_f32_add_fma_ext_mul:
; GFX9-DENORM: ; %bb.0: ; %.entry		; GFX9-DENORM: ; %bb.0: ; %.entry
; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v3, v3		; GFX9-DENORM-NEXT: v_mad_mix_f32 v2, v3, v4, v2 op_sel_hi:[1,1,0]
; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v4, v4
; GFX9-DENORM-NEXT: v_mad_f32 v2, v3, v4, v2
; GFX9-DENORM-NEXT: v_mac_f32_e32 v2, v0, v1		; GFX9-DENORM-NEXT: v_mac_f32_e32 v2, v0, v1
; GFX9-DENORM-NEXT: v_mov_b32_e32 v0, v2		; GFX9-DENORM-NEXT: v_mov_b32_e32 v0, v2
; GFX9-DENORM-NEXT: ; return to shader part epilog		; GFX9-DENORM-NEXT: ; return to shader part epilog
;		;
; GFX10-LABEL: test_f16_f32_add_fma_ext_mul:		; GFX10-LABEL: test_f16_f32_add_fma_ext_mul:
; GFX10: ; %bb.0: ; %.entry		; GFX10: ; %bb.0: ; %.entry
; GFX10-NEXT: v_mul_f16_e32 v3, v3, v4		; GFX10-NEXT: v_mul_f16_e32 v3, v3, v4
; GFX10-NEXT: v_cvt_f32_f16_e32 v3, v3		; GFX10-NEXT: v_fma_mix_f32 v0, v0, v1, v3 op_sel_hi:[0,0,1]
; GFX10-NEXT: v_fmac_f32_e32 v3, v0, v1		; GFX10-NEXT: v_add_f32_e32 v0, v0, v2
; GFX10-NEXT: v_add_f32_e32 v0, v3, v2
; GFX10-NEXT: ; return to shader part epilog		; GFX10-NEXT: ; return to shader part epilog
;		;
; GFX10-CONTRACT-LABEL: test_f16_f32_add_fma_ext_mul:		; GFX10-CONTRACT-LABEL: test_f16_f32_add_fma_ext_mul:
; GFX10-CONTRACT: ; %bb.0: ; %.entry		; GFX10-CONTRACT: ; %bb.0: ; %.entry
; GFX10-CONTRACT-NEXT: v_mul_f16_e32 v3, v3, v4		; GFX10-CONTRACT-NEXT: v_mul_f16_e32 v3, v3, v4
; GFX10-CONTRACT-NEXT: v_cvt_f32_f16_e32 v3, v3		; GFX10-CONTRACT-NEXT: v_fma_mix_f32 v0, v0, v1, v3 op_sel_hi:[0,0,1]
; GFX10-CONTRACT-NEXT: v_fmac_f32_e32 v3, v0, v1		; GFX10-CONTRACT-NEXT: v_add_f32_e32 v0, v0, v2
; GFX10-CONTRACT-NEXT: v_add_f32_e32 v0, v3, v2
; GFX10-CONTRACT-NEXT: ; return to shader part epilog		; GFX10-CONTRACT-NEXT: ; return to shader part epilog
;		;
; GFX10-DENORM-LABEL: test_f16_f32_add_fma_ext_mul:		; GFX10-DENORM-LABEL: test_f16_f32_add_fma_ext_mul:
; GFX10-DENORM: ; %bb.0: ; %.entry		; GFX10-DENORM: ; %bb.0: ; %.entry
; GFX10-DENORM-NEXT: v_mul_f16_e32 v3, v3, v4		; GFX10-DENORM-NEXT: v_mul_f16_e32 v3, v3, v4
; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v3, v3		; GFX10-DENORM-NEXT: v_fma_mix_f32 v0, v0, v1, v3 op_sel_hi:[0,0,1]
; GFX10-DENORM-NEXT: v_fmac_f32_e32 v3, v0, v1		; GFX10-DENORM-NEXT: v_add_f32_e32 v0, v0, v2
; GFX10-DENORM-NEXT: v_add_f32_e32 v0, v3, v2
; GFX10-DENORM-NEXT: ; return to shader part epilog		; GFX10-DENORM-NEXT: ; return to shader part epilog
.entry:		.entry:
%a = fmul half %u, %v		%a = fmul half %u, %v
%b = fpext half %a to float		%b = fpext half %a to float
%c = call float @llvm.fmuladd.f32(float %x, float %y, float %b)		%c = call float @llvm.fmuladd.f32(float %x, float %y, float %b)
%d = fadd float %c, %z		%d = fadd float %c, %z
ret float %d		ret float %d
}		}

; fold (fadd (fpext (fma x, y, (fmul u, v))), z) -> (fma (fpext x), (fpext y), (fma (fpext u), (fpext v), z))		; fold (fadd (fpext (fma x, y, (fmul u, v))), z) -> (fma (fpext x), (fpext y), (fma (fpext u), (fpext v), z))
define amdgpu_vs float @test_f16_f32_add_ext_fma_mul(half %x, half %y, float %z, half %u, half %v) {		define amdgpu_vs float @test_f16_f32_add_ext_fma_mul(half %x, half %y, float %z, half %u, half %v) {
; GFX9-DENORM-LABEL: test_f16_f32_add_ext_fma_mul:		; GFX9-DENORM-LABEL: test_f16_f32_add_ext_fma_mul:
; GFX9-DENORM: ; %bb.0: ; %.entry		; GFX9-DENORM: ; %bb.0: ; %.entry
; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v5, v0		; GFX9-DENORM-NEXT: v_mad_mix_f32 v2, v3, v4, v2 op_sel_hi:[1,1,0]
; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v0, v3		; GFX9-DENORM-NEXT: v_mad_mix_f32 v0, v0, v1, v2 op_sel_hi:[1,1,0]
; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v3, v4
; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v1, v1
; GFX9-DENORM-NEXT: v_mad_f32 v0, v0, v3, v2
; GFX9-DENORM-NEXT: v_mac_f32_e32 v0, v5, v1
; GFX9-DENORM-NEXT: ; return to shader part epilog		; GFX9-DENORM-NEXT: ; return to shader part epilog
;		;
; GFX10-LABEL: test_f16_f32_add_ext_fma_mul:		; GFX10-LABEL: test_f16_f32_add_ext_fma_mul:
; GFX10: ; %bb.0: ; %.entry		; GFX10: ; %bb.0: ; %.entry
; GFX10-NEXT: v_mul_f16_e32 v3, v3, v4		; GFX10-NEXT: v_mul_f16_e32 v3, v3, v4
; GFX10-NEXT: v_fmac_f16_e32 v3, v0, v1		; GFX10-NEXT: v_fmac_f16_e32 v3, v0, v1
; GFX10-NEXT: v_cvt_f32_f16_e32 v0, v3		; GFX10-NEXT: v_cvt_f32_f16_e32 v0, v3
; GFX10-NEXT: v_add_f32_e32 v0, v0, v2		; GFX10-NEXT: v_add_f32_e32 v0, v0, v2
.entry:
%d = fadd float %c, %z		%d = fadd float %c, %z
ret float %d		ret float %d
}		}

; fold (fadd x, (fma y, z, (fpext (fmul u, v))) -> (fma y, z, (fma (fpext u), (fpext v), x))		; fold (fadd x, (fma y, z, (fpext (fmul u, v))) -> (fma y, z, (fma (fpext u), (fpext v), x))
define amdgpu_vs float @test_f16_f32_add_fma_ext_mul_rhs(float %x, float %y, float %z, half %u, half %v) {		define amdgpu_vs float @test_f16_f32_add_fma_ext_mul_rhs(float %x, float %y, float %z, half %u, half %v) {
; GFX9-DENORM-LABEL: test_f16_f32_add_fma_ext_mul_rhs:		; GFX9-DENORM-LABEL: test_f16_f32_add_fma_ext_mul_rhs:
; GFX9-DENORM: ; %bb.0: ; %.entry		; GFX9-DENORM: ; %bb.0: ; %.entry
; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v3, v3		; GFX9-DENORM-NEXT: v_mad_mix_f32 v0, v3, v4, v0 op_sel_hi:[1,1,0]
; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v4, v4
; GFX9-DENORM-NEXT: v_mac_f32_e32 v0, v3, v4
; GFX9-DENORM-NEXT: v_mac_f32_e32 v0, v1, v2		; GFX9-DENORM-NEXT: v_mac_f32_e32 v0, v1, v2
; GFX9-DENORM-NEXT: ; return to shader part epilog		; GFX9-DENORM-NEXT: ; return to shader part epilog
;		;
; GFX10-LABEL: test_f16_f32_add_fma_ext_mul_rhs:		; GFX10-LABEL: test_f16_f32_add_fma_ext_mul_rhs:
; GFX10: ; %bb.0: ; %.entry		; GFX10: ; %bb.0: ; %.entry
; GFX10-NEXT: v_mul_f16_e32 v3, v3, v4		; GFX10-NEXT: v_mul_f16_e32 v3, v3, v4
; GFX10-NEXT: v_cvt_f32_f16_e32 v3, v3		; GFX10-NEXT: v_fma_mix_f32 v1, v1, v2, v3 op_sel_hi:[0,0,1]
; GFX10-NEXT: v_fmac_f32_e32 v3, v1, v2		; GFX10-NEXT: v_add_f32_e32 v0, v0, v1
; GFX10-NEXT: v_add_f32_e32 v0, v0, v3
; GFX10-NEXT: ; return to shader part epilog		; GFX10-NEXT: ; return to shader part epilog
;		;
; GFX10-CONTRACT-LABEL: test_f16_f32_add_fma_ext_mul_rhs:		; GFX10-CONTRACT-LABEL: test_f16_f32_add_fma_ext_mul_rhs:
; GFX10-CONTRACT: ; %bb.0: ; %.entry		; GFX10-CONTRACT: ; %bb.0: ; %.entry
; GFX10-CONTRACT-NEXT: v_mul_f16_e32 v3, v3, v4		; GFX10-CONTRACT-NEXT: v_mul_f16_e32 v3, v3, v4
; GFX10-CONTRACT-NEXT: v_cvt_f32_f16_e32 v3, v3		; GFX10-CONTRACT-NEXT: v_fma_mix_f32 v1, v1, v2, v3 op_sel_hi:[0,0,1]
; GFX10-CONTRACT-NEXT: v_fmac_f32_e32 v3, v1, v2		; GFX10-CONTRACT-NEXT: v_add_f32_e32 v0, v0, v1
; GFX10-CONTRACT-NEXT: v_add_f32_e32 v0, v0, v3
; GFX10-CONTRACT-NEXT: ; return to shader part epilog		; GFX10-CONTRACT-NEXT: ; return to shader part epilog
;		;
; GFX10-DENORM-LABEL: test_f16_f32_add_fma_ext_mul_rhs:		; GFX10-DENORM-LABEL: test_f16_f32_add_fma_ext_mul_rhs:
; GFX10-DENORM: ; %bb.0: ; %.entry		; GFX10-DENORM: ; %bb.0: ; %.entry
; GFX10-DENORM-NEXT: v_mul_f16_e32 v3, v3, v4		; GFX10-DENORM-NEXT: v_mul_f16_e32 v3, v3, v4
; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v3, v3		; GFX10-DENORM-NEXT: v_fma_mix_f32 v1, v1, v2, v3 op_sel_hi:[0,0,1]
; GFX10-DENORM-NEXT: v_fmac_f32_e32 v3, v1, v2		; GFX10-DENORM-NEXT: v_add_f32_e32 v0, v0, v1
; GFX10-DENORM-NEXT: v_add_f32_e32 v0, v0, v3
; GFX10-DENORM-NEXT: ; return to shader part epilog		; GFX10-DENORM-NEXT: ; return to shader part epilog
.entry:		.entry:
%a = fmul half %u, %v		%a = fmul half %u, %v
%b = fpext half %a to float		%b = fpext half %a to float
%c = call float @llvm.fmuladd.f32(float %y, float %z, float %b)		%c = call float @llvm.fmuladd.f32(float %y, float %z, float %b)
%d = fadd float %x, %c		%d = fadd float %x, %c
ret float %d		ret float %d
}		}

; fold (fadd x, (fpext (fma y, z, (fmul u, v))) -> (fma (fpext y), (fpext z), (fma (fpext u), (fpext v), x))		; fold (fadd x, (fpext (fma y, z, (fmul u, v))) -> (fma (fpext y), (fpext z), (fma (fpext u), (fpext v), x))
define amdgpu_vs float @test_f16_f32_add_ext_fma_mul_rhs(float %x, half %y, half %z, half %u, half %v) {		define amdgpu_vs float @test_f16_f32_add_ext_fma_mul_rhs(float %x, half %y, half %z, half %u, half %v) {
; GFX9-DENORM-LABEL: test_f16_f32_add_ext_fma_mul_rhs:		; GFX9-DENORM-LABEL: test_f16_f32_add_ext_fma_mul_rhs:
; GFX9-DENORM: ; %bb.0: ; %.entry		; GFX9-DENORM: ; %bb.0: ; %.entry
; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v3, v3		; GFX9-DENORM-NEXT: v_mad_mix_f32 v0, v3, v4, v0 op_sel_hi:[1,1,0]
; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v4, v4		; GFX9-DENORM-NEXT: v_mad_mix_f32 v0, v1, v2, v0 op_sel_hi:[1,1,0]
; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v1, v1
; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v2, v2
; GFX9-DENORM-NEXT: v_mac_f32_e32 v0, v3, v4
; GFX9-DENORM-NEXT: v_mac_f32_e32 v0, v1, v2
; GFX9-DENORM-NEXT: ; return to shader part epilog		; GFX9-DENORM-NEXT: ; return to shader part epilog
;		;
; GFX10-LABEL: test_f16_f32_add_ext_fma_mul_rhs:		; GFX10-LABEL: test_f16_f32_add_ext_fma_mul_rhs:
; GFX10: ; %bb.0: ; %.entry		; GFX10: ; %bb.0: ; %.entry
; GFX10-NEXT: v_mul_f16_e32 v3, v3, v4		; GFX10-NEXT: v_mul_f16_e32 v3, v3, v4
; GFX10-NEXT: v_fmac_f16_e32 v3, v1, v2		; GFX10-NEXT: v_fmac_f16_e32 v3, v1, v2
; GFX10-NEXT: v_cvt_f32_f16_e32 v1, v3		; GFX10-NEXT: v_cvt_f32_f16_e32 v1, v3
; GFX10-NEXT: v_add_f32_e32 v0, v0, v1		; GFX10-NEXT: v_add_f32_e32 v0, v0, v1
[24 lines not shown]
}		}

; fold (fadd (fma x, y, (fpext (fmul u, v))), z) -> (fma x, y, (fma (fpext u), (fpext v), z))		; fold (fadd (fma x, y, (fpext (fmul u, v))), z) -> (fma x, y, (fma (fpext u), (fpext v), z))
define amdgpu_vs <4 x float> @test_v4f16_v4f32_add_fma_ext_mul(<4 x float> %x, <4 x float> %y, <4 x float> %z, <4 x half> %u, <4 x half> %v) {		define amdgpu_vs <4 x float> @test_v4f16_v4f32_add_fma_ext_mul(<4 x float> %x, <4 x float> %y, <4 x float> %z, <4 x half> %u, <4 x half> %v) {
; GFX9-DENORM-LABEL: test_v4f16_v4f32_add_fma_ext_mul:		; GFX9-DENORM-LABEL: test_v4f16_v4f32_add_fma_ext_mul:
; GFX9-DENORM: ; %bb.0: ; %.entry		; GFX9-DENORM: ; %bb.0: ; %.entry
; GFX9-DENORM-NEXT: v_pk_mul_f16 v12, v12, v14		; GFX9-DENORM-NEXT: v_pk_mul_f16 v12, v12, v14
; GFX9-DENORM-NEXT: v_pk_mul_f16 v13, v13, v15		; GFX9-DENORM-NEXT: v_pk_mul_f16 v13, v13, v15
; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v14, v12		; GFX9-DENORM-NEXT: v_mad_mix_f32 v0, v0, v4, v12 op_sel_hi:[0,0,1]
; GFX9-DENORM-NEXT: v_cvt_f32_f16_sdwa v12, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; GFX9-DENORM-NEXT: v_mad_mix_f32 v1, v1, v5, v12 op_sel:[0,0,1] op_sel_hi:[0,0,1]
; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v15, v13		; GFX9-DENORM-NEXT: v_mad_mix_f32 v2, v2, v6, v13 op_sel_hi:[0,0,1]
; GFX9-DENORM-NEXT: v_cvt_f32_f16_sdwa v13, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; GFX9-DENORM-NEXT: v_mad_mix_f32 v3, v3, v7, v13 op_sel:[0,0,1] op_sel_hi:[0,0,1]
; GFX9-DENORM-NEXT: v_mac_f32_e32 v14, v0, v4		; GFX9-DENORM-NEXT: v_add_f32_e32 v0, v0, v8
; GFX9-DENORM-NEXT: v_mac_f32_e32 v12, v1, v5		; GFX9-DENORM-NEXT: v_add_f32_e32 v1, v1, v9
; GFX9-DENORM-NEXT: v_mac_f32_e32 v15, v2, v6		; GFX9-DENORM-NEXT: v_add_f32_e32 v2, v2, v10
; GFX9-DENORM-NEXT: v_mac_f32_e32 v13, v3, v7		; GFX9-DENORM-NEXT: v_add_f32_e32 v3, v3, v11
; GFX9-DENORM-NEXT: v_add_f32_e32 v0, v14, v8
; GFX9-DENORM-NEXT: v_add_f32_e32 v1, v12, v9
; GFX9-DENORM-NEXT: v_add_f32_e32 v2, v15, v10
; GFX9-DENORM-NEXT: v_add_f32_e32 v3, v13, v11
; GFX9-DENORM-NEXT: ; return to shader part epilog		; GFX9-DENORM-NEXT: ; return to shader part epilog
;		;
; GFX10-LABEL: test_v4f16_v4f32_add_fma_ext_mul:		; GFX10-LABEL: test_v4f16_v4f32_add_fma_ext_mul:
; GFX10: ; %bb.0: ; %.entry		; GFX10: ; %bb.0: ; %.entry
; GFX10-NEXT: v_pk_mul_f16 v12, v12, v14		; GFX10-NEXT: v_pk_mul_f16 v12, v12, v14
; GFX10-NEXT: v_pk_mul_f16 v13, v13, v15		; GFX10-NEXT: v_pk_mul_f16 v13, v13, v15
; GFX10-NEXT: v_cvt_f32_f16_e32 v14, v12		; GFX10-NEXT: v_fma_mix_f32 v0, v0, v4, v12 op_sel_hi:[0,0,1]
; GFX10-NEXT: v_cvt_f32_f16_sdwa v12, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; GFX10-NEXT: v_fma_mix_f32 v1, v1, v5, v12 op_sel:[0,0,1] op_sel_hi:[0,0,1]
; GFX10-NEXT: v_cvt_f32_f16_e32 v15, v13		; GFX10-NEXT: v_fma_mix_f32 v2, v2, v6, v13 op_sel_hi:[0,0,1]
; GFX10-NEXT: v_cvt_f32_f16_sdwa v13, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; GFX10-NEXT: v_fma_mix_f32 v3, v3, v7, v13 op_sel:[0,0,1] op_sel_hi:[0,0,1]
; GFX10-NEXT: v_fmac_f32_e32 v14, v0, v4		; GFX10-NEXT: v_add_f32_e32 v0, v0, v8
; GFX10-NEXT: v_fmac_f32_e32 v12, v1, v5		; GFX10-NEXT: v_add_f32_e32 v1, v1, v9
; GFX10-NEXT: v_fmac_f32_e32 v15, v2, v6		; GFX10-NEXT: v_add_f32_e32 v2, v2, v10
; GFX10-NEXT: v_fmac_f32_e32 v13, v3, v7		; GFX10-NEXT: v_add_f32_e32 v3, v3, v11
; GFX10-NEXT: v_add_f32_e32 v0, v14, v8
; GFX10-NEXT: v_add_f32_e32 v1, v12, v9
; GFX10-NEXT: v_add_f32_e32 v2, v15, v10
; GFX10-NEXT: v_add_f32_e32 v3, v13, v11
; GFX10-NEXT: ; return to shader part epilog		; GFX10-NEXT: ; return to shader part epilog
;		;
; GFX10-CONTRACT-LABEL: test_v4f16_v4f32_add_fma_ext_mul:		; GFX10-CONTRACT-LABEL: test_v4f16_v4f32_add_fma_ext_mul:
; GFX10-CONTRACT: ; %bb.0: ; %.entry		; GFX10-CONTRACT: ; %bb.0: ; %.entry
; GFX10-CONTRACT-NEXT: v_pk_mul_f16 v12, v12, v14		; GFX10-CONTRACT-NEXT: v_pk_mul_f16 v12, v12, v14
; GFX10-CONTRACT-NEXT: v_pk_mul_f16 v13, v13, v15		; GFX10-CONTRACT-NEXT: v_pk_mul_f16 v13, v13, v15
; GFX10-CONTRACT-NEXT: v_cvt_f32_f16_e32 v14, v12		; GFX10-CONTRACT-NEXT: v_fma_mix_f32 v0, v0, v4, v12 op_sel_hi:[0,0,1]
; GFX10-CONTRACT-NEXT: v_cvt_f32_f16_sdwa v12, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; GFX10-CONTRACT-NEXT: v_fma_mix_f32 v1, v1, v5, v12 op_sel:[0,0,1] op_sel_hi:[0,0,1]
; GFX10-CONTRACT-NEXT: v_cvt_f32_f16_e32 v15, v13		; GFX10-CONTRACT-NEXT: v_fma_mix_f32 v2, v2, v6, v13 op_sel_hi:[0,0,1]
; GFX10-CONTRACT-NEXT: v_cvt_f32_f16_sdwa v13, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; GFX10-CONTRACT-NEXT: v_fma_mix_f32 v3, v3, v7, v13 op_sel:[0,0,1] op_sel_hi:[0,0,1]
; GFX10-CONTRACT-NEXT: v_fmac_f32_e32 v14, v0, v4		; GFX10-CONTRACT-NEXT: v_add_f32_e32 v0, v0, v8
; GFX10-CONTRACT-NEXT: v_fmac_f32_e32 v12, v1, v5		; GFX10-CONTRACT-NEXT: v_add_f32_e32 v1, v1, v9
; GFX10-CONTRACT-NEXT: v_fmac_f32_e32 v15, v2, v6		; GFX10-CONTRACT-NEXT: v_add_f32_e32 v2, v2, v10
; GFX10-CONTRACT-NEXT: v_fmac_f32_e32 v13, v3, v7		; GFX10-CONTRACT-NEXT: v_add_f32_e32 v3, v3, v11
; GFX10-CONTRACT-NEXT: v_add_f32_e32 v0, v14, v8
; GFX10-CONTRACT-NEXT: v_add_f32_e32 v1, v12, v9
; GFX10-CONTRACT-NEXT: v_add_f32_e32 v2, v15, v10
; GFX10-CONTRACT-NEXT: v_add_f32_e32 v3, v13, v11
; GFX10-CONTRACT-NEXT: ; return to shader part epilog		; GFX10-CONTRACT-NEXT: ; return to shader part epilog
;		;
; GFX10-DENORM-LABEL: test_v4f16_v4f32_add_fma_ext_mul:		; GFX10-DENORM-LABEL: test_v4f16_v4f32_add_fma_ext_mul:
; GFX10-DENORM: ; %bb.0: ; %.entry		; GFX10-DENORM: ; %bb.0: ; %.entry
; GFX10-DENORM-NEXT: v_pk_mul_f16 v12, v12, v14		; GFX10-DENORM-NEXT: v_pk_mul_f16 v12, v12, v14
; GFX10-DENORM-NEXT: v_pk_mul_f16 v13, v13, v15		; GFX10-DENORM-NEXT: v_pk_mul_f16 v13, v13, v15
; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v14, v12		; GFX10-DENORM-NEXT: v_fma_mix_f32 v0, v0, v4, v12 op_sel_hi:[0,0,1]
; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v12, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; GFX10-DENORM-NEXT: v_fma_mix_f32 v1, v1, v5, v12 op_sel:[0,0,1] op_sel_hi:[0,0,1]
; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v15, v13		; GFX10-DENORM-NEXT: v_fma_mix_f32 v2, v2, v6, v13 op_sel_hi:[0,0,1]
; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v13, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; GFX10-DENORM-NEXT: v_fma_mix_f32 v3, v3, v7, v13 op_sel:[0,0,1] op_sel_hi:[0,0,1]
; GFX10-DENORM-NEXT: v_fmac_f32_e32 v14, v0, v4		; GFX10-DENORM-NEXT: v_add_f32_e32 v0, v0, v8
; GFX10-DENORM-NEXT: v_fmac_f32_e32 v12, v1, v5		; GFX10-DENORM-NEXT: v_add_f32_e32 v1, v1, v9
; GFX10-DENORM-NEXT: v_fmac_f32_e32 v15, v2, v6		; GFX10-DENORM-NEXT: v_add_f32_e32 v2, v2, v10
; GFX10-DENORM-NEXT: v_fmac_f32_e32 v13, v3, v7		; GFX10-DENORM-NEXT: v_add_f32_e32 v3, v3, v11
; GFX10-DENORM-NEXT: v_add_f32_e32 v0, v14, v8
; GFX10-DENORM-NEXT: v_add_f32_e32 v1, v12, v9
; GFX10-DENORM-NEXT: v_add_f32_e32 v2, v15, v10
; GFX10-DENORM-NEXT: v_add_f32_e32 v3, v13, v11
; GFX10-DENORM-NEXT: ; return to shader part epilog		; GFX10-DENORM-NEXT: ; return to shader part epilog
.entry:		.entry:
%a = fmul <4 x half> %u, %v		%a = fmul <4 x half> %u, %v
%b = fpext <4 x half> %a to <4 x float>		%b = fpext <4 x half> %a to <4 x float>
%c = call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %x, <4 x float> %y, <4 x float> %b)		%c = call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %x, <4 x float> %y, <4 x float> %b)
%d = fadd <4 x float> %c, %z		%d = fadd <4 x float> %c, %z
ret <4 x float> %d		ret <4 x float> %d
}		}
[76 lines not shown]
}		}

; fold (fadd x, (fma y, z, (fpext (fmul u, v))) -> (fma y, z, (fma (fpext u), (fpext v), x))		; fold (fadd x, (fma y, z, (fpext (fmul u, v))) -> (fma y, z, (fma (fpext u), (fpext v), x))
define amdgpu_vs <4 x float> @test_v4f16_v4f32_add_fma_ext_mul_rhs(<4 x float> %x, <4 x float> %y, <4 x float> %z, <4 x half> %u, <4 x half> %v) {		define amdgpu_vs <4 x float> @test_v4f16_v4f32_add_fma_ext_mul_rhs(<4 x float> %x, <4 x float> %y, <4 x float> %z, <4 x half> %u, <4 x half> %v) {
; GFX9-DENORM-LABEL: test_v4f16_v4f32_add_fma_ext_mul_rhs:		; GFX9-DENORM-LABEL: test_v4f16_v4f32_add_fma_ext_mul_rhs:
; GFX9-DENORM: ; %bb.0: ; %.entry		; GFX9-DENORM: ; %bb.0: ; %.entry
; GFX9-DENORM-NEXT: v_pk_mul_f16 v12, v12, v14		; GFX9-DENORM-NEXT: v_pk_mul_f16 v12, v12, v14
; GFX9-DENORM-NEXT: v_pk_mul_f16 v13, v13, v15		; GFX9-DENORM-NEXT: v_pk_mul_f16 v13, v13, v15
; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v14, v12		; GFX9-DENORM-NEXT: v_mad_mix_f32 v4, v4, v8, v12 op_sel_hi:[0,0,1]
; GFX9-DENORM-NEXT: v_cvt_f32_f16_sdwa v12, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; GFX9-DENORM-NEXT: v_mad_mix_f32 v5, v5, v9, v12 op_sel:[0,0,1] op_sel_hi:[0,0,1]
; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v15, v13		; GFX9-DENORM-NEXT: v_mad_mix_f32 v6, v6, v10, v13 op_sel_hi:[0,0,1]
; GFX9-DENORM-NEXT: v_cvt_f32_f16_sdwa v13, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; GFX9-DENORM-NEXT: v_mad_mix_f32 v7, v7, v11, v13 op_sel:[0,0,1] op_sel_hi:[0,0,1]
; GFX9-DENORM-NEXT: v_mac_f32_e32 v14, v4, v8		; GFX9-DENORM-NEXT: v_add_f32_e32 v0, v0, v4
; GFX9-DENORM-NEXT: v_mac_f32_e32 v12, v5, v9		; GFX9-DENORM-NEXT: v_add_f32_e32 v1, v1, v5
; GFX9-DENORM-NEXT: v_mac_f32_e32 v15, v6, v10		; GFX9-DENORM-NEXT: v_add_f32_e32 v2, v2, v6
; GFX9-DENORM-NEXT: v_mac_f32_e32 v13, v7, v11		; GFX9-DENORM-NEXT: v_add_f32_e32 v3, v3, v7
; GFX9-DENORM-NEXT: v_add_f32_e32 v0, v0, v14
; GFX9-DENORM-NEXT: v_add_f32_e32 v1, v1, v12
; GFX9-DENORM-NEXT: v_add_f32_e32 v2, v2, v15
; GFX9-DENORM-NEXT: v_add_f32_e32 v3, v3, v13
; GFX9-DENORM-NEXT: ; return to shader part epilog		; GFX9-DENORM-NEXT: ; return to shader part epilog
;		;
; GFX10-LABEL: test_v4f16_v4f32_add_fma_ext_mul_rhs:		; GFX10-LABEL: test_v4f16_v4f32_add_fma_ext_mul_rhs:
; GFX10: ; %bb.0: ; %.entry		; GFX10: ; %bb.0: ; %.entry
; GFX10-NEXT: v_pk_mul_f16 v12, v12, v14		; GFX10-NEXT: v_pk_mul_f16 v12, v12, v14
; GFX10-NEXT: v_pk_mul_f16 v13, v13, v15		; GFX10-NEXT: v_pk_mul_f16 v13, v13, v15
; GFX10-NEXT: v_cvt_f32_f16_e32 v14, v12		; GFX10-NEXT: v_fma_mix_f32 v4, v4, v8, v12 op_sel_hi:[0,0,1]
; GFX10-NEXT: v_cvt_f32_f16_sdwa v12, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; GFX10-NEXT: v_fma_mix_f32 v5, v5, v9, v12 op_sel:[0,0,1] op_sel_hi:[0,0,1]
; GFX10-NEXT: v_cvt_f32_f16_e32 v15, v13		; GFX10-NEXT: v_fma_mix_f32 v6, v6, v10, v13 op_sel_hi:[0,0,1]
; GFX10-NEXT: v_cvt_f32_f16_sdwa v13, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; GFX10-NEXT: v_fma_mix_f32 v7, v7, v11, v13 op_sel:[0,0,1] op_sel_hi:[0,0,1]
; GFX10-NEXT: v_fmac_f32_e32 v14, v4, v8		; GFX10-NEXT: v_add_f32_e32 v0, v0, v4
; GFX10-NEXT: v_fmac_f32_e32 v12, v5, v9		; GFX10-NEXT: v_add_f32_e32 v1, v1, v5
; GFX10-NEXT: v_fmac_f32_e32 v15, v6, v10		; GFX10-NEXT: v_add_f32_e32 v2, v2, v6
; GFX10-NEXT: v_fmac_f32_e32 v13, v7, v11		; GFX10-NEXT: v_add_f32_e32 v3, v3, v7
; GFX10-NEXT: v_add_f32_e32 v0, v0, v14
; GFX10-NEXT: v_add_f32_e32 v1, v1, v12
; GFX10-NEXT: v_add_f32_e32 v2, v2, v15
; GFX10-NEXT: v_add_f32_e32 v3, v3, v13
; GFX10-NEXT: ; return to shader part epilog		; GFX10-NEXT: ; return to shader part epilog
;		;
; GFX10-CONTRACT-LABEL: test_v4f16_v4f32_add_fma_ext_mul_rhs:		; GFX10-CONTRACT-LABEL: test_v4f16_v4f32_add_fma_ext_mul_rhs:
; GFX10-CONTRACT: ; %bb.0: ; %.entry		; GFX10-CONTRACT: ; %bb.0: ; %.entry
; GFX10-CONTRACT-NEXT: v_pk_mul_f16 v12, v12, v14		; GFX10-CONTRACT-NEXT: v_pk_mul_f16 v12, v12, v14
; GFX10-CONTRACT-NEXT: v_pk_mul_f16 v13, v13, v15		; GFX10-CONTRACT-NEXT: v_pk_mul_f16 v13, v13, v15
; GFX10-CONTRACT-NEXT: v_cvt_f32_f16_e32 v14, v12		; GFX10-CONTRACT-NEXT: v_fma_mix_f32 v4, v4, v8, v12 op_sel_hi:[0,0,1]
; GFX10-CONTRACT-NEXT: v_cvt_f32_f16_sdwa v12, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; GFX10-CONTRACT-NEXT: v_fma_mix_f32 v5, v5, v9, v12 op_sel:[0,0,1] op_sel_hi:[0,0,1]
; GFX10-CONTRACT-NEXT: v_cvt_f32_f16_e32 v15, v13		; GFX10-CONTRACT-NEXT: v_fma_mix_f32 v6, v6, v10, v13 op_sel_hi:[0,0,1]
; GFX10-CONTRACT-NEXT: v_cvt_f32_f16_sdwa v13, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; GFX10-CONTRACT-NEXT: v_fma_mix_f32 v7, v7, v11, v13 op_sel:[0,0,1] op_sel_hi:[0,0,1]
; GFX10-CONTRACT-NEXT: v_fmac_f32_e32 v14, v4, v8		; GFX10-CONTRACT-NEXT: v_add_f32_e32 v0, v0, v4
; GFX10-CONTRACT-NEXT: v_fmac_f32_e32 v12, v5, v9		; GFX10-CONTRACT-NEXT: v_add_f32_e32 v1, v1, v5
; GFX10-CONTRACT-NEXT: v_fmac_f32_e32 v15, v6, v10		; GFX10-CONTRACT-NEXT: v_add_f32_e32 v2, v2, v6
; GFX10-CONTRACT-NEXT: v_fmac_f32_e32 v13, v7, v11		; GFX10-CONTRACT-NEXT: v_add_f32_e32 v3, v3, v7
; GFX10-CONTRACT-NEXT: v_add_f32_e32 v0, v0, v14
; GFX10-CONTRACT-NEXT: v_add_f32_e32 v1, v1, v12
; GFX10-CONTRACT-NEXT: v_add_f32_e32 v2, v2, v15
; GFX10-CONTRACT-NEXT: v_add_f32_e32 v3, v3, v13
; GFX10-CONTRACT-NEXT: ; return to shader part epilog		; GFX10-CONTRACT-NEXT: ; return to shader part epilog
;		;
; GFX10-DENORM-LABEL: test_v4f16_v4f32_add_fma_ext_mul_rhs:		; GFX10-DENORM-LABEL: test_v4f16_v4f32_add_fma_ext_mul_rhs:
; GFX10-DENORM: ; %bb.0: ; %.entry		; GFX10-DENORM: ; %bb.0: ; %.entry
; GFX10-DENORM-NEXT: v_pk_mul_f16 v12, v12, v14		; GFX10-DENORM-NEXT: v_pk_mul_f16 v12, v12, v14
; GFX10-DENORM-NEXT: v_pk_mul_f16 v13, v13, v15		; GFX10-DENORM-NEXT: v_pk_mul_f16 v13, v13, v15
; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v14, v12		; GFX10-DENORM-NEXT: v_fma_mix_f32 v4, v4, v8, v12 op_sel_hi:[0,0,1]
; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v12, v12 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; GFX10-DENORM-NEXT: v_fma_mix_f32 v5, v5, v9, v12 op_sel:[0,0,1] op_sel_hi:[0,0,1]
; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v15, v13		; GFX10-DENORM-NEXT: v_fma_mix_f32 v6, v6, v10, v13 op_sel_hi:[0,0,1]
; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v13, v13 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1		; GFX10-DENORM-NEXT: v_fma_mix_f32 v7, v7, v11, v13 op_sel:[0,0,1] op_sel_hi:[0,0,1]
; GFX10-DENORM-NEXT: v_fmac_f32_e32 v14, v4, v8		; GFX10-DENORM-NEXT: v_add_f32_e32 v0, v0, v4
; GFX10-DENORM-NEXT: v_fmac_f32_e32 v12, v5, v9		; GFX10-DENORM-NEXT: v_add_f32_e32 v1, v1, v5
; GFX10-DENORM-NEXT: v_fmac_f32_e32 v15, v6, v10		; GFX10-DENORM-NEXT: v_add_f32_e32 v2, v2, v6
; GFX10-DENORM-NEXT: v_fmac_f32_e32 v13, v7, v11		; GFX10-DENORM-NEXT: v_add_f32_e32 v3, v3, v7
; GFX10-DENORM-NEXT: v_add_f32_e32 v0, v0, v14
; GFX10-DENORM-NEXT: v_add_f32_e32 v1, v1, v12
; GFX10-DENORM-NEXT: v_add_f32_e32 v2, v2, v15
; GFX10-DENORM-NEXT: v_add_f32_e32 v3, v3, v13
; GFX10-DENORM-NEXT: ; return to shader part epilog		; GFX10-DENORM-NEXT: ; return to shader part epilog
.entry:		.entry:
%a = fmul <4 x half> %u, %v		%a = fmul <4 x half> %u, %v
%b = fpext <4 x half> %a to <4 x float>		%b = fpext <4 x half> %a to <4 x float>
%c = call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %y, <4 x float> %z, <4 x float> %b)		%c = call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %y, <4 x float> %z, <4 x float> %b)
%d = fadd <4 x float> %x, %c		%d = fadd <4 x float> %x, %c
ret <4 x float> %d		ret <4 x float> %d
}		}
[84 lines not shown]

llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-ext-mul.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -global-isel -march=amdgcn -mcpu=gfx900 --denormal-fp-math=preserve-sign < %s | FileCheck -check-prefix=GFX9-FAST-DENORM %s			; RUN: llc -global-isel -march=amdgcn -mcpu=gfx900 --denormal-fp-math=preserve-sign < %s | FileCheck -check-prefix=GFX9-FAST-DENORM %s
	; RUN: llc -global-isel -march=amdgcn -mcpu=gfx1010 --denormal-fp-math=preserve-sign < %s | FileCheck -check-prefix=GFX10-FAST-DENORM %s			; RUN: llc -global-isel -march=amdgcn -mcpu=gfx1010 --denormal-fp-math=preserve-sign < %s | FileCheck -check-prefix=GFX10-FAST-DENORM %s

	; fold (fadd fast (fpext (fmul fast x, y)), z) -> (fma (fpext x), (fpext y), z)			; fold (fadd fast (fpext (fmul fast x, y)), z) -> (fma (fpext x), (fpext y), z)
	; fold (fadd fast x, (fpext (fmul fast y, z))) -> (fma (fpext y), (fpext z), x)			; fold (fadd fast x, (fpext (fmul fast y, z))) -> (fma (fpext y), (fpext z), x)

	define amdgpu_vs float @test_f16_f32_add_ext_mul(half inreg %x, half inreg %y, float inreg %z) {			define amdgpu_vs float @test_f16_f32_add_ext_mul(half inreg %x, half inreg %y, float inreg %z) {
	; GFX9-FAST-DENORM-LABEL: test_f16_f32_add_ext_mul:			; GFX9-FAST-DENORM-LABEL: test_f16_f32_add_ext_mul:
	; GFX9-FAST-DENORM: ; %bb.0: ; %.entry			; GFX9-FAST-DENORM: ; %bb.0: ; %.entry
	; GFX9-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v0, s0			; GFX9-FAST-DENORM-NEXT: v_mov_b32_e32 v0, s1
	; GFX9-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v1, s1			; GFX9-FAST-DENORM-NEXT: v_mov_b32_e32 v1, s2
	; GFX9-FAST-DENORM-NEXT: v_mad_f32 v0, v0, v1, s2			; GFX9-FAST-DENORM-NEXT: v_mad_mix_f32 v0, s0, v0, v1 op_sel_hi:[1,1,0]
	; GFX9-FAST-DENORM-NEXT: ; return to shader part epilog			; GFX9-FAST-DENORM-NEXT: ; return to shader part epilog
	;			;
	; GFX10-FAST-DENORM-LABEL: test_f16_f32_add_ext_mul:			; GFX10-FAST-DENORM-LABEL: test_f16_f32_add_ext_mul:
	; GFX10-FAST-DENORM: ; %bb.0: ; %.entry			; GFX10-FAST-DENORM: ; %bb.0: ; %.entry
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v0, s0			; GFX10-FAST-DENORM-NEXT: v_mov_b32_e32 v0, s2
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v1, s1			; GFX10-FAST-DENORM-NEXT: v_fma_mix_f32 v0, s0, s1, v0 op_sel_hi:[1,1,0]
	; GFX10-FAST-DENORM-NEXT: v_fma_f32 v0, v0, v1, s2
	; GFX10-FAST-DENORM-NEXT: ; return to shader part epilog			; GFX10-FAST-DENORM-NEXT: ; return to shader part epilog
	.entry:			.entry:
	%a = fmul fast half %x, %y			%a = fmul fast half %x, %y
	%b = fpext half %a to float			%b = fpext half %a to float
	%c = fadd fast float %b, %z			%c = fadd fast float %b, %z
	ret float %c			ret float %c
	}			}

	define amdgpu_vs float @test_f16_f32_add_ext_mul_rhs(half inreg %x, half inreg %y, float inreg %z) {			define amdgpu_vs float @test_f16_f32_add_ext_mul_rhs(half inreg %x, half inreg %y, float inreg %z) {
	; GFX9-FAST-DENORM-LABEL: test_f16_f32_add_ext_mul_rhs:			; GFX9-FAST-DENORM-LABEL: test_f16_f32_add_ext_mul_rhs:
	; GFX9-FAST-DENORM: ; %bb.0: ; %.entry			; GFX9-FAST-DENORM: ; %bb.0: ; %.entry
	; GFX9-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v0, s0			; GFX9-FAST-DENORM-NEXT: v_mov_b32_e32 v0, s1
	; GFX9-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v1, s1			; GFX9-FAST-DENORM-NEXT: v_mov_b32_e32 v1, s2
	; GFX9-FAST-DENORM-NEXT: v_mad_f32 v0, v0, v1, s2			; GFX9-FAST-DENORM-NEXT: v_mad_mix_f32 v0, s0, v0, v1 op_sel_hi:[1,1,0]
	; GFX9-FAST-DENORM-NEXT: ; return to shader part epilog			; GFX9-FAST-DENORM-NEXT: ; return to shader part epilog
	;			;
	; GFX10-FAST-DENORM-LABEL: test_f16_f32_add_ext_mul_rhs:			; GFX10-FAST-DENORM-LABEL: test_f16_f32_add_ext_mul_rhs:
	; GFX10-FAST-DENORM: ; %bb.0: ; %.entry			; GFX10-FAST-DENORM: ; %bb.0: ; %.entry
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v0, s0			; GFX10-FAST-DENORM-NEXT: v_mov_b32_e32 v0, s2
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v1, s1			; GFX10-FAST-DENORM-NEXT: v_fma_mix_f32 v0, s0, s1, v0 op_sel_hi:[1,1,0]
	; GFX10-FAST-DENORM-NEXT: v_fma_f32 v0, v0, v1, s2
	; GFX10-FAST-DENORM-NEXT: ; return to shader part epilog			; GFX10-FAST-DENORM-NEXT: ; return to shader part epilog
	.entry:			.entry:
	%a = fmul fast half %x, %y			%a = fmul fast half %x, %y
	%b = fpext half %a to float			%b = fpext half %a to float
	%c = fadd fast float %z, %b			%c = fadd fast float %z, %b
	ret float %c			ret float %c
	}			}

	[19 lines not shown]
	; GFX9-FAST-DENORM-NEXT: v_add_f32_e32 v1, s7, v4			; GFX9-FAST-DENORM-NEXT: v_add_f32_e32 v1, s7, v4
	; GFX9-FAST-DENORM-NEXT: v_add_f32_e32 v2, s8, v5			; GFX9-FAST-DENORM-NEXT: v_add_f32_e32 v2, s8, v5
	; GFX9-FAST-DENORM-NEXT: v_add_f32_e32 v3, s9, v6			; GFX9-FAST-DENORM-NEXT: v_add_f32_e32 v3, s9, v6
	; GFX9-FAST-DENORM-NEXT: v_add_f32_e32 v4, s10, v7			; GFX9-FAST-DENORM-NEXT: v_add_f32_e32 v4, s10, v7
	; GFX9-FAST-DENORM-NEXT: ; return to shader part epilog			; GFX9-FAST-DENORM-NEXT: ; return to shader part epilog
	;			;
	; GFX10-FAST-DENORM-LABEL: test_5xf16_5xf32_add_ext_mul:			; GFX10-FAST-DENORM-LABEL: test_5xf16_5xf32_add_ext_mul:
	; GFX10-FAST-DENORM: ; %bb.0: ; %.entry			; GFX10-FAST-DENORM: ; %bb.0: ; %.entry
	; GFX10-FAST-DENORM-NEXT: s_lshr_b32 s11, s0, 16			; GFX10-FAST-DENORM-NEXT: v_mov_b32_e32 v0, s6
	; GFX10-FAST-DENORM-NEXT: s_lshr_b32 s12, s1, 16			; GFX10-FAST-DENORM-NEXT: v_mov_b32_e32 v1, s7
	; GFX10-FAST-DENORM-NEXT: s_lshr_b32 s13, s3, 16			; GFX10-FAST-DENORM-NEXT: v_mov_b32_e32 v2, s8
	; GFX10-FAST-DENORM-NEXT: s_lshr_b32 s14, s4, 16			; GFX10-FAST-DENORM-NEXT: v_mov_b32_e32 v3, s9
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v0, s0			; GFX10-FAST-DENORM-NEXT: v_mov_b32_e32 v4, s10
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v1, s11			; GFX10-FAST-DENORM-NEXT: v_fma_mix_f32 v0, s0, s3, v0 op_sel_hi:[1,1,0]
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v2, s1			; GFX10-FAST-DENORM-NEXT: v_fma_mix_f32 v1, s0, s3, v1 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v3, s12			; GFX10-FAST-DENORM-NEXT: v_fma_mix_f32 v2, s1, s4, v2 op_sel_hi:[1,1,0]
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v4, s2			; GFX10-FAST-DENORM-NEXT: v_fma_mix_f32 v3, s1, s4, v3 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v5, s3			; GFX10-FAST-DENORM-NEXT: v_fma_mix_f32 v4, s2, s5, v4 op_sel_hi:[1,1,0]
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v6, s13
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v7, s4
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v8, s14
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v9, s5
	; GFX10-FAST-DENORM-NEXT: v_fma_f32 v0, v0, v5, s6
	; GFX10-FAST-DENORM-NEXT: v_fma_f32 v1, v1, v6, s7
	; GFX10-FAST-DENORM-NEXT: v_fma_f32 v2, v2, v7, s8
	; GFX10-FAST-DENORM-NEXT: v_fma_f32 v3, v3, v8, s9
	; GFX10-FAST-DENORM-NEXT: v_fma_f32 v4, v4, v9, s10
	; GFX10-FAST-DENORM-NEXT: ; return to shader part epilog			; GFX10-FAST-DENORM-NEXT: ; return to shader part epilog
	.entry:			.entry:
	%a = fmul fast <5 x half> %x, %y			%a = fmul fast <5 x half> %x, %y
	%b = fpext <5 x half> %a to <5 x float>			%b = fpext <5 x half> %a to <5 x float>
	%c = fadd fast <5 x float> %b, %z			%c = fadd fast <5 x float> %b, %z
	ret <5 x float> %c			ret <5 x float> %c
	}			}

	[17 lines not shown]
	; GFX9-FAST-DENORM-NEXT: v_add_f32_e32 v2, s8, v5			; GFX9-FAST-DENORM-NEXT: v_add_f32_e32 v2, s8, v5
	; GFX9-FAST-DENORM-NEXT: v_add_f32_e32 v3, s9, v6			; GFX9-FAST-DENORM-NEXT: v_add_f32_e32 v3, s9, v6
	; GFX9-FAST-DENORM-NEXT: v_add_f32_e32 v4, s10, v7			; GFX9-FAST-DENORM-NEXT: v_add_f32_e32 v4, s10, v7
	; GFX9-FAST-DENORM-NEXT: v_add_f32_e32 v5, s11, v8			; GFX9-FAST-DENORM-NEXT: v_add_f32_e32 v5, s11, v8
	; GFX9-FAST-DENORM-NEXT: ; return to shader part epilog			; GFX9-FAST-DENORM-NEXT: ; return to shader part epilog
	;			;
	; GFX10-FAST-DENORM-LABEL: test_6xf16_6xf32_add_ext_mul_rhs:			; GFX10-FAST-DENORM-LABEL: test_6xf16_6xf32_add_ext_mul_rhs:
	; GFX10-FAST-DENORM: ; %bb.0: ; %.entry			; GFX10-FAST-DENORM: ; %bb.0: ; %.entry
	; GFX10-FAST-DENORM-NEXT: s_lshr_b32 s12, s0, 16			; GFX10-FAST-DENORM-NEXT: v_mov_b32_e32 v0, s6
	; GFX10-FAST-DENORM-NEXT: s_lshr_b32 s13, s1, 16			; GFX10-FAST-DENORM-NEXT: v_mov_b32_e32 v1, s7
	; GFX10-FAST-DENORM-NEXT: s_lshr_b32 s14, s2, 16			; GFX10-FAST-DENORM-NEXT: v_mov_b32_e32 v2, s8
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v0, s0			; GFX10-FAST-DENORM-NEXT: v_mov_b32_e32 v3, s9
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v2, s1			; GFX10-FAST-DENORM-NEXT: v_mov_b32_e32 v4, s10
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v4, s2			; GFX10-FAST-DENORM-NEXT: v_mov_b32_e32 v5, s11
	; GFX10-FAST-DENORM-NEXT: s_lshr_b32 s0, s3, 16			; GFX10-FAST-DENORM-NEXT: v_fma_mix_f32 v0, s0, s3, v0 op_sel_hi:[1,1,0]
	; GFX10-FAST-DENORM-NEXT: s_lshr_b32 s1, s4, 16			; GFX10-FAST-DENORM-NEXT: v_fma_mix_f32 v1, s0, s3, v1 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-FAST-DENORM-NEXT: s_lshr_b32 s2, s5, 16			; GFX10-FAST-DENORM-NEXT: v_fma_mix_f32 v2, s1, s4, v2 op_sel_hi:[1,1,0]
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v1, s12			; GFX10-FAST-DENORM-NEXT: v_fma_mix_f32 v3, s1, s4, v3 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v3, s13			; GFX10-FAST-DENORM-NEXT: v_fma_mix_f32 v4, s2, s5, v4 op_sel_hi:[1,1,0]
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v5, s14			; GFX10-FAST-DENORM-NEXT: v_fma_mix_f32 v5, s2, s5, v5 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v6, s3
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v7, s0
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v8, s4
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v9, s1
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v10, s5
	; GFX10-FAST-DENORM-NEXT: v_cvt_f32_f16_e32 v11, s2
	; GFX10-FAST-DENORM-NEXT: v_fma_f32 v0, v0, v6, s6
	; GFX10-FAST-DENORM-NEXT: v_fma_f32 v1, v1, v7, s7
	; GFX10-FAST-DENORM-NEXT: v_fma_f32 v2, v2, v8, s8
	; GFX10-FAST-DENORM-NEXT: v_fma_f32 v3, v3, v9, s9
	; GFX10-FAST-DENORM-NEXT: v_fma_f32 v4, v4, v10, s10
	; GFX10-FAST-DENORM-NEXT: v_fma_f32 v5, v5, v11, s11
	; GFX10-FAST-DENORM-NEXT: ; return to shader part epilog			; GFX10-FAST-DENORM-NEXT: ; return to shader part epilog
	.entry:			.entry:
	%a = fmul fast <6 x half> %x, %y			%a = fmul fast <6 x half> %x, %y
	%b = fpext <6 x half> %a to <6 x float>			%b = fpext <6 x half> %a to <6 x float>
	%c = fadd fast <6 x float> %z, %b			%c = fadd fast <6 x float> %z, %b
	ret <6 x float> %c			ret <6 x float> %c
	}			}
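
A side note on why the folds in this file are written with `fast` on the `fmul` and `fadd`: the folded form (fma (fpext x), (fpext y), z) skips the intermediate rounding of the product to half, so it can return a different value than the original sequence. The sketch below is only an illustration under stated assumptions, not part of the patch. It emulates the two evaluation orders in plain Python, using the `struct` module to round to half and single precision. The helper names (`to_f16`, `to_f32`, `unfused`, `fused`) are hypothetical, and the inputs are chosen so that the double-precision arithmetic used for the emulation is exact.

```python
import struct


def to_f16(v):
    """Round a Python float to IEEE half precision (round-to-nearest-even)."""
    return struct.unpack("<e", struct.pack("<e", v))[0]


def to_f32(v):
    """Round a Python float to IEEE single precision."""
    return struct.unpack("<f", struct.pack("<f", v))[0]


def unfused(x, y, z):
    """Emulates: %a = fmul half x, y; %b = fpext %a; %c = fadd float %b, z."""
    a = to_f16(x * y)      # product is rounded to half here
    return to_f32(a + z)   # fpext is exact, then one f32 rounding for the add


def fused(x, y, z):
    """Emulates: fma (fpext x), (fpext y), z with a single final rounding.

    Assumption: for the inputs used below, x * y + z is exact in double,
    so rounding once to f32 matches what a real fma would return.
    """
    return to_f32(x * y + z)


# Both operands are exactly representable in half: 1 + 2^-10.
x = y = to_f16(1.0 + 2.0 ** -10)
z = 0.0

print(unfused(x, y, z))  # product rounded to half before the add
print(fused(x, y, z))    # product kept exact until the final rounding
```

With these inputs the exact product is 1 + 2^-9 + 2^-20. The unfused path drops the 2^-20 term when rounding to half, while the fused path keeps it, which is the kind of difference the fast-math flags permit.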

llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-sub-ext-mul.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -global-isel -march=amdgcn -mcpu=gfx900 --denormal-fp-math=preserve-sign < %s | FileCheck -check-prefix=GFX9-DENORM %s			; RUN: llc -global-isel -march=amdgcn -mcpu=gfx900 --denormal-fp-math=preserve-sign < %s | FileCheck -check-prefix=GFX9-DENORM %s
	; RUN: llc -global-isel -march=amdgcn -mcpu=gfx1010 --denormal-fp-math=preserve-sign < %s | FileCheck -check-prefix=GFX10-DENORM %s			; RUN: llc -global-isel -march=amdgcn -mcpu=gfx1010 --denormal-fp-math=preserve-sign < %s | FileCheck -check-prefix=GFX10-DENORM %s

	; fold (fsub (fpext (fmul x, y)), z) -> (fma (fpext x), (fpext y), (fneg z))			; fold (fsub (fpext (fmul x, y)), z) -> (fma (fpext x), (fpext y), (fneg z))
	define amdgpu_vs float @test_f16_to_f32_sub_ext_mul(half %x, half %y, float %z) {			define amdgpu_vs float @test_f16_to_f32_sub_ext_mul(half %x, half %y, float %z) {
	; GFX9-DENORM-LABEL: test_f16_to_f32_sub_ext_mul:			; GFX9-DENORM-LABEL: test_f16_to_f32_sub_ext_mul:
	; GFX9-DENORM: ; %bb.0: ; %entry			; GFX9-DENORM: ; %bb.0: ; %entry
	; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v0, v0			; GFX9-DENORM-NEXT: v_mad_mix_f32 v0, v0, v1, -v2 op_sel_hi:[1,1,0]
	; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v1, v1
	; GFX9-DENORM-NEXT: v_mad_f32 v0, v0, v1, -v2
	; GFX9-DENORM-NEXT: ; return to shader part epilog			; GFX9-DENORM-NEXT: ; return to shader part epilog
	;			;
	; GFX10-DENORM-LABEL: test_f16_to_f32_sub_ext_mul:			; GFX10-DENORM-LABEL: test_f16_to_f32_sub_ext_mul:
	; GFX10-DENORM: ; %bb.0: ; %entry			; GFX10-DENORM: ; %bb.0: ; %entry
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v0, v0			; GFX10-DENORM-NEXT: v_fma_mix_f32 v0, v0, v1, -v2 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v1, v1
	; GFX10-DENORM-NEXT: v_fma_f32 v0, v0, v1, -v2
	; GFX10-DENORM-NEXT: ; return to shader part epilog			; GFX10-DENORM-NEXT: ; return to shader part epilog
	entry:			entry:
	%a = fmul fast half %x, %y			%a = fmul fast half %x, %y
	%b = fpext half %a to float			%b = fpext half %a to float
	%c = fsub fast float %b, %z			%c = fsub fast float %b, %z
	ret float %c			ret float %c
	}			}

	; fold (fsub x, (fpext (fmul y, z))) -> (fma (fneg (fpext y)), (fpext z), x)			; fold (fsub x, (fpext (fmul y, z))) -> (fma (fneg (fpext y)), (fpext z), x)
	define amdgpu_vs float @test_f16_to_f32_sub_ext_mul_rhs(float %x, half %y, half %z) {			define amdgpu_vs float @test_f16_to_f32_sub_ext_mul_rhs(float %x, half %y, half %z) {
	; GFX9-DENORM-LABEL: test_f16_to_f32_sub_ext_mul_rhs:			; GFX9-DENORM-LABEL: test_f16_to_f32_sub_ext_mul_rhs:
	; GFX9-DENORM: ; %bb.0: ; %.entry			; GFX9-DENORM: ; %bb.0: ; %.entry
	; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v1, v1			; GFX9-DENORM-NEXT: v_mad_mix_f32 v0, -v1, v2, v0 op_sel_hi:[1,1,0]
	; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v2, v2
	; GFX9-DENORM-NEXT: v_mad_f32 v0, -v1, v2, v0
	; GFX9-DENORM-NEXT: ; return to shader part epilog			; GFX9-DENORM-NEXT: ; return to shader part epilog
	;			;
	; GFX10-DENORM-LABEL: test_f16_to_f32_sub_ext_mul_rhs:			; GFX10-DENORM-LABEL: test_f16_to_f32_sub_ext_mul_rhs:
	; GFX10-DENORM: ; %bb.0: ; %.entry			; GFX10-DENORM: ; %bb.0: ; %.entry
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v1, v1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v0, -v1, v2, v0 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v2, v2
	; GFX10-DENORM-NEXT: v_fma_f32 v0, -v1, v2, v0
	; GFX10-DENORM-NEXT: ; return to shader part epilog			; GFX10-DENORM-NEXT: ; return to shader part epilog
	.entry:			.entry:
	%a = fmul fast half %y, %z			%a = fmul fast half %y, %z
	%b = fpext half %a to float			%b = fpext half %a to float
	%c = fsub fast float %x, %b			%c = fsub fast float %x, %b
	ret float %c			ret float %c
	}			}

	[10 lines not shown]
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v0, v2, v4			; GFX9-DENORM-NEXT: v_sub_f32_e32 v0, v2, v4
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v1, v3, v5			; GFX9-DENORM-NEXT: v_sub_f32_e32 v1, v3, v5
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v2, v8, v6			; GFX9-DENORM-NEXT: v_sub_f32_e32 v2, v8, v6
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v3, v9, v7			; GFX9-DENORM-NEXT: v_sub_f32_e32 v3, v9, v7
	; GFX9-DENORM-NEXT: ; return to shader part epilog			; GFX9-DENORM-NEXT: ; return to shader part epilog
	;			;
	; GFX10-DENORM-LABEL: test_v4f16_to_v4f32_sub_ext_mul:			; GFX10-DENORM-LABEL: test_v4f16_to_v4f32_sub_ext_mul:
	; GFX10-DENORM: ; %bb.0: ; %entry			; GFX10-DENORM: ; %bb.0: ; %entry
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v8, v0			; GFX10-DENORM-NEXT: v_fma_mix_f32 v4, v0, v2, -v4 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v9, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v5, v0, v2, -v5 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v10, v1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v2, v1, v3, -v6 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v11, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v3, v1, v3, -v7 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v0, v2			; GFX10-DENORM-NEXT: v_mov_b32_e32 v0, v4
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX10-DENORM-NEXT: v_mov_b32_e32 v1, v5
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v2, v3
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v3, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; GFX10-DENORM-NEXT: v_fma_f32 v0, v8, v0, -v4
	; GFX10-DENORM-NEXT: v_fma_f32 v1, v9, v1, -v5
	; GFX10-DENORM-NEXT: v_fma_f32 v2, v10, v2, -v6
	; GFX10-DENORM-NEXT: v_fma_f32 v3, v11, v3, -v7
	; GFX10-DENORM-NEXT: ; return to shader part epilog			; GFX10-DENORM-NEXT: ; return to shader part epilog
	entry:			entry:
	%a = fmul fast <4 x half> %x, %y			%a = fmul fast <4 x half> %x, %y
	%b = fpext <4 x half> %a to <4 x float>			%b = fpext <4 x half> %a to <4 x float>
	%c = fsub fast <4 x float> %b, %z			%c = fsub fast <4 x float> %b, %z
	ret <4 x float> %c			ret <4 x float> %c
	}			}

	; GFX9-DENORM-NEXT: v_sub_f32_e32 v0, v0, v6			; GFX9-DENORM-NEXT: v_sub_f32_e32 v0, v0, v6
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v1, v1, v4			; GFX9-DENORM-NEXT: v_sub_f32_e32 v1, v1, v4
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v2, v2, v7			; GFX9-DENORM-NEXT: v_sub_f32_e32 v2, v2, v7
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v3, v3, v5			; GFX9-DENORM-NEXT: v_sub_f32_e32 v3, v3, v5
	; GFX9-DENORM-NEXT: ; return to shader part epilog			; GFX9-DENORM-NEXT: ; return to shader part epilog
	;			;
	; GFX10-DENORM-LABEL: test_v4f16_to_v4f32_sub_ext_mul_rhs:			; GFX10-DENORM-LABEL: test_v4f16_to_v4f32_sub_ext_mul_rhs:
	; GFX10-DENORM: ; %bb.0: ; %.entry			; GFX10-DENORM: ; %bb.0: ; %.entry
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v8, v4			; GFX10-DENORM-NEXT: v_fma_mix_f32 v0, -v4, v6, v0 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v4, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v1, -v4, v6, v1 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v9, v5			; GFX10-DENORM-NEXT: v_fma_mix_f32 v2, -v5, v7, v2 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v5, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v3, -v5, v7, v3 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v10, v6
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v6, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v11, v7
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v7, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; GFX10-DENORM-NEXT: v_fma_f32 v0, -v8, v10, v0
	; GFX10-DENORM-NEXT: v_fma_f32 v1, -v4, v6, v1
	; GFX10-DENORM-NEXT: v_fma_f32 v2, -v9, v11, v2
	; GFX10-DENORM-NEXT: v_fma_f32 v3, -v5, v7, v3
	; GFX10-DENORM-NEXT: ; return to shader part epilog			; GFX10-DENORM-NEXT: ; return to shader part epilog
	.entry:			.entry:
	%a = fmul fast <4 x half> %y, %z			%a = fmul fast <4 x half> %y, %z
	%b = fpext <4 x half> %a to <4 x float>			%b = fpext <4 x half> %a to <4 x float>
	%c = fsub fast <4 x float> %x, %b			%c = fsub fast <4 x float> %x, %b
	ret <4 x float> %c			ret <4 x float> %c
	}			}

llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-sub-ext-neg-mul.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -global-isel -march=amdgcn -mcpu=gfx900 --denormal-fp-math=preserve-sign < %s \| FileCheck -check-prefix=GFX9-DENORM %s			; RUN: llc -global-isel -march=amdgcn -mcpu=gfx900 --denormal-fp-math=preserve-sign < %s \| FileCheck -check-prefix=GFX9-DENORM %s
	; RUN: llc -global-isel -march=amdgcn -mcpu=gfx1010 --denormal-fp-math=preserve-sign < %s \| FileCheck -check-prefix=GFX10-DENORM %s			; RUN: llc -global-isel -march=amdgcn -mcpu=gfx1010 --denormal-fp-math=preserve-sign < %s \| FileCheck -check-prefix=GFX10-DENORM %s

	; fold (fsub (fpext (fneg (fmul, x, y))), z) -> (fneg (fma (fpext x), (fpext y), z))			; fold (fsub (fpext (fneg (fmul, x, y))), z) -> (fneg (fma (fpext x), (fpext y), z))
	define amdgpu_vs float @test_f16_to_f32_sub_ext_neg_mul(half %x, half %y, float %z) {			define amdgpu_vs float @test_f16_to_f32_sub_ext_neg_mul(half %x, half %y, float %z) {
	; GFX9-DENORM-LABEL: test_f16_to_f32_sub_ext_neg_mul:			; GFX9-DENORM-LABEL: test_f16_to_f32_sub_ext_neg_mul:
	; GFX9-DENORM: ; %bb.0: ; %entry			; GFX9-DENORM: ; %bb.0: ; %entry
	; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v0, v0			; GFX9-DENORM-NEXT: v_mad_mix_f32 v0, v0, -v1, -v2 op_sel_hi:[1,1,0]
	; GFX9-DENORM-NEXT: v_cvt_f32_f16_e64 v1, -v1
	; GFX9-DENORM-NEXT: v_mad_f32 v0, v0, v1, -v2
	; GFX9-DENORM-NEXT: ; return to shader part epilog			; GFX9-DENORM-NEXT: ; return to shader part epilog
	;			;
	; GFX10-DENORM-LABEL: test_f16_to_f32_sub_ext_neg_mul:			; GFX10-DENORM-LABEL: test_f16_to_f32_sub_ext_neg_mul:
	; GFX10-DENORM: ; %bb.0: ; %entry			; GFX10-DENORM: ; %bb.0: ; %entry
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v0, v0			; GFX10-DENORM-NEXT: v_fma_mix_f32 v0, v0, -v1, -v2 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e64 v1, -v1
	; GFX10-DENORM-NEXT: v_fma_f32 v0, v0, v1, -v2
	; GFX10-DENORM-NEXT: ; return to shader part epilog			; GFX10-DENORM-NEXT: ; return to shader part epilog
	entry:			entry:
	%a = fmul fast half %x, %y			%a = fmul fast half %x, %y
	%b = fneg half %a			%b = fneg half %a
	%c = fpext half %b to float			%c = fpext half %b to float
	%d = fsub fast float %c, %z			%d = fsub fast float %c, %z
	ret float %d			ret float %d
	}			}

	; fold (fsub (fneg (fpext (fmul, x, y))), z) -> (fneg (fma (fpext x)), (fpext y), z)			; fold (fsub (fneg (fpext (fmul, x, y))), z) -> (fneg (fma (fpext x)), (fpext y), z)
	define amdgpu_vs float @test_f16_to_f32_sub_neg_ext_mul(half %x, half %y, float %z) {			define amdgpu_vs float @test_f16_to_f32_sub_neg_ext_mul(half %x, half %y, float %z) {
	; GFX9-DENORM-LABEL: test_f16_to_f32_sub_neg_ext_mul:			; GFX9-DENORM-LABEL: test_f16_to_f32_sub_neg_ext_mul:
	; GFX9-DENORM: ; %bb.0: ; %entry			; GFX9-DENORM: ; %bb.0: ; %entry
	; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v0, v0			; GFX9-DENORM-NEXT: v_mad_mix_f32 v0, v0, -v1, -v2 op_sel_hi:[1,1,0]
	; GFX9-DENORM-NEXT: v_cvt_f32_f16_e64 v1, -v1
	; GFX9-DENORM-NEXT: v_mad_f32 v0, v0, v1, -v2
	; GFX9-DENORM-NEXT: ; return to shader part epilog			; GFX9-DENORM-NEXT: ; return to shader part epilog
	;			;
	; GFX10-DENORM-LABEL: test_f16_to_f32_sub_neg_ext_mul:			; GFX10-DENORM-LABEL: test_f16_to_f32_sub_neg_ext_mul:
	; GFX10-DENORM: ; %bb.0: ; %entry			; GFX10-DENORM: ; %bb.0: ; %entry
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v0, v0			; GFX10-DENORM-NEXT: v_fma_mix_f32 v0, v0, -v1, -v2 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e64 v1, -v1
	; GFX10-DENORM-NEXT: v_fma_f32 v0, v0, v1, -v2
	; GFX10-DENORM-NEXT: ; return to shader part epilog			; GFX10-DENORM-NEXT: ; return to shader part epilog
	entry:			entry:
	%a = fmul fast half %x, %y			%a = fmul fast half %x, %y
	%b = fpext half %a to float			%b = fpext half %a to float
	%c = fneg float %b			%c = fneg float %b
	%d = fsub fast float %c, %z			%d = fsub fast float %c, %z
	ret float %d			ret float %d
	}			}


	; fold (fsub x, (fpext (fneg (fmul y, z)))) -> (fma (fpext y), (fpext z), x)			; fold (fsub x, (fpext (fneg (fmul y, z)))) -> (fma (fpext y), (fpext z), x)
	define amdgpu_vs float @test_f16_to_f32_sub_ext_neg_mul2(float %x, half %y, half %z) {			define amdgpu_vs float @test_f16_to_f32_sub_ext_neg_mul2(float %x, half %y, half %z) {
	; GFX9-DENORM-LABEL: test_f16_to_f32_sub_ext_neg_mul2:			; GFX9-DENORM-LABEL: test_f16_to_f32_sub_ext_neg_mul2:
	; GFX9-DENORM: ; %bb.0: ; %entry			; GFX9-DENORM: ; %bb.0: ; %entry
	; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v1, v1			; GFX9-DENORM-NEXT: v_mad_mix_f32 v0, -v1, -v2, v0 op_sel_hi:[1,1,0]
	; GFX9-DENORM-NEXT: v_cvt_f32_f16_e64 v2, -v2
	; GFX9-DENORM-NEXT: v_mad_f32 v0, -v1, v2, v0
	; GFX9-DENORM-NEXT: ; return to shader part epilog			; GFX9-DENORM-NEXT: ; return to shader part epilog
	;			;
	; GFX10-DENORM-LABEL: test_f16_to_f32_sub_ext_neg_mul2:			; GFX10-DENORM-LABEL: test_f16_to_f32_sub_ext_neg_mul2:
	; GFX10-DENORM: ; %bb.0: ; %entry			; GFX10-DENORM: ; %bb.0: ; %entry
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v1, v1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v0, -v1, -v2, v0 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e64 v2, -v2
	; GFX10-DENORM-NEXT: v_fma_f32 v0, -v1, v2, v0
	; GFX10-DENORM-NEXT: ; return to shader part epilog			; GFX10-DENORM-NEXT: ; return to shader part epilog
	entry:			entry:
	%a = fmul fast half %y, %z			%a = fmul fast half %y, %z
	%b = fneg half %a			%b = fneg half %a
	%c = fpext half %b to float			%c = fpext half %b to float
	%d = fsub fast float %x, %c			%d = fsub fast float %x, %c
	ret float %d			ret float %d
	}			}

	; fold (fsub x, (fneg (fpext (fmul y, z)))) -> (fma (fpext y), (fpext z), x)			; fold (fsub x, (fneg (fpext (fmul y, z)))) -> (fma (fpext y), (fpext z), x)
	define amdgpu_vs float @test_f16_to_f32_sub_neg_ext_mul2(float %x, half %y, half %z) {			define amdgpu_vs float @test_f16_to_f32_sub_neg_ext_mul2(float %x, half %y, half %z) {
	; GFX9-DENORM-LABEL: test_f16_to_f32_sub_neg_ext_mul2:			; GFX9-DENORM-LABEL: test_f16_to_f32_sub_neg_ext_mul2:
	; GFX9-DENORM: ; %bb.0: ; %entry			; GFX9-DENORM: ; %bb.0: ; %entry
	; GFX9-DENORM-NEXT: v_cvt_f32_f16_e32 v1, v1			; GFX9-DENORM-NEXT: v_mad_mix_f32 v0, -v1, -v2, v0 op_sel_hi:[1,1,0]
	; GFX9-DENORM-NEXT: v_cvt_f32_f16_e64 v2, -v2
	; GFX9-DENORM-NEXT: v_mad_f32 v0, -v1, v2, v0
	; GFX9-DENORM-NEXT: ; return to shader part epilog			; GFX9-DENORM-NEXT: ; return to shader part epilog
	;			;
	; GFX10-DENORM-LABEL: test_f16_to_f32_sub_neg_ext_mul2:			; GFX10-DENORM-LABEL: test_f16_to_f32_sub_neg_ext_mul2:
	; GFX10-DENORM: ; %bb.0: ; %entry			; GFX10-DENORM: ; %bb.0: ; %entry
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v1, v1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v0, -v1, -v2, v0 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e64 v2, -v2
	; GFX10-DENORM-NEXT: v_fma_f32 v0, -v1, v2, v0
	; GFX10-DENORM-NEXT: ; return to shader part epilog			; GFX10-DENORM-NEXT: ; return to shader part epilog
	entry:			entry:
	%a = fmul fast half %y, %z			%a = fmul fast half %y, %z
	%b = fpext half %a to float			%b = fpext half %a to float
	%c = fneg float %b			%c = fneg float %b
	%d = fsub fast float %x, %c			%d = fsub fast float %x, %c
	ret float %d			ret float %d
	}			}
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v0, v2, v4			; GFX9-DENORM-NEXT: v_sub_f32_e32 v0, v2, v4
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v1, v3, v5			; GFX9-DENORM-NEXT: v_sub_f32_e32 v1, v3, v5
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v2, v8, v6			; GFX9-DENORM-NEXT: v_sub_f32_e32 v2, v8, v6
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v3, v9, v7			; GFX9-DENORM-NEXT: v_sub_f32_e32 v3, v9, v7
	; GFX9-DENORM-NEXT: ; return to shader part epilog			; GFX9-DENORM-NEXT: ; return to shader part epilog
	;			;
	; GFX10-DENORM-LABEL: test_v4f16_to_v4f32_sub_ext_neg_mul:			; GFX10-DENORM-LABEL: test_v4f16_to_v4f32_sub_ext_neg_mul:
	; GFX10-DENORM: ; %bb.0: ; %entry			; GFX10-DENORM: ; %bb.0: ; %entry
	; GFX10-DENORM-NEXT: v_xor_b32_e32 v2, 0x80008000, v2			; GFX10-DENORM-NEXT: v_xor_b32_e32 v8, 0x80008000, v2
	; GFX10-DENORM-NEXT: v_xor_b32_e32 v3, 0x80008000, v3			; GFX10-DENORM-NEXT: v_xor_b32_e32 v9, 0x80008000, v3
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v8, v0			; GFX10-DENORM-NEXT: v_fma_mix_f32 v5, v0, -v2, -v5 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v9, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v3, v1, -v3, -v7 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v10, v1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v0, v0, v8, -v4 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v11, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v2, v1, v9, -v6 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v0, v2			; GFX10-DENORM-NEXT: v_mov_b32_e32 v1, v5
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v2, v3
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v3, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; GFX10-DENORM-NEXT: v_fma_f32 v0, v8, v0, -v4
	; GFX10-DENORM-NEXT: v_fma_f32 v1, v9, v1, -v5
	; GFX10-DENORM-NEXT: v_fma_f32 v2, v10, v2, -v6
	; GFX10-DENORM-NEXT: v_fma_f32 v3, v11, v3, -v7
	; GFX10-DENORM-NEXT: ; return to shader part epilog			; GFX10-DENORM-NEXT: ; return to shader part epilog
	entry:			entry:
	%a = fmul fast <4 x half> %x, %y			%a = fmul fast <4 x half> %x, %y
	%b = fneg <4 x half> %a			%b = fneg <4 x half> %a
	%c = fpext <4 x half> %b to <4 x float>			%c = fpext <4 x half> %b to <4 x float>
	%d = fsub fast <4 x float> %c, %z			%d = fsub fast <4 x float> %c, %z
	ret <4 x float> %d			ret <4 x float> %d
	}			}
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v0, v2, v4			; GFX9-DENORM-NEXT: v_sub_f32_e32 v0, v2, v4
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v1, v3, v5			; GFX9-DENORM-NEXT: v_sub_f32_e32 v1, v3, v5
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v2, v8, v6			; GFX9-DENORM-NEXT: v_sub_f32_e32 v2, v8, v6
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v3, v9, v7			; GFX9-DENORM-NEXT: v_sub_f32_e32 v3, v9, v7
	; GFX9-DENORM-NEXT: ; return to shader part epilog			; GFX9-DENORM-NEXT: ; return to shader part epilog
	;			;
	; GFX10-DENORM-LABEL: test_v4f16_to_v4f32_sub_neg_ext_mul:			; GFX10-DENORM-LABEL: test_v4f16_to_v4f32_sub_neg_ext_mul:
	; GFX10-DENORM: ; %bb.0: ; %entry			; GFX10-DENORM: ; %bb.0: ; %entry
	; GFX10-DENORM-NEXT: v_xor_b32_e32 v2, 0x80008000, v2			; GFX10-DENORM-NEXT: v_xor_b32_e32 v8, 0x80008000, v2
	; GFX10-DENORM-NEXT: v_xor_b32_e32 v3, 0x80008000, v3			; GFX10-DENORM-NEXT: v_xor_b32_e32 v9, 0x80008000, v3
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v8, v0			; GFX10-DENORM-NEXT: v_fma_mix_f32 v5, v0, -v2, -v5 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v9, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v3, v1, -v3, -v7 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v10, v1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v0, v0, v8, -v4 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v11, v1 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v2, v1, v9, -v6 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v0, v2			; GFX10-DENORM-NEXT: v_mov_b32_e32 v1, v5
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v1, v2 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v2, v3
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v3, v3 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; GFX10-DENORM-NEXT: v_fma_f32 v0, v8, v0, -v4
	; GFX10-DENORM-NEXT: v_fma_f32 v1, v9, v1, -v5
	; GFX10-DENORM-NEXT: v_fma_f32 v2, v10, v2, -v6
	; GFX10-DENORM-NEXT: v_fma_f32 v3, v11, v3, -v7
	; GFX10-DENORM-NEXT: ; return to shader part epilog			; GFX10-DENORM-NEXT: ; return to shader part epilog
	entry:			entry:
	%a = fmul fast <4 x half> %x, %y			%a = fmul fast <4 x half> %x, %y
	%b = fpext <4 x half> %a to <4 x float>			%b = fpext <4 x half> %a to <4 x float>
	%c = fneg <4 x float> %b			%c = fneg <4 x float> %b
	%d = fsub fast <4 x float> %c, %z			%d = fsub fast <4 x float> %c, %z
	ret <4 x float> %d			ret <4 x float> %d
	}			}
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v0, v0, v6			; GFX9-DENORM-NEXT: v_sub_f32_e32 v0, v0, v6
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v1, v1, v4			; GFX9-DENORM-NEXT: v_sub_f32_e32 v1, v1, v4
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v2, v2, v7			; GFX9-DENORM-NEXT: v_sub_f32_e32 v2, v2, v7
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v3, v3, v5			; GFX9-DENORM-NEXT: v_sub_f32_e32 v3, v3, v5
	; GFX9-DENORM-NEXT: ; return to shader part epilog			; GFX9-DENORM-NEXT: ; return to shader part epilog
	;			;
	; GFX10-DENORM-LABEL: test_v4f16_to_v4f32_sub_ext_neg_mul2:			; GFX10-DENORM-LABEL: test_v4f16_to_v4f32_sub_ext_neg_mul2:
	; GFX10-DENORM: ; %bb.0: ; %entry			; GFX10-DENORM: ; %bb.0: ; %entry
	; GFX10-DENORM-NEXT: v_xor_b32_e32 v6, 0x80008000, v6			; GFX10-DENORM-NEXT: v_xor_b32_e32 v8, 0x80008000, v6
	; GFX10-DENORM-NEXT: v_xor_b32_e32 v7, 0x80008000, v7			; GFX10-DENORM-NEXT: v_xor_b32_e32 v9, 0x80008000, v7
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v8, v4			; GFX10-DENORM-NEXT: v_fma_mix_f32 v1, -v4, -v6, v1 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v4, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v3, -v5, -v7, v3 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v9, v5			; GFX10-DENORM-NEXT: v_fma_mix_f32 v0, -v4, v8, v0 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v5, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v2, -v5, v9, v2 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v10, v6
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v6, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v11, v7
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v7, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; GFX10-DENORM-NEXT: v_fma_f32 v0, -v8, v10, v0
	; GFX10-DENORM-NEXT: v_fma_f32 v1, -v4, v6, v1
	; GFX10-DENORM-NEXT: v_fma_f32 v2, -v9, v11, v2
	; GFX10-DENORM-NEXT: v_fma_f32 v3, -v5, v7, v3
	; GFX10-DENORM-NEXT: ; return to shader part epilog			; GFX10-DENORM-NEXT: ; return to shader part epilog
	entry:			entry:
	%a = fmul fast <4 x half> %y, %z			%a = fmul fast <4 x half> %y, %z
	%b = fneg <4 x half> %a			%b = fneg <4 x half> %a
	%c = fpext <4 x half> %b to <4 x float>			%c = fpext <4 x half> %b to <4 x float>
	%d = fsub fast <4 x float> %x, %c			%d = fsub fast <4 x float> %x, %c
	ret <4 x float> %d			ret <4 x float> %d
	}			}
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v0, v0, v6			; GFX9-DENORM-NEXT: v_sub_f32_e32 v0, v0, v6
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v1, v1, v4			; GFX9-DENORM-NEXT: v_sub_f32_e32 v1, v1, v4
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v2, v2, v7			; GFX9-DENORM-NEXT: v_sub_f32_e32 v2, v2, v7
	; GFX9-DENORM-NEXT: v_sub_f32_e32 v3, v3, v5			; GFX9-DENORM-NEXT: v_sub_f32_e32 v3, v3, v5
	; GFX9-DENORM-NEXT: ; return to shader part epilog			; GFX9-DENORM-NEXT: ; return to shader part epilog
	;			;
	; GFX10-DENORM-LABEL: test_v4f16_to_v4f32_sub_neg_ext_mul2:			; GFX10-DENORM-LABEL: test_v4f16_to_v4f32_sub_neg_ext_mul2:
	; GFX10-DENORM: ; %bb.0: ; %entry			; GFX10-DENORM: ; %bb.0: ; %entry
	; GFX10-DENORM-NEXT: v_xor_b32_e32 v6, 0x80008000, v6			; GFX10-DENORM-NEXT: v_xor_b32_e32 v8, 0x80008000, v6
	; GFX10-DENORM-NEXT: v_xor_b32_e32 v7, 0x80008000, v7			; GFX10-DENORM-NEXT: v_xor_b32_e32 v9, 0x80008000, v7
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v8, v4			; GFX10-DENORM-NEXT: v_fma_mix_f32 v1, -v4, -v6, v1 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v4, v4 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v3, -v5, -v7, v3 op_sel:[1,1,0] op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v9, v5			; GFX10-DENORM-NEXT: v_fma_mix_f32 v0, -v4, v8, v0 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v5, v5 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1			; GFX10-DENORM-NEXT: v_fma_mix_f32 v2, -v5, v9, v2 op_sel_hi:[1,1,0]
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v10, v6
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v6, v6 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_e32 v11, v7
	; GFX10-DENORM-NEXT: v_cvt_f32_f16_sdwa v7, v7 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:WORD_1
	; GFX10-DENORM-NEXT: v_fma_f32 v0, -v8, v10, v0
	; GFX10-DENORM-NEXT: v_fma_f32 v1, -v4, v6, v1
	; GFX10-DENORM-NEXT: v_fma_f32 v2, -v9, v11, v2
	; GFX10-DENORM-NEXT: v_fma_f32 v3, -v5, v7, v3
	; GFX10-DENORM-NEXT: ; return to shader part epilog			; GFX10-DENORM-NEXT: ; return to shader part epilog
	entry:			entry:
	%a = fmul fast <4 x half> %y, %z			%a = fmul fast <4 x half> %y, %z
	%b = fpext <4 x half> %a to <4 x float>			%b = fpext <4 x half> %a to <4 x float>
	%c = fneg <4 x float> %b			%c = fneg <4 x float> %b
	%d = fsub fast <4 x float> %x, %c			%d = fsub fast <4 x float> %x, %c
	ret <4 x float> %d			ret <4 x float> %d
	}			}
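
The fold rules quoted in the comments of combine-fma-sub-ext-neg-mul.ll, and the negated operands in the new v_fma_mix_f32 check lines, rest on simple sign identities. The following is a minimal sketch, not part of the patch, that checks those identities with exact rational arithmetic. The helper names (`fma`, `sub_ext_neg_mul`, `sub_ext_neg_mul2`) are hypothetical, and rounding, denormal handling and fast-math semantics are deliberately ignored.

```python
from fractions import Fraction
from itertools import product


def fma(a, b, c):
    """Exact multiply-add; an idealized stand-in for v_fma_f32 / v_fma_mix_f32."""
    return a * b + c


def sub_ext_neg_mul(x, y, z):
    # fsub (fpext (fneg (fmul x, y))), z  -- fpext is exact, so it is omitted here
    return -(x * y) - z


def sub_ext_neg_mul2(x, y, z):
    # fsub x, (fpext (fneg (fmul y, z)))
    return x - (-(y * z))


values = [Fraction(n, 4) for n in range(-6, 7)]
checked = 0
for x, y, z in product(values, repeat=3):
    # Fold stated in the test comment: fneg (fma x, y, z)
    assert sub_ext_neg_mul(x, y, z) == -fma(x, y, z)
    # Same value with the negations pushed into operands: fma(v0, -v1, -v2)
    assert sub_ext_neg_mul(x, y, z) == fma(x, -y, -z)
    # Second form folds to a plain fma(y, z, x) ...
    assert sub_ext_neg_mul2(x, y, z) == fma(y, z, x)
    # ... which matches the operand form fma(-v1, -v2, v0)
    assert sub_ext_neg_mul2(x, y, z) == fma(-y, -z, x)
    checked += 1
```

Exact fractions are used so that the check isolates the sign algebra; whether the fused and unfused forms also agree bit-for-bit in half and single precision is a separate question that this sketch does not address.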

llvm/test/CodeGen/AMDGPU/GlobalISel/fmed3.ll

	; SI-NEXT: buffer_load_dword v2, v[0:1], s[8:11], 0 addr64 glc			; SI-NEXT: buffer_load_dword v2, v[0:1], s[8:11], 0 addr64 glc
	; SI-NEXT: s_waitcnt vmcnt(0)			; SI-NEXT: s_waitcnt vmcnt(0)
	; SI-NEXT: s_mov_b64 s[8:9], s[4:5]			; SI-NEXT: s_mov_b64 s[8:9], s[4:5]
	; SI-NEXT: buffer_load_dword v3, v[0:1], s[8:11], 0 addr64 glc			; SI-NEXT: buffer_load_dword v3, v[0:1], s[8:11], 0 addr64 glc
	; SI-NEXT: s_waitcnt vmcnt(0)			; SI-NEXT: s_waitcnt vmcnt(0)
	; SI-NEXT: s_mov_b64 s[8:9], s[6:7]			; SI-NEXT: s_mov_b64 s[8:9], s[6:7]
	; SI-NEXT: buffer_load_dword v4, v[0:1], s[8:11], 0 addr64 glc			; SI-NEXT: buffer_load_dword v4, v[0:1], s[8:11], 0 addr64 glc
	; SI-NEXT: s_waitcnt vmcnt(0)			; SI-NEXT: s_waitcnt vmcnt(0)
	; SI-NEXT: v_sub_f32_e32 v2, 0x80000000, v2			; SI-NEXT: v_med3_f32 v2, -v2, v3, v4
	; SI-NEXT: v_med3_f32 v2, v2, v3, v4
	; SI-NEXT: s_mov_b64 s[2:3], s[10:11]			; SI-NEXT: s_mov_b64 s[2:3], s[10:11]
	; SI-NEXT: buffer_store_dword v2, v[0:1], s[0:3], 0 addr64			; SI-NEXT: buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
	; SI-NEXT: s_endpgm			; SI-NEXT: s_endpgm
	;			;
	; VI-LABEL: v_test_global_nnans_med3_f32_pat0_srcmod0:			; VI-LABEL: v_test_global_nnans_med3_f32_pat0_srcmod0:
	; VI: ; %bb.0:			; VI: ; %bb.0:
	; VI-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24			; VI-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24
	; VI-NEXT: v_lshlrev_b32_e32 v6, 2, v0			; VI-NEXT: v_lshlrev_b32_e32 v6, 2, v0
	; VI-NEXT: flat_load_dword v2, v[2:3] glc			; VI-NEXT: flat_load_dword v2, v[2:3] glc
	; VI-NEXT: s_waitcnt vmcnt(0)			; VI-NEXT: s_waitcnt vmcnt(0)
	; VI-NEXT: flat_load_dword v3, v[4:5] glc			; VI-NEXT: flat_load_dword v3, v[4:5] glc
	; VI-NEXT: s_waitcnt vmcnt(0)			; VI-NEXT: s_waitcnt vmcnt(0)
	; VI-NEXT: v_mov_b32_e32 v0, s0			; VI-NEXT: v_mov_b32_e32 v0, s0
	; VI-NEXT: v_mov_b32_e32 v1, s1			; VI-NEXT: v_mov_b32_e32 v1, s1
	; VI-NEXT: v_add_u32_e32 v0, vcc, v0, v6			; VI-NEXT: v_add_u32_e32 v0, vcc, v0, v6
	; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; VI-NEXT: v_sub_f32_e32 v4, 0x80000000, v7			; VI-NEXT: v_med3_f32 v2, -v7, v2, v3
	; VI-NEXT: v_med3_f32 v2, v4, v2, v3
	; VI-NEXT: flat_store_dword v[0:1], v2			; VI-NEXT: flat_store_dword v[0:1], v2
	; VI-NEXT: s_endpgm			; VI-NEXT: s_endpgm
	;			;
	; GFX9-LABEL: v_test_global_nnans_med3_f32_pat0_srcmod0:			; GFX9-LABEL: v_test_global_nnans_med3_f32_pat0_srcmod0:
	; GFX9: ; %bb.0:			; GFX9: ; %bb.0:
	; GFX9-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24			; GFX9-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24
	; GFX9-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX9-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GFX9-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-NEXT: global_load_dword v1, v0, s[2:3] glc			; GFX9-NEXT: global_load_dword v1, v0, s[2:3] glc
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: global_load_dword v2, v0, s[4:5] glc			; GFX9-NEXT: global_load_dword v2, v0, s[4:5] glc
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: global_load_dword v3, v0, s[6:7] glc			; GFX9-NEXT: global_load_dword v3, v0, s[6:7] glc
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: v_sub_f32_e32 v1, 0x80000000, v1			; GFX9-NEXT: v_med3_f32 v1, -v1, v2, v3
	; GFX9-NEXT: v_med3_f32 v1, v1, v2, v3
	; GFX9-NEXT: global_store_dword v0, v1, s[0:1]			; GFX9-NEXT: global_store_dword v0, v1, s[0:1]
	; GFX9-NEXT: s_endpgm			; GFX9-NEXT: s_endpgm
	;			;
	; GFX10-LABEL: v_test_global_nnans_med3_f32_pat0_srcmod0:			; GFX10-LABEL: v_test_global_nnans_med3_f32_pat0_srcmod0:
	; GFX10: ; %bb.0:			; GFX10: ; %bb.0:
	; GFX10-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24			; GFX10-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24
	; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GFX10-NEXT: s_waitcnt lgkmcnt(0)			; GFX10-NEXT: s_waitcnt lgkmcnt(0)
	; GFX10-NEXT: global_load_dword v1, v0, s[2:3] glc dlc			; GFX10-NEXT: global_load_dword v1, v0, s[2:3] glc dlc
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: global_load_dword v2, v0, s[4:5] glc dlc			; GFX10-NEXT: global_load_dword v2, v0, s[4:5] glc dlc
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: global_load_dword v3, v0, s[6:7] glc dlc			; GFX10-NEXT: global_load_dword v3, v0, s[6:7] glc dlc
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: v_sub_f32_e32 v1, 0x80000000, v1			; GFX10-NEXT: v_med3_f32 v1, -v1, v2, v3
	; GFX10-NEXT: v_med3_f32 v1, v1, v2, v3
	; GFX10-NEXT: global_store_dword v0, v1, s[0:1]			; GFX10-NEXT: global_store_dword v0, v1, s[0:1]
	; GFX10-NEXT: s_endpgm			; GFX10-NEXT: s_endpgm
	;			;
	; GFX11-LABEL: v_test_global_nnans_med3_f32_pat0_srcmod0:			; GFX11-LABEL: v_test_global_nnans_med3_f32_pat0_srcmod0:
	; GFX11: ; %bb.0:			; GFX11: ; %bb.0:
	; GFX11-NEXT: s_load_b256 s[0:7], s[0:1], 0x24			; GFX11-NEXT: s_load_b256 s[0:7], s[0:1], 0x24
	; GFX11-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX11-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GFX11-NEXT: s_waitcnt lgkmcnt(0)			; GFX11-NEXT: s_waitcnt lgkmcnt(0)
	; GFX11-NEXT: global_load_b32 v1, v0, s[2:3] glc dlc			; GFX11-NEXT: global_load_b32 v1, v0, s[2:3] glc dlc
	; GFX11-NEXT: s_waitcnt vmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0)
	; GFX11-NEXT: global_load_b32 v2, v0, s[4:5] glc dlc			; GFX11-NEXT: global_load_b32 v2, v0, s[4:5] glc dlc
	; GFX11-NEXT: s_waitcnt vmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0)
	; GFX11-NEXT: global_load_b32 v3, v0, s[6:7] glc dlc			; GFX11-NEXT: global_load_b32 v3, v0, s[6:7] glc dlc
	; GFX11-NEXT: s_waitcnt vmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0)
	; GFX11-NEXT: v_sub_f32_e32 v1, 0x80000000, v1			; GFX11-NEXT: v_med3_f32 v1, -v1, v2, v3
	; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1)
	; GFX11-NEXT: v_med3_f32 v1, v1, v2, v3
	; GFX11-NEXT: global_store_b32 v0, v1, s[0:1]			; GFX11-NEXT: global_store_b32 v0, v1, s[0:1]
	; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)			; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
	; GFX11-NEXT: s_endpgm			; GFX11-NEXT: s_endpgm
	%tid = call i32 @llvm.amdgcn.workitem.id.x()			%tid = call i32 @llvm.amdgcn.workitem.id.x()
	%gep0 = getelementptr float, float addrspace(1)* %aptr, i32 %tid			%gep0 = getelementptr float, float addrspace(1)* %aptr, i32 %tid
	%gep1 = getelementptr float, float addrspace(1)* %bptr, i32 %tid			%gep1 = getelementptr float, float addrspace(1)* %bptr, i32 %tid
	%gep2 = getelementptr float, float addrspace(1)* %cptr, i32 %tid			%gep2 = getelementptr float, float addrspace(1)* %cptr, i32 %tid
	%outgep = getelementptr float, float addrspace(1)* %out, i32 %tid			%outgep = getelementptr float, float addrspace(1)* %out, i32 %tid
	; SI-NEXT: buffer_load_dword v2, v[0:1], s[8:11], 0 addr64 glc			; SI-NEXT: buffer_load_dword v2, v[0:1], s[8:11], 0 addr64 glc
	; SI-NEXT: s_waitcnt vmcnt(0)			; SI-NEXT: s_waitcnt vmcnt(0)
	; SI-NEXT: s_mov_b64 s[8:9], s[4:5]			; SI-NEXT: s_mov_b64 s[8:9], s[4:5]
	; SI-NEXT: buffer_load_dword v3, v[0:1], s[8:11], 0 addr64 glc			; SI-NEXT: buffer_load_dword v3, v[0:1], s[8:11], 0 addr64 glc
	; SI-NEXT: s_waitcnt vmcnt(0)			; SI-NEXT: s_waitcnt vmcnt(0)
	; SI-NEXT: s_mov_b64 s[8:9], s[6:7]			; SI-NEXT: s_mov_b64 s[8:9], s[6:7]
	; SI-NEXT: buffer_load_dword v4, v[0:1], s[8:11], 0 addr64 glc			; SI-NEXT: buffer_load_dword v4, v[0:1], s[8:11], 0 addr64 glc
	; SI-NEXT: s_waitcnt vmcnt(0)			; SI-NEXT: s_waitcnt vmcnt(0)
	; SI-NEXT: v_sub_f32_e32 v2, 0x80000000, v2			; SI-NEXT: v_mul_f32_e64 v2, 1.0, -v2
	; SI-NEXT: v_mul_f32_e32 v2, 1.0, v2
	; SI-NEXT: v_mul_f32_e32 v3, 1.0, v3			; SI-NEXT: v_mul_f32_e32 v3, 1.0, v3
	; SI-NEXT: v_min_f32_e32 v5, v2, v3			; SI-NEXT: v_min_f32_e32 v5, v2, v3
	; SI-NEXT: v_max_f32_e32 v2, v2, v3			; SI-NEXT: v_max_f32_e32 v2, v2, v3
	; SI-NEXT: v_mul_f32_e32 v3, 1.0, v4			; SI-NEXT: v_mul_f32_e32 v3, 1.0, v4
	; SI-NEXT: v_min_f32_e32 v2, v2, v3			; SI-NEXT: v_min_f32_e32 v2, v2, v3
	; SI-NEXT: v_max_f32_e32 v2, v5, v2			; SI-NEXT: v_max_f32_e32 v2, v5, v2
	; SI-NEXT: s_mov_b64 s[2:3], s[10:11]			; SI-NEXT: s_mov_b64 s[2:3], s[10:11]
	; SI-NEXT: buffer_store_dword v2, v[0:1], s[0:3], 0 addr64			; SI-NEXT: buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
	; VI-NEXT: v_add_u32_e32 v0, vcc, v4, v6			; VI-NEXT: v_add_u32_e32 v0, vcc, v4, v6
	; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v5, vcc			; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v5, vcc
	; VI-NEXT: flat_load_dword v3, v[0:1] glc			; VI-NEXT: flat_load_dword v3, v[0:1] glc
	; VI-NEXT: s_waitcnt vmcnt(0)			; VI-NEXT: s_waitcnt vmcnt(0)
	; VI-NEXT: v_mov_b32_e32 v0, s0			; VI-NEXT: v_mov_b32_e32 v0, s0
	; VI-NEXT: v_mov_b32_e32 v1, s1			; VI-NEXT: v_mov_b32_e32 v1, s1
	; VI-NEXT: v_add_u32_e32 v0, vcc, v0, v6			; VI-NEXT: v_add_u32_e32 v0, vcc, v0, v6
	; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; VI-NEXT: v_sub_f32_e32 v4, 0x80000000, v7			; VI-NEXT: v_mul_f32_e64 v4, 1.0, -v7
	; VI-NEXT: v_mul_f32_e32 v2, 1.0, v2			; VI-NEXT: v_mul_f32_e32 v2, 1.0, v2
	; VI-NEXT: v_mul_f32_e32 v4, 1.0, v4
	; VI-NEXT: v_min_f32_e32 v5, v4, v2			; VI-NEXT: v_min_f32_e32 v5, v4, v2
	; VI-NEXT: v_max_f32_e32 v2, v4, v2			; VI-NEXT: v_max_f32_e32 v2, v4, v2
	; VI-NEXT: v_mul_f32_e32 v3, 1.0, v3			; VI-NEXT: v_mul_f32_e32 v3, 1.0, v3
	; VI-NEXT: v_min_f32_e32 v2, v2, v3			; VI-NEXT: v_min_f32_e32 v2, v2, v3
	; VI-NEXT: v_max_f32_e32 v2, v5, v2			; VI-NEXT: v_max_f32_e32 v2, v5, v2
	; VI-NEXT: flat_store_dword v[0:1], v2			; VI-NEXT: flat_store_dword v[0:1], v2
	; VI-NEXT: s_endpgm			; VI-NEXT: s_endpgm
	;			;
	; GFX9-LABEL: v_test_no_global_nnans_med3_f32_pat0_srcmod0:			; GFX9-LABEL: v_test_no_global_nnans_med3_f32_pat0_srcmod0:
	; GFX9: ; %bb.0:			; GFX9: ; %bb.0:
	; GFX9-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24			; GFX9-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24
	; GFX9-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX9-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GFX9-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-NEXT: global_load_dword v1, v0, s[2:3] glc			; GFX9-NEXT: global_load_dword v1, v0, s[2:3] glc
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: global_load_dword v2, v0, s[4:5] glc			; GFX9-NEXT: global_load_dword v2, v0, s[4:5] glc
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: global_load_dword v3, v0, s[6:7] glc			; GFX9-NEXT: global_load_dword v3, v0, s[6:7] glc
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: v_sub_f32_e32 v1, 0x80000000, v1			; GFX9-NEXT: v_max_f32_e64 v1, -v1, -v1
	; GFX9-NEXT: v_max_f32_e32 v2, v2, v2			; GFX9-NEXT: v_max_f32_e32 v2, v2, v2
	; GFX9-NEXT: v_max_f32_e32 v1, v1, v1
	; GFX9-NEXT: v_min_f32_e32 v4, v1, v2			; GFX9-NEXT: v_min_f32_e32 v4, v1, v2
	; GFX9-NEXT: v_max_f32_e32 v1, v1, v2			; GFX9-NEXT: v_max_f32_e32 v1, v1, v2
	; GFX9-NEXT: v_max_f32_e32 v2, v3, v3			; GFX9-NEXT: v_max_f32_e32 v2, v3, v3
	; GFX9-NEXT: v_min_f32_e32 v1, v1, v2			; GFX9-NEXT: v_min_f32_e32 v1, v1, v2
	; GFX9-NEXT: v_max_f32_e32 v1, v4, v1			; GFX9-NEXT: v_max_f32_e32 v1, v4, v1
	; GFX9-NEXT: global_store_dword v0, v1, s[0:1]			; GFX9-NEXT: global_store_dword v0, v1, s[0:1]
	; GFX9-NEXT: s_endpgm			; GFX9-NEXT: s_endpgm
	;			;
	; GFX10-LABEL: v_test_no_global_nnans_med3_f32_pat0_srcmod0:			; GFX10-LABEL: v_test_no_global_nnans_med3_f32_pat0_srcmod0:
	; GFX10: ; %bb.0:			; GFX10: ; %bb.0:
	; GFX10-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24			; GFX10-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24
	; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GFX10-NEXT: s_waitcnt lgkmcnt(0)			; GFX10-NEXT: s_waitcnt lgkmcnt(0)
	; GFX10-NEXT: global_load_dword v1, v0, s[2:3] glc dlc			; GFX10-NEXT: global_load_dword v1, v0, s[2:3] glc dlc
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: global_load_dword v2, v0, s[4:5] glc dlc			; GFX10-NEXT: global_load_dword v2, v0, s[4:5] glc dlc
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: global_load_dword v3, v0, s[6:7] glc dlc			; GFX10-NEXT: global_load_dword v3, v0, s[6:7] glc dlc
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: v_sub_f32_e32 v1, 0x80000000, v1			; GFX10-NEXT: v_max_f32_e64 v1, -v1, -v1
	; GFX10-NEXT: v_max_f32_e32 v2, v2, v2			; GFX10-NEXT: v_max_f32_e32 v2, v2, v2
	; GFX10-NEXT: v_max_f32_e32 v3, v3, v3			; GFX10-NEXT: v_max_f32_e32 v3, v3, v3
	; GFX10-NEXT: v_max_f32_e32 v1, v1, v1
	; GFX10-NEXT: v_max_f32_e32 v4, v1, v2			; GFX10-NEXT: v_max_f32_e32 v4, v1, v2
	; GFX10-NEXT: v_min_f32_e32 v1, v1, v2			; GFX10-NEXT: v_min_f32_e32 v1, v1, v2
	; GFX10-NEXT: v_min_f32_e32 v2, v4, v3			; GFX10-NEXT: v_min_f32_e32 v2, v4, v3
	; GFX10-NEXT: v_max_f32_e32 v1, v1, v2			; GFX10-NEXT: v_max_f32_e32 v1, v1, v2
	; GFX10-NEXT: global_store_dword v0, v1, s[0:1]			; GFX10-NEXT: global_store_dword v0, v1, s[0:1]
	; GFX10-NEXT: s_endpgm			; GFX10-NEXT: s_endpgm
	;			;
	; GFX11-LABEL: v_test_no_global_nnans_med3_f32_pat0_srcmod0:			; GFX11-LABEL: v_test_no_global_nnans_med3_f32_pat0_srcmod0:
	; GFX11: ; %bb.0:			; GFX11: ; %bb.0:
	; GFX11-NEXT: s_load_b256 s[0:7], s[0:1], 0x24			; GFX11-NEXT: s_load_b256 s[0:7], s[0:1], 0x24
	; GFX11-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX11-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GFX11-NEXT: s_waitcnt lgkmcnt(0)			; GFX11-NEXT: s_waitcnt lgkmcnt(0)
	; GFX11-NEXT: global_load_b32 v1, v0, s[2:3] glc dlc			; GFX11-NEXT: global_load_b32 v1, v0, s[2:3] glc dlc
	; GFX11-NEXT: s_waitcnt vmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0)
	; GFX11-NEXT: global_load_b32 v2, v0, s[4:5] glc dlc			; GFX11-NEXT: global_load_b32 v2, v0, s[4:5] glc dlc
	; GFX11-NEXT: s_waitcnt vmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0)
	; GFX11-NEXT: global_load_b32 v3, v0, s[6:7] glc dlc			; GFX11-NEXT: global_load_b32 v3, v0, s[6:7] glc dlc
	; GFX11-NEXT: s_waitcnt vmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0)
	; GFX11-NEXT: v_dual_sub_f32 v1, 0x80000000, v1 :: v_dual_max_f32 v2, v2, v2			; GFX11-NEXT: v_max_f32_e64 v1, -v1, -v1
	; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(NEXT) | instid1(VALU_DEP_1)			; GFX11-NEXT: v_max_f32_e32 v2, v2, v2
	; GFX11-NEXT: v_max_f32_e32 v1, v1, v1			; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_1)
	; GFX11-NEXT: v_min_f32_e32 v4, v1, v2			; GFX11-NEXT: v_min_f32_e32 v4, v1, v2
	; GFX11-NEXT: v_dual_max_f32 v1, v1, v2 :: v_dual_max_f32 v2, v3, v3			; GFX11-NEXT: v_dual_max_f32 v1, v1, v2 :: v_dual_max_f32 v2, v3, v3
	; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1)
	; GFX11-NEXT: v_minmax_f32 v1, v1, v2, v4			; GFX11-NEXT: v_minmax_f32 v1, v1, v2, v4
	; GFX11-NEXT: global_store_b32 v0, v1, s[0:1]			; GFX11-NEXT: global_store_b32 v0, v1, s[0:1]
	; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)			; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
	; GFX11-NEXT: s_endpgm			; GFX11-NEXT: s_endpgm
	%tid = call i32 @llvm.amdgcn.workitem.id.x()			%tid = call i32 @llvm.amdgcn.workitem.id.x()
	%gep0 = getelementptr float, float addrspace(1)* %aptr, i32 %tid			%gep0 = getelementptr float, float addrspace(1)* %aptr, i32 %tid
	%gep1 = getelementptr float, float addrspace(1)* %bptr, i32 %tid			%gep1 = getelementptr float, float addrspace(1)* %bptr, i32 %tid
	%gep2 = getelementptr float, float addrspace(1)* %cptr, i32 %tid			%gep2 = getelementptr float, float addrspace(1)* %cptr, i32 %tid
	; SI-NEXT: buffer_load_dword v2, v[0:1], s[8:11], 0 addr64 glc			; SI-NEXT: buffer_load_dword v2, v[0:1], s[8:11], 0 addr64 glc
	; SI-NEXT: s_waitcnt vmcnt(0)			; SI-NEXT: s_waitcnt vmcnt(0)
	; SI-NEXT: s_mov_b64 s[8:9], s[4:5]			; SI-NEXT: s_mov_b64 s[8:9], s[4:5]
	; SI-NEXT: buffer_load_dword v3, v[0:1], s[8:11], 0 addr64 glc			; SI-NEXT: buffer_load_dword v3, v[0:1], s[8:11], 0 addr64 glc
	; SI-NEXT: s_waitcnt vmcnt(0)			; SI-NEXT: s_waitcnt vmcnt(0)
	; SI-NEXT: s_mov_b64 s[8:9], s[6:7]			; SI-NEXT: s_mov_b64 s[8:9], s[6:7]
	; SI-NEXT: buffer_load_dword v4, v[0:1], s[8:11], 0 addr64 glc			; SI-NEXT: buffer_load_dword v4, v[0:1], s[8:11], 0 addr64 glc
	; SI-NEXT: s_waitcnt vmcnt(0)			; SI-NEXT: s_waitcnt vmcnt(0)
	; SI-NEXT: s_mov_b32 s2, 0x80000000			; SI-NEXT: v_med3_f32 v2, -v2, |v3|, -|v4|
	; SI-NEXT: v_sub_f32_e32 v2, 0x80000000, v2
	; SI-NEXT: v_sub_f32_e64 v4, s2, |v4|
	; SI-NEXT: v_med3_f32 v2, v2, |v3|, v4
	; SI-NEXT: s_mov_b64 s[2:3], s[10:11]			; SI-NEXT: s_mov_b64 s[2:3], s[10:11]
	; SI-NEXT: buffer_store_dword v2, v[0:1], s[0:3], 0 addr64			; SI-NEXT: buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
	; SI-NEXT: s_endpgm			; SI-NEXT: s_endpgm
	;			;
	; VI-LABEL: v_test_global_nnans_med3_f32_pat0_srcmod012:			; VI-LABEL: v_test_global_nnans_med3_f32_pat0_srcmod012:
	; VI: ; %bb.0:			; VI: ; %bb.0:
	; VI-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24			; VI-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24
	; VI-NEXT: v_lshlrev_b32_e32 v6, 2, v0			; VI-NEXT: v_lshlrev_b32_e32 v6, 2, v0
	; VI-NEXT: v_add_u32_e32 v4, vcc, v4, v6			; VI-NEXT: v_add_u32_e32 v4, vcc, v4, v6
	; VI-NEXT: v_addc_u32_e32 v5, vcc, 0, v5, vcc			; VI-NEXT: v_addc_u32_e32 v5, vcc, 0, v5, vcc
	; VI-NEXT: flat_load_dword v7, v[0:1] glc			; VI-NEXT: flat_load_dword v7, v[0:1] glc
	; VI-NEXT: s_waitcnt vmcnt(0)			; VI-NEXT: s_waitcnt vmcnt(0)
	; VI-NEXT: flat_load_dword v2, v[2:3] glc			; VI-NEXT: flat_load_dword v2, v[2:3] glc
	; VI-NEXT: s_waitcnt vmcnt(0)			; VI-NEXT: s_waitcnt vmcnt(0)
	; VI-NEXT: flat_load_dword v3, v[4:5] glc			; VI-NEXT: flat_load_dword v3, v[4:5] glc
	; VI-NEXT: s_waitcnt vmcnt(0)			; VI-NEXT: s_waitcnt vmcnt(0)
	; VI-NEXT: s_mov_b32 s2, 0x80000000
	; VI-NEXT: v_mov_b32_e32 v0, s0			; VI-NEXT: v_mov_b32_e32 v0, s0
	; VI-NEXT: v_mov_b32_e32 v1, s1			; VI-NEXT: v_mov_b32_e32 v1, s1
	; VI-NEXT: v_add_u32_e32 v0, vcc, v0, v6			; VI-NEXT: v_add_u32_e32 v0, vcc, v0, v6
	; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; VI-NEXT: v_sub_f32_e32 v4, 0x80000000, v7			; VI-NEXT: v_med3_f32 v2, -v7, |v2|, -|v3|
	; VI-NEXT: v_sub_f32_e64 v3, s2, |v3|
	; VI-NEXT: v_med3_f32 v2, v4, |v2|, v3
	; VI-NEXT: flat_store_dword v[0:1], v2			; VI-NEXT: flat_store_dword v[0:1], v2
	; VI-NEXT: s_endpgm			; VI-NEXT: s_endpgm
	;			;
	; GFX9-LABEL: v_test_global_nnans_med3_f32_pat0_srcmod012:			; GFX9-LABEL: v_test_global_nnans_med3_f32_pat0_srcmod012:
	; GFX9: ; %bb.0:			; GFX9: ; %bb.0:
	; GFX9-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24			; GFX9-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24
	; GFX9-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX9-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GFX9-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-NEXT: global_load_dword v1, v0, s[2:3] glc			; GFX9-NEXT: global_load_dword v1, v0, s[2:3] glc
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: global_load_dword v2, v0, s[4:5] glc			; GFX9-NEXT: global_load_dword v2, v0, s[4:5] glc
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: global_load_dword v3, v0, s[6:7] glc			; GFX9-NEXT: global_load_dword v3, v0, s[6:7] glc
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: s_mov_b32 s2, 0x80000000			; GFX9-NEXT: v_med3_f32 v1, -v1, |v2|, -|v3|
	; GFX9-NEXT: v_sub_f32_e32 v1, 0x80000000, v1
	; GFX9-NEXT: v_sub_f32_e64 v3, s2, |v3|
	; GFX9-NEXT: v_med3_f32 v1, v1, |v2|, v3
	; GFX9-NEXT: global_store_dword v0, v1, s[0:1]			; GFX9-NEXT: global_store_dword v0, v1, s[0:1]
	; GFX9-NEXT: s_endpgm			; GFX9-NEXT: s_endpgm
	;			;
	; GFX10-LABEL: v_test_global_nnans_med3_f32_pat0_srcmod012:			; GFX10-LABEL: v_test_global_nnans_med3_f32_pat0_srcmod012:
	; GFX10: ; %bb.0:			; GFX10: ; %bb.0:
	; GFX10-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24			; GFX10-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24
	; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GFX10-NEXT: s_waitcnt lgkmcnt(0)			; GFX10-NEXT: s_waitcnt lgkmcnt(0)
	; GFX10-NEXT: global_load_dword v1, v0, s[2:3] glc dlc			; GFX10-NEXT: global_load_dword v1, v0, s[2:3] glc dlc
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: global_load_dword v2, v0, s[4:5] glc dlc			; GFX10-NEXT: global_load_dword v2, v0, s[4:5] glc dlc
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: global_load_dword v3, v0, s[6:7] glc dlc			; GFX10-NEXT: global_load_dword v3, v0, s[6:7] glc dlc
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: v_sub_f32_e32 v1, 0x80000000, v1			; GFX10-NEXT: v_med3_f32 v1, -v1, |v2|, -|v3|
	; GFX10-NEXT: v_sub_f32_e64 v3, 0x80000000, |v3|
	; GFX10-NEXT: v_med3_f32 v1, v1, |v2|, v3
	; GFX10-NEXT: global_store_dword v0, v1, s[0:1]			; GFX10-NEXT: global_store_dword v0, v1, s[0:1]
	; GFX10-NEXT: s_endpgm			; GFX10-NEXT: s_endpgm
	;			;
	; GFX11-LABEL: v_test_global_nnans_med3_f32_pat0_srcmod012:			; GFX11-LABEL: v_test_global_nnans_med3_f32_pat0_srcmod012:
	; GFX11: ; %bb.0:			; GFX11: ; %bb.0:
	; GFX11-NEXT: s_load_b256 s[0:7], s[0:1], 0x24			; GFX11-NEXT: s_load_b256 s[0:7], s[0:1], 0x24
	; GFX11-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX11-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GFX11-NEXT: s_waitcnt lgkmcnt(0)			; GFX11-NEXT: s_waitcnt lgkmcnt(0)
	; GFX11-NEXT: global_load_b32 v1, v0, s[2:3] glc dlc			; GFX11-NEXT: global_load_b32 v1, v0, s[2:3] glc dlc
	; GFX11-NEXT: s_waitcnt vmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0)
	; GFX11-NEXT: global_load_b32 v2, v0, s[4:5] glc dlc			; GFX11-NEXT: global_load_b32 v2, v0, s[4:5] glc dlc
	; GFX11-NEXT: s_waitcnt vmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0)
	; GFX11-NEXT: global_load_b32 v3, v0, s[6:7] glc dlc			; GFX11-NEXT: global_load_b32 v3, v0, s[6:7] glc dlc
	; GFX11-NEXT: s_waitcnt vmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0)
	; GFX11-NEXT: v_sub_f32_e32 v1, 0x80000000, v1			; GFX11-NEXT: v_med3_f32 v1, -v1, |v2|, -|v3|
	; GFX11-NEXT: v_sub_f32_e64 v3, 0x80000000, |v3|
	; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1)
	; GFX11-NEXT: v_med3_f32 v1, v1, |v2|, v3
	; GFX11-NEXT: global_store_b32 v0, v1, s[0:1]			; GFX11-NEXT: global_store_b32 v0, v1, s[0:1]
	; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)			; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
	; GFX11-NEXT: s_endpgm			; GFX11-NEXT: s_endpgm
	%tid = call i32 @llvm.amdgcn.workitem.id.x()			%tid = call i32 @llvm.amdgcn.workitem.id.x()
	%gep0 = getelementptr float, float addrspace(1)* %aptr, i32 %tid			%gep0 = getelementptr float, float addrspace(1)* %aptr, i32 %tid
	%gep1 = getelementptr float, float addrspace(1)* %bptr, i32 %tid			%gep1 = getelementptr float, float addrspace(1)* %bptr, i32 %tid
	%gep2 = getelementptr float, float addrspace(1)* %cptr, i32 %tid			%gep2 = getelementptr float, float addrspace(1)* %cptr, i32 %tid
	%outgep = getelementptr float, float addrspace(1)* %out, i32 %tid			%outgep = getelementptr float, float addrspace(1)* %out, i32 %tid
	; SI-NEXT: buffer_load_dword v2, v[0:1], s[8:11], 0 addr64 glc			; SI-NEXT: buffer_load_dword v2, v[0:1], s[8:11], 0 addr64 glc
	; SI-NEXT: s_waitcnt vmcnt(0)			; SI-NEXT: s_waitcnt vmcnt(0)
	; SI-NEXT: s_mov_b64 s[8:9], s[4:5]			; SI-NEXT: s_mov_b64 s[8:9], s[4:5]
	; SI-NEXT: buffer_load_dword v3, v[0:1], s[8:11], 0 addr64 glc			; SI-NEXT: buffer_load_dword v3, v[0:1], s[8:11], 0 addr64 glc
	; SI-NEXT: s_waitcnt vmcnt(0)			; SI-NEXT: s_waitcnt vmcnt(0)
	; SI-NEXT: s_mov_b64 s[8:9], s[6:7]			; SI-NEXT: s_mov_b64 s[8:9], s[6:7]
	; SI-NEXT: buffer_load_dword v4, v[0:1], s[8:11], 0 addr64 glc			; SI-NEXT: buffer_load_dword v4, v[0:1], s[8:11], 0 addr64 glc
	; SI-NEXT: s_waitcnt vmcnt(0)			; SI-NEXT: s_waitcnt vmcnt(0)
	; SI-NEXT: s_mov_b32 s2, 0x80000000			; SI-NEXT: v_med3_f32 v2, -|v2|, -|v3|, -|v4|
	; SI-NEXT: v_sub_f32_e64 v2, s2, |v2|
	; SI-NEXT: v_sub_f32_e64 v3, s2, |v3|
	; SI-NEXT: v_sub_f32_e64 v4, s2, |v4|
	; SI-NEXT: v_med3_f32 v2, v2, v3, v4
	; SI-NEXT: s_mov_b64 s[2:3], s[10:11]			; SI-NEXT: s_mov_b64 s[2:3], s[10:11]
	; SI-NEXT: buffer_store_dword v2, v[0:1], s[0:3], 0 addr64			; SI-NEXT: buffer_store_dword v2, v[0:1], s[0:3], 0 addr64
	; SI-NEXT: s_endpgm			; SI-NEXT: s_endpgm
	;			;
	; VI-LABEL: v_test_global_nnans_med3_f32_pat0_negabs012:			; VI-LABEL: v_test_global_nnans_med3_f32_pat0_negabs012:
	; VI: ; %bb.0:			; VI: ; %bb.0:
	; VI-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24			; VI-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24
	; VI-NEXT: v_lshlrev_b32_e32 v6, 2, v0			; VI-NEXT: v_lshlrev_b32_e32 v6, 2, v0
	; VI-NEXT: v_add_u32_e32 v4, vcc, v4, v6			; VI-NEXT: v_add_u32_e32 v4, vcc, v4, v6
	; VI-NEXT: v_addc_u32_e32 v5, vcc, 0, v5, vcc			; VI-NEXT: v_addc_u32_e32 v5, vcc, 0, v5, vcc
	; VI-NEXT: flat_load_dword v7, v[0:1] glc			; VI-NEXT: flat_load_dword v7, v[0:1] glc
	; VI-NEXT: s_waitcnt vmcnt(0)			; VI-NEXT: s_waitcnt vmcnt(0)
	; VI-NEXT: flat_load_dword v2, v[2:3] glc			; VI-NEXT: flat_load_dword v2, v[2:3] glc
	; VI-NEXT: s_waitcnt vmcnt(0)			; VI-NEXT: s_waitcnt vmcnt(0)
	; VI-NEXT: flat_load_dword v3, v[4:5] glc			; VI-NEXT: flat_load_dword v3, v[4:5] glc
	; VI-NEXT: s_waitcnt vmcnt(0)			; VI-NEXT: s_waitcnt vmcnt(0)
	; VI-NEXT: s_mov_b32 s2, 0x80000000
	; VI-NEXT: v_mov_b32_e32 v0, s0			; VI-NEXT: v_mov_b32_e32 v0, s0
	; VI-NEXT: v_mov_b32_e32 v1, s1			; VI-NEXT: v_mov_b32_e32 v1, s1
	; VI-NEXT: v_add_u32_e32 v0, vcc, v0, v6			; VI-NEXT: v_add_u32_e32 v0, vcc, v0, v6
	; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc			; VI-NEXT: v_addc_u32_e32 v1, vcc, 0, v1, vcc
	; VI-NEXT: v_sub_f32_e64 v4, s2, |v7|			; VI-NEXT: v_med3_f32 v2, -|v7|, -|v2|, -|v3|
	; VI-NEXT: v_sub_f32_e64 v2, s2, |v2|
	; VI-NEXT: v_sub_f32_e64 v3, s2, |v3|
	; VI-NEXT: v_med3_f32 v2, v4, v2, v3
	; VI-NEXT: flat_store_dword v[0:1], v2			; VI-NEXT: flat_store_dword v[0:1], v2
	; VI-NEXT: s_endpgm			; VI-NEXT: s_endpgm
	;			;
	; GFX9-LABEL: v_test_global_nnans_med3_f32_pat0_negabs012:			; GFX9-LABEL: v_test_global_nnans_med3_f32_pat0_negabs012:
	; GFX9: ; %bb.0:			; GFX9: ; %bb.0:
	; GFX9-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24			; GFX9-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24
	; GFX9-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX9-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GFX9-NEXT: s_waitcnt lgkmcnt(0)			; GFX9-NEXT: s_waitcnt lgkmcnt(0)
	; GFX9-NEXT: global_load_dword v1, v0, s[2:3] glc			; GFX9-NEXT: global_load_dword v1, v0, s[2:3] glc
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: global_load_dword v2, v0, s[4:5] glc			; GFX9-NEXT: global_load_dword v2, v0, s[4:5] glc
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: global_load_dword v3, v0, s[6:7] glc			; GFX9-NEXT: global_load_dword v3, v0, s[6:7] glc
	; GFX9-NEXT: s_waitcnt vmcnt(0)			; GFX9-NEXT: s_waitcnt vmcnt(0)
	; GFX9-NEXT: s_mov_b32 s2, 0x80000000			; GFX9-NEXT: v_med3_f32 v1, -|v1|, -|v2|, -|v3|
	; GFX9-NEXT: v_sub_f32_e64 v1, s2, |v1|
	; GFX9-NEXT: v_sub_f32_e64 v2, s2, |v2|
	; GFX9-NEXT: v_sub_f32_e64 v3, s2, |v3|
	; GFX9-NEXT: v_med3_f32 v1, v1, v2, v3
	; GFX9-NEXT: global_store_dword v0, v1, s[0:1]			; GFX9-NEXT: global_store_dword v0, v1, s[0:1]
	; GFX9-NEXT: s_endpgm			; GFX9-NEXT: s_endpgm
	;			;
	; GFX10-LABEL: v_test_global_nnans_med3_f32_pat0_negabs012:			; GFX10-LABEL: v_test_global_nnans_med3_f32_pat0_negabs012:
	; GFX10: ; %bb.0:			; GFX10: ; %bb.0:
	; GFX10-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24			; GFX10-NEXT: s_load_dwordx8 s[0:7], s[0:1], 0x24
	; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX10-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GFX10-NEXT: s_waitcnt lgkmcnt(0)			; GFX10-NEXT: s_waitcnt lgkmcnt(0)
	; GFX10-NEXT: global_load_dword v1, v0, s[2:3] glc dlc			; GFX10-NEXT: global_load_dword v1, v0, s[2:3] glc dlc
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: global_load_dword v2, v0, s[4:5] glc dlc			; GFX10-NEXT: global_load_dword v2, v0, s[4:5] glc dlc
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: global_load_dword v3, v0, s[6:7] glc dlc			; GFX10-NEXT: global_load_dword v3, v0, s[6:7] glc dlc
	; GFX10-NEXT: s_waitcnt vmcnt(0)			; GFX10-NEXT: s_waitcnt vmcnt(0)
	; GFX10-NEXT: v_sub_f32_e64 v1, 0x80000000, |v1|			; GFX10-NEXT: v_med3_f32 v1, -|v1|, -|v2|, -|v3|
	; GFX10-NEXT: v_sub_f32_e64 v2, 0x80000000, |v2|
	; GFX10-NEXT: v_sub_f32_e64 v3, 0x80000000, |v3|
	; GFX10-NEXT: v_med3_f32 v1, v1, v2, v3
	; GFX10-NEXT: global_store_dword v0, v1, s[0:1]			; GFX10-NEXT: global_store_dword v0, v1, s[0:1]
	; GFX10-NEXT: s_endpgm			; GFX10-NEXT: s_endpgm
	;			;
	; GFX11-LABEL: v_test_global_nnans_med3_f32_pat0_negabs012:			; GFX11-LABEL: v_test_global_nnans_med3_f32_pat0_negabs012:
	; GFX11: ; %bb.0:			; GFX11: ; %bb.0:
	; GFX11-NEXT: s_load_b256 s[0:7], s[0:1], 0x24			; GFX11-NEXT: s_load_b256 s[0:7], s[0:1], 0x24
	; GFX11-NEXT: v_lshlrev_b32_e32 v0, 2, v0			; GFX11-NEXT: v_lshlrev_b32_e32 v0, 2, v0
	; GFX11-NEXT: s_waitcnt lgkmcnt(0)			; GFX11-NEXT: s_waitcnt lgkmcnt(0)
	; GFX11-NEXT: global_load_b32 v1, v0, s[2:3] glc dlc			; GFX11-NEXT: global_load_b32 v1, v0, s[2:3] glc dlc
	; GFX11-NEXT: s_waitcnt vmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0)
	; GFX11-NEXT: global_load_b32 v2, v0, s[4:5] glc dlc			; GFX11-NEXT: global_load_b32 v2, v0, s[4:5] glc dlc
	; GFX11-NEXT: s_waitcnt vmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0)
	; GFX11-NEXT: global_load_b32 v3, v0, s[6:7] glc dlc			; GFX11-NEXT: global_load_b32 v3, v0, s[6:7] glc dlc
	; GFX11-NEXT: s_waitcnt vmcnt(0)			; GFX11-NEXT: s_waitcnt vmcnt(0)
	; GFX11-NEXT: v_sub_f32_e64 v1, 0x80000000, |v1|			; GFX11-NEXT: v_med3_f32 v1, -|v1|, -|v2|, -|v3|
	; GFX11-NEXT: v_sub_f32_e64 v2, 0x80000000, |v2|
	; GFX11-NEXT: v_sub_f32_e64 v3, 0x80000000, |v3|
	; GFX11-NEXT: s_delay_alu instid0(VALU_DEP_1)
	; GFX11-NEXT: v_med3_f32 v1, v1, v2, v3
	; GFX11-NEXT: global_store_b32 v0, v1, s[0:1]			; GFX11-NEXT: global_store_b32 v0, v1, s[0:1]
	; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)			; GFX11-NEXT: s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
	; GFX11-NEXT: s_endpgm			; GFX11-NEXT: s_endpgm
	%tid = call i32 @llvm.amdgcn.workitem.id.x()			%tid = call i32 @llvm.amdgcn.workitem.id.x()
	%gep0 = getelementptr float, float addrspace(1)* %aptr, i32 %tid			%gep0 = getelementptr float, float addrspace(1)* %aptr, i32 %tid
	%gep1 = getelementptr float, float addrspace(1)* %bptr, i32 %tid			%gep1 = getelementptr float, float addrspace(1)* %bptr, i32 %tid
	%gep2 = getelementptr float, float addrspace(1)* %cptr, i32 %tid			%gep2 = getelementptr float, float addrspace(1)* %cptr, i32 %tid
	%outgep = getelementptr float, float addrspace(1)* %out, i32 %tid			%outgep = getelementptr float, float addrspace(1)* %out, i32 %tid
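
For reference, the two columns above differ mainly in how negation reaches `v_med3_f32`: one side materialises it with `v_sub_f32` from the constant `0x80000000` (the bit pattern of -0.0f), the other folds it into source modifiers such as `-v1` and `-|v3|`. Below is a minimal Python sketch of why the two forms agree for ordinary values. It assumes round-to-nearest and deliberately ignores NaNs and denormal flushing; the helper names are illustrative only and are not LLVM APIs.

```python
import struct

def f32_bits(x: float) -> int:
    """Bit pattern of x after rounding to IEEE single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_f32(b: int) -> float:
    """Interpret a 32-bit pattern as an IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]

NEG_ZERO = bits_f32(0x80000000)  # the literal used by v_sub_f32 above

def neg_via_sub(x: float) -> float:
    # Models `v_sub_f32 dst, 0x80000000, x`, i.e. (-0.0) - x.
    return NEG_ZERO - x

def neg_via_modifier(x: float) -> float:
    # Models the `-x` source modifier: flip the sign bit only.
    return bits_f32(f32_bits(x) ^ 0x80000000)

def med3(a: float, b: float, c: float) -> float:
    # Median of three; NaN handling is intentionally left out.
    return sorted((a, b, c))[1]

# med3(-a, |b|, -|c|), mirroring the srcmod012 operand shape.
a, b, c = 1.0, -2.0, 3.0
folded = med3(neg_via_modifier(a), abs(b), neg_via_modifier(abs(c)))
unfolded = med3(neg_via_sub(a), abs(b), neg_via_sub(abs(c)))
```

Both `folded` and `unfolded` come out as the same value here; the NaN and denormal cases are exactly where the real legality rules (such as `isCanonicalized`) matter, and this sketch does not model them.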

llvm/test/CodeGen/AMDGPU/GlobalISel/mad-mix-hi.ll

This file was added.

				; RUN: llc -global-isel -march=amdgcn -mcpu=gfx900 -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX9 %s
				; RUN: llc -global-isel -march=amdgcn -mcpu=fiji -verify-machineinstrs < %s | FileCheck -enable-var-scope --check-prefix=GCN %s
				; RUN: llc -global-isel -march=amdgcn -mcpu=hawaii -verify-machineinstrs < %s | FileCheck -enable-var-scope --check-prefix=GCN %s

				; GCN-LABEL: {{^}}v_mad_mixhi_f16_f16lo_f16lo_f16lo_undeflo:
				; GFX9: s_waitcnt
				; GFX9-NEXT: v_mad_mixhi_f16 v0, v0, v1, v2
				; GFX9-NEXT: s_setpc_b64
				define <2 x half> @v_mad_mixhi_f16_f16lo_f16lo_f16lo_undeflo(half %src0, half %src1, half %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				%cvt.result = fptrunc float %result to half
				%vec.result = insertelement <2 x half> undef, half %cvt.result, i32 1
				ret <2 x half> %vec.result
				}

				; GCN-LABEL: {{^}}v_mad_mixhi_f16_f16lo_f16lo_f16lo_constlo:
				; GFX9: s_waitcnt
				; GFX9-NEXT: v_mov_b32_e32 v3, 0x3c00
				; GFX9-NEXT: v_mad_mixhi_f16 v3, v0, v1, v2
				; GFX9-NEXT: v_mov_b32_e32 v0, v3
				; GFX9-NEXT: s_setpc_b64
				define <2 x half> @v_mad_mixhi_f16_f16lo_f16lo_f16lo_constlo(half %src0, half %src1, half %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				%cvt.result = fptrunc float %result to half
				%vec.result = insertelement <2 x half> <half 1.0, half undef>, half %cvt.result, i32 1
				ret <2 x half> %vec.result
				}

				; GCN-LABEL: {{^}}v_mad_mixhi_f16_f16lo_f16lo_f16lo_reglo:
				; GFX9: s_waitcnt
				; GFX9-NEXT: v_mad_mixhi_f16 v3, v0, v1, v2
				; GFX9-NEXT: v_mov_b32_e32 v0, v3
				; GFX9-NEXT: s_setpc_b64
				define <2 x half> @v_mad_mixhi_f16_f16lo_f16lo_f16lo_reglo(half %src0, half %src1, half %src2, half %lo) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				%cvt.result = fptrunc float %result to half
				%vec = insertelement <2 x half> undef, half %lo, i32 0
				%vec.result = insertelement <2 x half> %vec, half %cvt.result, i32 1
				ret <2 x half> %vec.result
				}

				; FIXME: should be v_lshlrev_b32_e32 v0, 16, v0

				; GCN-LABEL: {{^}}v_mad_mixhi_f16_f16lo_f16lo_f16lo_intpack:
				; GFX9: s_waitcnt
				; GFX9-NEXT: v_mad_mixlo_f16 v0, v0, v1, v2 op_sel_hi:[1,1,1]
				; GFX9-NEXT: v_mov_b32_e32 v1, 16
				; GFX9-NEXT: v_lshlrev_b32_sdwa v0, v1, v0 dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
				; GFX9-NEXT: s_setpc_b64
				define i32 @v_mad_mixhi_f16_f16lo_f16lo_f16lo_intpack(half %src0, half %src1, half %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				%cvt.result = fptrunc float %result to half
				%bc = bitcast half %cvt.result to i16
				%ext = zext i16 %bc to i32
				%shr = shl i32 %ext, 16
				ret i32 %shr
				}

				; FIXME: should be v_lshlrev_b32_e32 v0, 16, v0

				; GCN-LABEL: {{^}}v_mad_mixhi_f16_f16lo_f16lo_f16lo_intpack_sext:
				; GFX9: s_waitcnt
				; GFX9-NEXT: v_mad_mixlo_f16 v0, v0, v1, v2 op_sel_hi:[1,1,1]
				; GFX9-NEXT: v_mov_b32_e32 v1, 16
				; GFX9-NEXT: v_lshlrev_b32_sdwa v0, v1, sext(v0) dst_sel:DWORD dst_unused:UNUSED_PAD src0_sel:DWORD src1_sel:WORD_0
				; GFX9-NEXT: s_setpc_b64
				define i32 @v_mad_mixhi_f16_f16lo_f16lo_f16lo_intpack_sext(half %src0, half %src1, half %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				%cvt.result = fptrunc float %result to half
				%bc = bitcast half %cvt.result to i16
				%ext = sext i16 %bc to i32
				%shr = shl i32 %ext, 16
				ret i32 %shr
				}

				; FIXME: Could use cvt_sdwa?

				; GCN-LABEL: {{^}}v_mad_mixhi_f16_f16lo_f16lo_f16lo_undeflo_clamp_precvt:
				; GCN: s_waitcnt
				; GFX9-NEXT: v_mad_mix_f32 v0, v0, v1, v2 op_sel_hi:[1,1,1] clamp{{$}}
				; GFX9-NEXT: v_cvt_f16_f32_e32 v0, v0
				; GFX9-NEXT: v_and_b32_e32 v1, 0xffff, v0
				; GFX9-NEXT: v_lshl_or_b32 v0, v0, 16, v1
				; GFX9-NEXT: s_setpc_b64
				define <2 x half> @v_mad_mixhi_f16_f16lo_f16lo_f16lo_undeflo_clamp_precvt(half %src0, half %src1, half %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				%max = call float @llvm.maxnum.f32(float %result, float 0.0)
				%clamp = call float @llvm.minnum.f32(float %max, float 1.0)
				%cvt.result = fptrunc float %clamp to half
				%vec.result = insertelement <2 x half> undef, half %cvt.result, i32 1
				ret <2 x half> %vec.result
				}

				; GCN-LABEL: {{^}}v_mad_mixhi_f16_f16lo_f16lo_f16lo_undeflo_clamp_postcvt:
				; GCN: s_waitcnt
				; GFX9-NEXT: v_mad_mixhi_f16 v0, v0, v1, v2 op_sel_hi:[1,1,1] clamp{{$}}
				; GFX9-NEXT: s_setpc_b64
				define <2 x half> @v_mad_mixhi_f16_f16lo_f16lo_f16lo_undeflo_clamp_postcvt(half %src0, half %src1, half %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				%cvt.result = fptrunc float %result to half
				%max = call half @llvm.maxnum.f16(half %cvt.result, half 0.0)
				%clamp = call half @llvm.minnum.f16(half %max, half 1.0)
				%vec.result = insertelement <2 x half> undef, half %clamp, i32 1
				ret <2 x half> %vec.result
				}


				; GCN-LABEL: {{^}}v_mad_mixhi_f16_f16lo_f16lo_f16lo_undeflo_clamp_postcvt_multi_use:
				; GCN: s_waitcnt
				; GFX9-NEXT: v_mad_mixlo_f16 v3, v0, v1, v2 op_sel_hi:[1,1,1]{{$}}
				; GFX9-NEXT: global_store_short v{{\[[0-9]+:[0-9]+\]}}, v3
				; GFX9-NEXT: s_waitcnt vmcnt(0)
				; GFX9-NEXT: v_mad_mixhi_f16 v0, v0, v1, v2 op_sel_hi:[1,1,1] clamp{{$}}
				; GFX9-NEXT: s_setpc_b64
				define <2 x half> @v_mad_mixhi_f16_f16lo_f16lo_f16lo_undeflo_clamp_postcvt_multi_use(half %src0, half %src1, half %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				%cvt.result = fptrunc float %result to half
				store volatile half %cvt.result, half addrspace(1)* undef
				%max = call half @llvm.maxnum.f16(half %cvt.result, half 0.0)
				%clamp = call half @llvm.minnum.f16(half %max, half 1.0)
				%vec.result = insertelement <2 x half> undef, half %clamp, i32 1
				ret <2 x half> %vec.result
				}

				declare half @llvm.minnum.f16(half, half) #1
				declare half @llvm.maxnum.f16(half, half) #1
				declare float @llvm.minnum.f32(float, float) #1
				declare float @llvm.maxnum.f32(float, float) #1
				declare float @llvm.fmuladd.f32(float, float, float) #1
				declare <2 x float> @llvm.fmuladd.v2f32(<2 x float>, <2 x float>, <2 x float>) #1

				attributes #0 = { nounwind "denormal-fp-math-f32"="preserve-sign,preserve-sign" }
				attributes #1 = { nounwind readnone speculatable }
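
As a reading aid for the checks in this file: the tests expect the `fpext` / `fmuladd` / `fptrunc` / `insertelement` chain to collapse into a single instruction that writes one 16-bit half of the destination and leaves the other half alone. The Python sketch below models that behaviour under stated assumptions: inputs are plain floats, the multiply-add is done in double precision and rounded once to half (so double-rounding and denormal-mode details are not modelled), and the function names merely echo the mnemonics rather than any real API.

```python
import struct

def f16_bits(x: float) -> int:
    """Round x to IEEE half precision and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def mad_mixlo_f16(dst: int, a: float, b: float, c: float) -> int:
    # Write the converted result to bits [15:0]; keep bits [31:16] of dst.
    return (dst & 0xFFFF0000) | f16_bits(a * b + c)

def mad_mixhi_f16(dst: int, a: float, b: float, c: float) -> int:
    # Write the converted result to bits [31:16]; keep bits [15:0] of dst.
    return (f16_bits(a * b + c) << 16) | (dst & 0x0000FFFF)

# Shape of the `constlo` test: the low half is preloaded with 0x3c00
# (half 1.0) and mixhi fills in the high half.
packed = mad_mixhi_f16(0x00003C00, 1.5, 2.0, 0.5)
```

Here `packed` holds half 3.5 in the upper 16 bits and the untouched 1.0 in the lower 16 bits, which is why the `constlo` check only needs a `v_mov_b32` of `0x3c00` before the `v_mad_mixhi_f16`.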

llvm/test/CodeGen/AMDGPU/GlobalISel/mad-mix-lo.ll

This file was added.

				; RUN: llc -global-isel -march=amdgcn -mcpu=gfx906 -verify-machineinstrs -enable-misched=false < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX9,GFX906 %s
				; RUN: llc -global-isel -march=amdgcn -mcpu=gfx900 -verify-machineinstrs -enable-misched=false < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX9,GFX900 %s
				; RUN: llc -global-isel -march=amdgcn -mcpu=fiji -verify-machineinstrs -enable-misched=false < %s | FileCheck -enable-var-scope -check-prefixes=GCN,CIVI %s
				; RUN: llc -global-isel -march=amdgcn -mcpu=hawaii -verify-machineinstrs -enable-misched=false < %s | FileCheck -enable-var-scope -check-prefixes=GCN,CIVI,CI %s

				; GCN-LABEL: mixlo_simple:
				; GCN: s_waitcnt
				; GFX900-NEXT: v_mad_mixlo_f16 v0, v0, v1, v2{{$}}
				; GFX906-NEXT: v_fma_mixlo_f16 v0, v0, v1, v2{{$}}
				; GFX9-NEXT: s_setpc_b64

				; CIVI: v_mac_f32_e32
				; CIVI: v_cvt_f16_f32_e32
				define half @mixlo_simple(float %src0, float %src1, float %src2) #0 {
				%result = call float @llvm.fmuladd.f32(float %src0, float %src1, float %src2)
				%cvt.result = fptrunc float %result to half
				ret half %cvt.result
				}

				; GCN-LABEL: {{^}}v_mad_mixlo_f16_f16lo_f16lo_f16lo:
				; GFX900: v_mad_mixlo_f16 v0, v0, v1, v2 op_sel_hi:[1,1,1]{{$}}
				; GFX906: v_fma_mixlo_f16 v0, v0, v1, v2 op_sel_hi:[1,1,1]{{$}}
				; CI: v_mac_f32
				; CIVI: v_cvt_f16_f32
				define half @v_mad_mixlo_f16_f16lo_f16lo_f16lo(half %src0, half %src1, half %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				%cvt.result = fptrunc float %result to half
				ret half %cvt.result
				}

				; GCN-LABEL: {{^}}v_mad_mixlo_f16_f16lo_f16lo_f32:
				; GCN: s_waitcnt
				; GFX900-NEXT: v_mad_mixlo_f16 v0, v0, v1, v2 op_sel_hi:[1,1,0]{{$}}
				; GFX906-NEXT: v_fma_mixlo_f16 v0, v0, v1, v2 op_sel_hi:[1,1,0]{{$}}
				; GFX9-NEXT: s_setpc_b64

				; CIVI: v_mac_f32
				define half @v_mad_mixlo_f16_f16lo_f16lo_f32(half %src0, half %src1, float %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2)
				%cvt.result = fptrunc float %result to half
				ret half %cvt.result
				}

				; GCN-LABEL: {{^}}v_mad_mixlo_f16_f16lo_f16lo_f32_clamp_post_cvt:
				; GCN: s_waitcnt
				; GFX900-NEXT: v_mad_mixlo_f16 v0, v0, v1, v2 op_sel_hi:[1,1,0] clamp{{$}}
				; GFX906-NEXT: v_fma_mixlo_f16 v0, v0, v1, v2 op_sel_hi:[1,1,0] clamp{{$}}
				; GFX9-NEXT: s_setpc_b64

				; CIVI: v_mac_f32_e32 v{{[0-9]}}, v{{[0-9]}}, v{{[0-9]$}}
				define half @v_mad_mixlo_f16_f16lo_f16lo_f32_clamp_post_cvt(half %src0, half %src1, float %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2)
				%cvt.result = fptrunc float %result to half
				%max = call half @llvm.maxnum.f16(half %cvt.result, half 0.0)
				%clamp = call half @llvm.minnum.f16(half %max, half 1.0)
				ret half %clamp
				}

				; GCN-LABEL: {{^}}v_mad_mixlo_f16_f16lo_f16lo_f32_clamp_pre_cvt:
				; GCN: s_waitcnt
				; GFX900-NEXT: v_mad_mix_f32 v0, v0, v1, v2 op_sel_hi:[1,1,0] clamp{{$}}
				; GFX906-NEXT: v_fma_mix_f32 v0, v0, v1, v2 op_sel_hi:[1,1,0] clamp{{$}}
				; GFX9-NEXT: v_cvt_f16_f32_e32 v0, v0
				; GFX9-NEXT: s_setpc_b64

				; FIXME: Should be using v_mad + clamp but v_mac isn't folded in GISel due to
				; different rules in `isCanonicalized`.
				; CIVI: v_mac_f32_e32 v{{[0-9]}}, v{{[0-9]}}, v{{[0-9]}}
				define half @v_mad_mixlo_f16_f16lo_f16lo_f32_clamp_pre_cvt(half %src0, half %src1, float %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2)
				%max = call float @llvm.maxnum.f32(float %result, float 0.0)
				%clamp = call float @llvm.minnum.f32(float %max, float 1.0)
				%cvt.result = fptrunc float %clamp to half
				ret half %cvt.result
				}

				; FIXME: Should be able to avoid extra register because first
				; operation only clobbers relevant lane.
				; GCN-LABEL: {{^}}v_mad_mix_v2f32:
				; GCN: s_waitcnt

				; GFX900-NEXT: v_mad_mixlo_f16 v3, v0, v1, v2 op_sel_hi:[1,1,1]{{$}}
				; GFX900-NEXT: v_mad_mixhi_f16 v3, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1]{{$}}

				; GFX906-NEXT: v_fma_mixlo_f16 v3, v0, v1, v2 op_sel_hi:[1,1,1]{{$}}
				; GFX906-NEXT: v_fma_mixhi_f16 v3, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1]{{$}}

				; GFX9-NEXT: v_mov_b32_e32 v0, v3
				; GFX9-NEXT: s_setpc_b64
				define <2 x half> @v_mad_mix_v2f32(<2 x half> %src0, <2 x half> %src1, <2 x half> %src2) #0 {
				%src0.ext = fpext <2 x half> %src0 to <2 x float>
				%src1.ext = fpext <2 x half> %src1 to <2 x float>
				%src2.ext = fpext <2 x half> %src2 to <2 x float>
				%result = tail call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %src0.ext, <2 x float> %src1.ext, <2 x float> %src2.ext)
				%cvt.result = fptrunc <2 x float> %result to <2 x half>
				ret <2 x half> %cvt.result
				}

				; GCN-LABEL: {{^}}v_mad_mix_v3f32:
				; GCN: s_waitcnt

				; GFX900-NEXT: v_mad_mixlo_f16 v6, v0, v2, v4 op_sel_hi:[1,1,1]
				; GFX900-NEXT: v_mad_mixhi_f16 v6, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1]
				; GFX900-NEXT: v_mad_mixlo_f16 v1, v1, v3, v5 op_sel_hi:[1,1,1]

				; GFX906-NEXT: v_fma_mixlo_f16 v6, v0, v2, v4 op_sel_hi:[1,1,1]
				; GFX906-NEXT: v_fma_mixhi_f16 v6, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1]
				; GFX906-NEXT: v_fma_mixlo_f16 v1, v1, v3, v5 op_sel_hi:[1,1,1]

				; GFX9-NEXT: v_mov_b32_e32 v0, v6
				; GFX9-NEXT: s_setpc_b64
				define <3 x half> @v_mad_mix_v3f32(<3 x half> %src0, <3 x half> %src1, <3 x half> %src2) #0 {
				%src0.ext = fpext <3 x half> %src0 to <3 x float>
				%src1.ext = fpext <3 x half> %src1 to <3 x float>
				%src2.ext = fpext <3 x half> %src2 to <3 x float>
				%result = tail call <3 x float> @llvm.fmuladd.v3f32(<3 x float> %src0.ext, <3 x float> %src1.ext, <3 x float> %src2.ext)
				%cvt.result = fptrunc <3 x float> %result to <3 x half>
				ret <3 x half> %cvt.result
				}

				; GCN-LABEL: {{^}}v_mad_mix_v4f32:
				; GCN: s_waitcnt
				; GFX900-NEXT: v_mad_mixlo_f16 v6, v0, v2, v4 op_sel_hi:[1,1,1]
				; GFX900-NEXT: v_mad_mixlo_f16 v7, v1, v3, v5 op_sel_hi:[1,1,1]
				; GFX900-NEXT: v_mad_mixhi_f16 v6, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1]
				; GFX900-NEXT: v_mad_mixhi_f16 v7, v1, v3, v5 op_sel:[1,1,1] op_sel_hi:[1,1,1]

				; GFX906-NEXT: v_fma_mixlo_f16 v6, v0, v2, v4 op_sel_hi:[1,1,1]
				; GFX906-NEXT: v_fma_mixlo_f16 v7, v1, v3, v5 op_sel_hi:[1,1,1]
				; GFX906-NEXT: v_fma_mixhi_f16 v6, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1]
				; GFX906-NEXT: v_fma_mixhi_f16 v7, v1, v3, v5 op_sel:[1,1,1] op_sel_hi:[1,1,1]

				; GFX9-NEXT: v_mov_b32_e32 v0, v6
				; GFX9-NEXT: v_mov_b32_e32 v1, v7
				; GFX9-NEXT: s_setpc_b64
				define <4 x half> @v_mad_mix_v4f32(<4 x half> %src0, <4 x half> %src1, <4 x half> %src2) #0 {
				%src0.ext = fpext <4 x half> %src0 to <4 x float>
				%src1.ext = fpext <4 x half> %src1 to <4 x float>
				%src2.ext = fpext <4 x half> %src2 to <4 x float>
				%result = tail call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %src0.ext, <4 x float> %src1.ext, <4 x float> %src2.ext)
				%cvt.result = fptrunc <4 x float> %result to <4 x half>
				ret <4 x half> %cvt.result
				}

				; FIXME: Fold clamp
				; GCN-LABEL: {{^}}v_mad_mix_v2f32_clamp_postcvt:
				; GFX900: v_mad_mixlo_f16 v3, v0, v1, v2 op_sel_hi:[1,1,1] clamp{{$}}
				; GFX900-NEXT: v_mad_mixhi_f16 v3, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp{{$}}

				; GFX906: v_fma_mixlo_f16 v3, v0, v1, v2 op_sel_hi:[1,1,1] clamp{{$}}
				; GFX906-NEXT: v_fma_mixhi_f16 v3, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp{{$}}

				; GFX9-NEXT: v_mov_b32_e32 v0, v3
				; GFX9-NEXT: s_setpc_b64
				define <2 x half> @v_mad_mix_v2f32_clamp_postcvt(<2 x half> %src0, <2 x half> %src1, <2 x half> %src2) #0 {
				%src0.ext = fpext <2 x half> %src0 to <2 x float>
				%src1.ext = fpext <2 x half> %src1 to <2 x float>
				%src2.ext = fpext <2 x half> %src2 to <2 x float>
				%result = tail call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %src0.ext, <2 x float> %src1.ext, <2 x float> %src2.ext)
				%cvt.result = fptrunc <2 x float> %result to <2 x half>
				%max = call <2 x half> @llvm.maxnum.v2f16(<2 x half> %cvt.result, <2 x half> zeroinitializer)
				%clamp = call <2 x half> @llvm.minnum.v2f16(<2 x half> %max, <2 x half> <half 1.0, half 1.0>)
				ret <2 x half> %clamp
				}

				; GCN-LABEL: {{^}}v_mad_mix_v3f32_clamp_postcvt:
				; GCN: s_waitcnt
				; GFX900-DAG: v_mad_mixlo_f16 v{{[0-9]+}}, v0, v2, v4 op_sel_hi:[1,1,1] clamp
				; GFX900-DAG: v_mad_mixhi_f16 v{{[0-9]+}}, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
				; GFX900-DAG: v_mad_mixlo_f16 v{{[0-9]+}}, v1, v3, v5 op_sel_hi:[1,1,1] clamp

				; GFX906-DAG: v_fma_mixlo_f16 v{{[0-9]+}}, v0, v2, v4 op_sel_hi:[1,1,1] clamp
				; GFX906-DAG: v_fma_mixhi_f16 v{{[0-9]+}}, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
				; GFX906-DAG: v_fma_mixlo_f16 v{{[0-9]+}}, v1, v3, v5 op_sel_hi:[1,1,1] clamp

				; GFX9: v_mov_b32_e32 v0, v{{[0-9]+}}
				; GFX9-NEXT: s_setpc_b64
				define <3 x half> @v_mad_mix_v3f32_clamp_postcvt(<3 x half> %src0, <3 x half> %src1, <3 x half> %src2) #0 {
				%src0.ext = fpext <3 x half> %src0 to <3 x float>
				%src1.ext = fpext <3 x half> %src1 to <3 x float>
				%src2.ext = fpext <3 x half> %src2 to <3 x float>
				%result = tail call <3 x float> @llvm.fmuladd.v3f32(<3 x float> %src0.ext, <3 x float> %src1.ext, <3 x float> %src2.ext)
				%cvt.result = fptrunc <3 x float> %result to <3 x half>
				%max = call <3 x half> @llvm.maxnum.v3f16(<3 x half> %cvt.result, <3 x half> zeroinitializer)
				%clamp = call <3 x half> @llvm.minnum.v3f16(<3 x half> %max, <3 x half> <half 1.0, half 1.0, half 1.0>)
				ret <3 x half> %clamp
				}

				; GCN-LABEL: {{^}}v_mad_mix_v4f32_clamp_postcvt:
				; GCN: s_waitcnt
				; GFX900-NEXT: v_mad_mixlo_f16 v6, v0, v2, v4 op_sel_hi:[1,1,1] clamp
				; GFX900-NEXT: v_mad_mixlo_f16 v7, v1, v3, v5 op_sel_hi:[1,1,1] clamp
				; GFX900-NEXT: v_mad_mixhi_f16 v6, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
				; GFX900-NEXT: v_mad_mixhi_f16 v7, v1, v3, v5 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp


				; GFX906-NEXT: v_fma_mixlo_f16 v6, v0, v2, v4 op_sel_hi:[1,1,1] clamp
				; GFX906-NEXT: v_fma_mixlo_f16 v7, v1, v3, v5 op_sel_hi:[1,1,1] clamp
				; GFX906-NEXT: v_fma_mixhi_f16 v6, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
				; GFX906-NEXT: v_fma_mixhi_f16 v7, v1, v3, v5 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp


				; GFX9-NEXT: v_mov_b32_e32 v0, v6
				; GFX9-NEXT: v_mov_b32_e32 v1, v7
				; GFX9-NEXT: s_setpc_b64
				define <4 x half> @v_mad_mix_v4f32_clamp_postcvt(<4 x half> %src0, <4 x half> %src1, <4 x half> %src2) #0 {
				%src0.ext = fpext <4 x half> %src0 to <4 x float>
				%src1.ext = fpext <4 x half> %src1 to <4 x float>
				%src2.ext = fpext <4 x half> %src2 to <4 x float>
				%result = tail call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %src0.ext, <4 x float> %src1.ext, <4 x float> %src2.ext)
				%cvt.result = fptrunc <4 x float> %result to <4 x half>
				%max = call <4 x half> @llvm.maxnum.v4f16(<4 x half> %cvt.result, <4 x half> zeroinitializer)
				%clamp = call <4 x half> @llvm.minnum.v4f16(<4 x half> %max, <4 x half> <half 1.0, half 1.0, half 1.0, half 1.0>)
				ret <4 x half> %clamp
				}

				; GCN-LABEL: {{^}}v_mad_mix_v2f32_clamp_postcvt_lo:
				; GCN: s_waitcnt
				; GFX900-NEXT: v_mad_mixlo_f16 v3, v0, v1, v2 op_sel_hi:[1,1,1] clamp
				; GFX900-NEXT: v_mad_mixhi_f16 v3, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1]

				; GFX906-NEXT: v_fma_mixlo_f16 v3, v0, v1, v2 op_sel_hi:[1,1,1] clamp
				; GFX906-NEXT: v_fma_mixhi_f16 v3, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1]

				; GFX9-NOT: v_mov_b32_sdwa
				; GFX9-NEXT: v_mov_b32_e32 v0, v3
				; GFX9-NEXT: s_setpc_b64
				define <2 x half> @v_mad_mix_v2f32_clamp_postcvt_lo(<2 x half> %src0, <2 x half> %src1, <2 x half> %src2) #0 {
				%src0.ext = fpext <2 x half> %src0 to <2 x float>
				%src1.ext = fpext <2 x half> %src1 to <2 x float>
				%src2.ext = fpext <2 x half> %src2 to <2 x float>
				%result = tail call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %src0.ext, <2 x float> %src1.ext, <2 x float> %src2.ext)
				%cvt.result = fptrunc <2 x float> %result to <2 x half>
				%cvt.lo = extractelement <2 x half> %cvt.result, i32 0
				%max.lo = call half @llvm.maxnum.f16(half %cvt.lo, half 0.0)
				%clamp.lo = call half @llvm.minnum.f16(half %max.lo, half 1.0)
				%insert = insertelement <2 x half> %cvt.result, half %clamp.lo, i32 0
				ret <2 x half> %insert
				}

				; GCN-LABEL: {{^}}v_mad_mix_v2f32_clamp_postcvt_hi:
				; GCN: s_waitcnt
				; GFX900-NEXT: v_mad_mixlo_f16 v3, v0, v1, v2 op_sel_hi:[1,1,1]
				; GFX900-NEXT: v_mad_mixhi_f16 v3, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp

				; GFX906-NEXT: v_fma_mixlo_f16 v3, v0, v1, v2 op_sel_hi:[1,1,1]
				; GFX906-NEXT: v_fma_mixhi_f16 v3, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp

				; GFX9-NEXT: v_mov_b32_e32 v0, v3
				; GFX9-NEXT: s_setpc_b64
				define <2 x half> @v_mad_mix_v2f32_clamp_postcvt_hi(<2 x half> %src0, <2 x half> %src1, <2 x half> %src2) #0 {
				%src0.ext = fpext <2 x half> %src0 to <2 x float>
				%src1.ext = fpext <2 x half> %src1 to <2 x float>
				%src2.ext = fpext <2 x half> %src2 to <2 x float>
				%result = tail call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %src0.ext, <2 x float> %src1.ext, <2 x float> %src2.ext)
				%cvt.result = fptrunc <2 x float> %result to <2 x half>
				%cvt.hi = extractelement <2 x half> %cvt.result, i32 1
				%max.hi = call half @llvm.maxnum.f16(half %cvt.hi, half 0.0)
				%clamp.hi = call half @llvm.minnum.f16(half %max.hi, half 1.0)
				%insert = insertelement <2 x half> %cvt.result, half %clamp.hi, i32 1
				ret <2 x half> %insert
				}

				; FIXME: Should be able to use mixlo/mixhi
				; GCN-LABEL: {{^}}v_mad_mix_v2f32_clamp_precvt:
				; GFX900: v_mad_mix_f32 v3, v0, v1, v2 op_sel_hi:[1,1,1] clamp
				; GFX900-NEXT: v_mad_mix_f32 v0, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp

				; GFX906: v_fma_mix_f32 v3, v0, v1, v2 op_sel_hi:[1,1,1] clamp
				; GFX906-NEXT: v_fma_mix_f32 v0, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp

				; GFX9: v_cvt_f16_f32_e32 v1, v3
				; GFX9: v_cvt_f16_f32_e32 v0, v0
				; GFX9: v_pack_b32_f16 v0, v1, v0
				; GFX9: s_setpc_b64
				define <2 x half> @v_mad_mix_v2f32_clamp_precvt(<2 x half> %src0, <2 x half> %src1, <2 x half> %src2) #0 {
				%src0.ext = fpext <2 x half> %src0 to <2 x float>
				%src1.ext = fpext <2 x half> %src1 to <2 x float>
				%src2.ext = fpext <2 x half> %src2 to <2 x float>
				%result = tail call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %src0.ext, <2 x float> %src1.ext, <2 x float> %src2.ext)
				%max = call <2 x float> @llvm.maxnum.v2f32(<2 x float> %result, <2 x float> zeroinitializer)
				%clamp = call <2 x float> @llvm.minnum.v2f32(<2 x float> %max, <2 x float> <float 1.0, float 1.0>)
				%cvt.result = fptrunc <2 x float> %clamp to <2 x half>
				ret <2 x half> %cvt.result
				}

				; FIXME: Handling undef 4th component
				; GCN-LABEL: {{^}}v_mad_mix_v3f32_clamp_precvt:
				; GCN: s_waitcnt
				; GFX900: v_mad_mix_f32 v6, v0, v2, v4 op_sel_hi:[1,1,1] clamp
				; GFX900-NEXT: v_mad_mix_f32 v0, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
				; GFX900-NEXT: v_mad_mix_f32 v1, v1, v3, v5 op_sel_hi:[1,1,1] clamp

				; GFX906: v_fma_mix_f32 v6, v0, v2, v4 op_sel_hi:[1,1,1] clamp
				; GFX906-NEXT: v_fma_mix_f32 v0, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
				; GFX906-NEXT: v_fma_mix_f32 v1, v1, v3, v5 op_sel_hi:[1,1,1] clamp

				; GFX9-NEXT: v_cvt_f16_f32_e32 v2, v6
				; GFX9-NEXT: v_cvt_f16_f32_e32 v0, v0
				; GFX9-NEXT: v_cvt_f16_f32_e32 v1, v1
				; GFX9-NEXT: v_pack_b32_f16 v0, v2, v0
				; GFX9-NEXT: s_setpc_b64
				define <3 x half> @v_mad_mix_v3f32_clamp_precvt(<3 x half> %src0, <3 x half> %src1, <3 x half> %src2) #0 {
				%src0.ext = fpext <3 x half> %src0 to <3 x float>
				%src1.ext = fpext <3 x half> %src1 to <3 x float>
				%src2.ext = fpext <3 x half> %src2 to <3 x float>
				%result = tail call <3 x float> @llvm.fmuladd.v3f32(<3 x float> %src0.ext, <3 x float> %src1.ext, <3 x float> %src2.ext)
				%max = call <3 x float> @llvm.maxnum.v3f32(<3 x float> %result, <3 x float> zeroinitializer)
				%clamp = call <3 x float> @llvm.minnum.v3f32(<3 x float> %max, <3 x float> <float 1.0, float 1.0, float 1.0>)
				%cvt.result = fptrunc <3 x float> %clamp to <3 x half>
				ret <3 x half> %cvt.result
				}

				; GCN-LABEL: {{^}}v_mad_mix_v4f32_clamp_precvt:
				; GFX900: v_mad_mix_f32 v6, v0, v2, v4 op_sel_hi:[1,1,1] clamp
				; GFX900: v_mad_mix_f32 v0, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
				; GFX900: v_mad_mix_f32 v2, v1, v3, v5 op_sel_hi:[1,1,1] clamp
				; GFX900: v_mad_mix_f32 v1, v1, v3, v5 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp

				; GFX906: v_fma_mix_f32 v6, v0, v2, v4 op_sel_hi:[1,1,1] clamp
				; GFX906: v_fma_mix_f32 v0, v0, v2, v4 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp
				; GFX906: v_fma_mix_f32 v2, v1, v3, v5 op_sel_hi:[1,1,1] clamp
				; GFX906: v_fma_mix_f32 v1, v1, v3, v5 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp

				; GFX9: v_cvt_f16_f32
				; GFX9: v_cvt_f16_f32
				; GFX9: v_cvt_f16_f32
				; GFX9: v_cvt_f16_f32
				define <4 x half> @v_mad_mix_v4f32_clamp_precvt(<4 x half> %src0, <4 x half> %src1, <4 x half> %src2) #0 {
				%src0.ext = fpext <4 x half> %src0 to <4 x float>
				%src1.ext = fpext <4 x half> %src1 to <4 x float>
				%src2.ext = fpext <4 x half> %src2 to <4 x float>
				%result = tail call <4 x float> @llvm.fmuladd.v4f32(<4 x float> %src0.ext, <4 x float> %src1.ext, <4 x float> %src2.ext)
				%max = call <4 x float> @llvm.maxnum.v4f32(<4 x float> %result, <4 x float> zeroinitializer)
				%clamp = call <4 x float> @llvm.minnum.v4f32(<4 x float> %max, <4 x float> <float 1.0, float 1.0, float 1.0, float 1.0>)
				%cvt.result = fptrunc <4 x float> %clamp to <4 x half>
				ret <4 x half> %cvt.result
				}

				declare half @llvm.minnum.f16(half, half) #1
				declare <2 x half> @llvm.minnum.v2f16(<2 x half>, <2 x half>) #1
				declare <3 x half> @llvm.minnum.v3f16(<3 x half>, <3 x half>) #1
				declare <4 x half> @llvm.minnum.v4f16(<4 x half>, <4 x half>) #1

				declare half @llvm.maxnum.f16(half, half) #1
				declare <2 x half> @llvm.maxnum.v2f16(<2 x half>, <2 x half>) #1
				declare <3 x half> @llvm.maxnum.v3f16(<3 x half>, <3 x half>) #1
				declare <4 x half> @llvm.maxnum.v4f16(<4 x half>, <4 x half>) #1

				declare float @llvm.minnum.f32(float, float) #1
				declare <2 x float> @llvm.minnum.v2f32(<2 x float>, <2 x float>) #1
				declare <3 x float> @llvm.minnum.v3f32(<3 x float>, <3 x float>) #1
				declare <4 x float> @llvm.minnum.v4f32(<4 x float>, <4 x float>) #1

				declare float @llvm.maxnum.f32(float, float) #1
				declare <2 x float> @llvm.maxnum.v2f32(<2 x float>, <2 x float>) #1
				declare <3 x float> @llvm.maxnum.v3f32(<3 x float>, <3 x float>) #1
				declare <4 x float> @llvm.maxnum.v4f32(<4 x float>, <4 x float>) #1

				declare float @llvm.fmuladd.f32(float, float, float) #1
				declare <2 x float> @llvm.fmuladd.v2f32(<2 x float>, <2 x float>, <2 x float>) #1
				declare <3 x float> @llvm.fmuladd.v3f32(<3 x float>, <3 x float>, <3 x float>) #1
				declare <4 x float> @llvm.fmuladd.v4f32(<4 x float>, <4 x float>, <4 x float>) #1

				attributes #0 = { nounwind "denormal-fp-math-f32"="preserve-sign,preserve-sign" }
				attributes #1 = { nounwind readnone speculatable }

llvm/test/CodeGen/AMDGPU/GlobalISel/mad-mix.ll

This file was added.

				; RUN: llc -global-isel -march=amdgcn -mcpu=gfx900 -verify-machineinstrs -show-mc-encoding < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX900,GFX9 %s
				; RUN: llc -global-isel -march=amdgcn -mcpu=gfx906 -verify-machineinstrs -show-mc-encoding < %s | FileCheck -enable-var-scope -check-prefixes=GCN,GFX906,GFX9 %s
				; RUN: llc -global-isel -march=amdgcn -mcpu=fiji -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,CIVI,VI %s
				; RUN: llc -global-isel -march=amdgcn -mcpu=hawaii -verify-machineinstrs < %s | FileCheck -enable-var-scope -check-prefixes=GCN,CIVI,CI %s

				arsenm: This is a copy-pasted version of the existing test. I'd assume they can share the same run lines in the same file.
				Pierre-vh (author): Some run lines are different, unfortunately; see the FIXMEs in the test.
				arsenm: I'd work to eliminate those differences. If that turns out to be difficult, I would probably switch to generated checks and have them share the same file.
				Pierre-vh (author): The remaining differences fall into the following categories:
				- isCanonicalized behaves differently between GISel and the DAG, so there's v_mac instead of v_mad in a few places because SIFoldOperands doesn't fold. v_madak_f32 is also no longer present; I think it's the same cause, but I haven't looked into it yet.
				- op_sel is on the second v_mad_mix instead of the first. This seems like a harmless difference due to how DAG/GISel work?
				- A G_LSHR + G_SHUFFLEVECTOR (1,0) pair isn't folded out. I think those operations negate each other; perhaps a combine should be added for that?
				- Some unfinished things, like -|v0| not being picked up yet.
				The last 2 definitely have to be fixed and I'll look into them ASAP, but are the first 2 important as well? I'm not sure what to do with `isCanonicalized`; is there a place where I can find the list of operations that should go in there?
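The claim that the shift and the shuffle cancel out can be checked with a small bit-level model. The following is only a sketch, assuming a <2 x s16> value is packed into one 32-bit register with element 0 in the low half; the helper names (`packV2S16`, `shuffle10`, `lshr16Trunc`, `extractElt0`) are hypothetical and are not LLVM APIs.

```cpp
#include <cstdint>

// Models a <2 x s16> value packed into one 32-bit register (assumption:
// element 0 lives in bits [15:0], element 1 in bits [31:16]).
constexpr uint32_t packV2S16(uint16_t Elt0, uint16_t Elt1) {
  return static_cast<uint32_t>(Elt0) | (static_cast<uint32_t>(Elt1) << 16);
}

// Models the shuffle with mask (1, 0): swap the two 16-bit elements.
constexpr uint32_t shuffle10(uint32_t V) { return (V >> 16) | (V << 16); }

// Models G_LSHR by 16 followed by G_TRUNC to s16: read the high element.
constexpr uint16_t lshr16Trunc(uint32_t V) {
  return static_cast<uint16_t>(V >> 16);
}

// Hypothetical result of the proposed combine: read element 0 directly.
constexpr uint16_t extractElt0(uint32_t V) { return static_cast<uint16_t>(V); }

// Compile-time check on one sample value.
static_assert(lshr16Trunc(shuffle10(packV2S16(0x1234, 0xABCD))) ==
                  extractElt0(packV2S16(0x1234, 0xABCD)),
              "shuffle (1,0) followed by lshr 16 reads the original low half");
```

Under this model, the pair does not vanish entirely: it reduces to a plain read of the original low element, which is the kind of operand a mix instruction could select without `op_sel`.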
				Pierre-vh (author): For `-|v0|`, the gMIR looks like this:
				  bb.1 (%ir-block.0):
				    liveins: $vgpr0, $vgpr1, $vgpr2
				    %0:_(s32) = COPY $vgpr0
				    %3:_(s32) = COPY $vgpr1
				    %1:_(s16) = G_TRUNC %3:_(s32)
				    %4:_(s32) = COPY $vgpr2
				    %2:_(s16) = G_TRUNC %4:_(s32)
				    %8:_(s16) = G_FCONSTANT half 0xH8000
				    %7:_(<2 x s16>) = G_BUILD_VECTOR %8:_(s16), %8:_(s16)
				    %5:_(<2 x s16>) = G_BITCAST %0:_(s32)
				    %6:_(<2 x s16>) = G_FABS %5:_
				    %18:_(<2 x s16>) = G_FNEG %6:_
				    %9:_(<2 x s16>) = G_FADD %7:_, %18:_
				    %19:_(s32) = G_BITCAST %9:_(<2 x s16>)
				    %20:_(s32) = G_CONSTANT i32 16
				    %21:_(s32) = G_LSHR %19:_, %20:_(s32)
				    %17:_(s16) = G_TRUNC %21:_(s32)
				    %12:_(s32) = G_FPEXT %17:_(s16)
				    %13:_(s32) = G_FPEXT %1:_(s16)
				    %14:_(s32) = G_FPEXT %2:_(s16)
				    %15:_(s32) = G_FMA %12:_, %13:_, %14:_
				    $vgpr0 = COPY %15:_(s32)
				    SI_RETURN implicit $vgpr0
				Could we add a combine to fold `G_FADD (+-)0.0, x` into just `x`? If we add that and another one to fold `G_LSHR + G_SHUFFLEVECTOR (1,0)`, it should address most of the remaining differences.
				arsenm: fadd 0 can only be folded out if you don't care about signed zeros and don't care about canonicalizing (I think the existing DAG combine fails to consider this second point). Which test is this in? I don't see how this would have ever been able to fold into an fmax_mix operand. We should have the shift + shuffle combine. isCanonicalized is essentially a list of opcodes with floating point semantics.
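The signed-zero restriction can be illustrated with ordinary IEEE-754 arithmetic. This is a minimal sketch, assuming default round-to-nearest and no fast-math flags; the function names are hypothetical. Note that the constant in the gMIR above, `0xH8000`, is -0.0 in half precision.

```cpp
#include <cmath>

// x + (+0.0) is not an identity: for x = -0.0 the sum is +0.0 under
// round-to-nearest, so the sign of zero is lost.
inline bool addPosZeroKeepsSign(float X) {
  volatile float Zero = 0.0f; // volatile keeps the add from being folded away
  float Sum = X + Zero;
  return std::signbit(Sum) == std::signbit(X);
}

// x + (-0.0) keeps the sign of a zero input, since (-0.0) + (-0.0) = -0.0
// and (+0.0) + (-0.0) = +0.0.
inline bool addNegZeroKeepsSign(float X) {
  volatile float NegZero = -0.0f;
  float Sum = X + NegZero;
  return std::signbit(Sum) == std::signbit(X);
}
```

Even where the sign survives, the add is not a pure copy. As the comment above points out, dropping it also drops any canonicalization the instruction would have performed, which this sketch does not model.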
				Pierre-vh (author): The gMIR was from `v_mad_mix_f32_preextractfabsfneg_f16hi_f16lo_f16lo`. I'll remove the comment saying it should be `-|v0|` then.
				For isCanonicalized, do I just need to add opcodes like in the DAG version? E.g.:
				  switch (Opcode) {
				  case AMDGPU::G_FADD:
				  case AMDGPU::G_FSUB:
				  case AMDGPU::G_FMUL:
				  case AMDGPU::G_FMA:
				  case AMDGPU::G_FMAD:
				  case AMDGPU::G_FDIV:
				  case AMDGPU::G_FREM:
				  case AMDGPU::G_FPOW:
				  case AMDGPU::G_FPEXT:
				  case AMDGPU::G_FPTRUNC:
				    return true;
				  case AMDGPU::G_FNEG:
				  case AMDGPU::G_FMINNUM_IEEE:
				  case AMDGPU::G_FMAXNUM_IEEE:
				If yes, I get this result in one of the tests:
				  v_mad_mixlo_f16_f16lo_f16lo_f32_clamp_pre_cvt: ; @v_mad_mixlo_f16_f16lo_f16lo_f32_clamp_pre_cvt
				  ; %bb.0:
				    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
				    v_cvt_f32_f16_e32 v0, v0
				    v_cvt_f32_f16_e32 v1, v1
				    v_mac_f32_e32 v2, v0, v1
				    v_med3_f32 v0, v2, 0, 1.0
				    v_cvt_f16_f32_e32 v0, v0
				    s_setpc_b64 s[30:31]
				Compared to this for the DAG:
				  ; v_mad_f32 v0, v0, v1, v2 clamp
				  ; v_cvt_f16_f32_e32 v0, v0
				  ; v_cvt_f32_f16_e32 v0, v0
				Which is a really big difference.
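In the GISel output above, the clamp survives as a separate `v_med3_f32 v0, v2, 0, 1.0`, whereas the DAG output carries it as the `clamp` modifier on `v_mad_f32`. The sketch below shows that, for non-NaN inputs, a median of three with 0 and 1 computes the same value as the `maxnum`/`minnum` pattern in the test IR. It is an illustration only: `med3` and `clamp01` are hypothetical helpers, and NaN handling is not modeled.

```cpp
#include <algorithm>

// Median of three values, assuming none of them is NaN.
inline float med3(float A, float B, float C) {
  return std::max(std::min(A, B), std::min(std::max(A, B), C));
}

// The clamp pattern from the test IR: minnum(maxnum(x, 0.0), 1.0),
// again ignoring NaN inputs.
inline float clamp01(float X) { return std::min(std::max(X, 0.0f), 1.0f); }
```

For example, `med3(2.0f, 0.0f, 1.0f)` and `clamp01(2.0f)` both give 1.0, and both give 0.0 for an input of -3.0.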
				; GCN-LABEL: {{^}}v_mad_mix_f32_f16lo_f16lo_f16lo:
				; GFX900: v_mad_mix_f32 v0, v0, v1, v2 op_sel_hi:[1,1,1] ; encoding: [0x00,0x40,0xa0,0xd3,0x00,0x03,0x0a,0x1c]
				; GFX906: v_fma_mix_f32 v0, v0, v1, v2 op_sel_hi:[1,1,1] ; encoding: [0x00,0x40,0xa0,0xd3,0x00,0x03,0x0a,0x1c]
				; VI: v_mac_f32

				; FIXME: Should be v_mad?
				; CI: v_mac_f32
				define float @v_mad_mix_f32_f16lo_f16lo_f16lo(half %src0, half %src1, half %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_f16hi_f16hi_f16hi_int:
				; GFX900: v_mad_mix_f32 v0, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1] ; encoding
				; GFX906: v_fma_mix_f32 v0, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1] ; encoding
				; CIVI: v_mac_f32
				define float @v_mad_mix_f32_f16hi_f16hi_f16hi_int(i32 %src0, i32 %src1, i32 %src2) #0 {
				%src0.hi = lshr i32 %src0, 16
				%src1.hi = lshr i32 %src1, 16
				%src2.hi = lshr i32 %src2, 16
				%src0.i16 = trunc i32 %src0.hi to i16
				%src1.i16 = trunc i32 %src1.hi to i16
				%src2.i16 = trunc i32 %src2.hi to i16
				%src0.fp16 = bitcast i16 %src0.i16 to half
				%src1.fp16 = bitcast i16 %src1.i16 to half
				%src2.fp16 = bitcast i16 %src2.i16 to half
				%src0.ext = fpext half %src0.fp16 to float
				%src1.ext = fpext half %src1.fp16 to float
				%src2.ext = fpext half %src2.fp16 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_f16hi_f16hi_f16hi_elt:
				; GFX900: v_mad_mix_f32 v0, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1] ; encoding
				; GFX906: v_fma_mix_f32 v0, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1] ; encoding
				; VI: v_mac_f32

				; FIXME: Should be v_mad?
				; CI: v_mac_f32
				define float @v_mad_mix_f32_f16hi_f16hi_f16hi_elt(<2 x half> %src0, <2 x half> %src1, <2 x half> %src2) #0 {
				%src0.hi = extractelement <2 x half> %src0, i32 1
				%src1.hi = extractelement <2 x half> %src1, i32 1
				%src2.hi = extractelement <2 x half> %src2, i32 1
				%src0.ext = fpext half %src0.hi to float
				%src1.ext = fpext half %src1.hi to float
				%src2.ext = fpext half %src2.hi to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_v2f32:
				; GFX900: v_mad_mix_f32 v3, v0, v1, v2 op_sel_hi:[1,1,1]
				; GFX900-NEXT: v_mad_mix_f32 v1, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1]
				; GFX900-NEXT: v_mov_b32_e32 v0, v3

				; GFX906: v_fma_mix_f32 v3, v0, v1, v2 op_sel_hi:[1,1,1]
				; GFX906-NEXT: v_fma_mix_f32 v1, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1]
				; GFX906-NEXT: v_mov_b32_e32 v0, v3

				; CIVI: v_mac_f32
				define <2 x float> @v_mad_mix_v2f32(<2 x half> %src0, <2 x half> %src1, <2 x half> %src2) #0 {
				%src0.ext = fpext <2 x half> %src0 to <2 x float>
				%src1.ext = fpext <2 x half> %src1 to <2 x float>
				%src2.ext = fpext <2 x half> %src2 to <2 x float>
				%result = tail call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %src0.ext, <2 x float> %src1.ext, <2 x float> %src2.ext)
				ret <2 x float> %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_v2f32_shuffle:
				; GCN: s_waitcnt
				; GFX900: v_mad_mix_f32 v3, v0, v1, v2 op_sel:[1,0,1] op_sel_hi:[1,1,1]
				; GFX900-NEXT: v_mad_mix_f32 v1, v0, v1, v2 op_sel:[0,1,0] op_sel_hi:[1,1,1]
				; GFX900-NEXT: v_mov_b32_e32 v0, v3
				; GFX900-NEXT: s_setpc_b64

				; GFX906: v_fma_mix_f32 v3, v0, v1, v2 op_sel:[1,0,1] op_sel_hi:[1,1,1]
				; GFX906-NEXT: v_fma_mix_f32 v1, v0, v1, v2 op_sel:[0,1,0] op_sel_hi:[1,1,1]
				; GFX906-NEXT: v_mov_b32_e32 v0, v3
				; GFX906-NEXT: s_setpc_b64

				; CIVI: v_mac_f32
				define <2 x float> @v_mad_mix_v2f32_shuffle(<2 x half> %src0, <2 x half> %src1, <2 x half> %src2) #0 {
				%src0.shuf = shufflevector <2 x half> %src0, <2 x half> undef, <2 x i32> <i32 1, i32 0>
				%src1.shuf = shufflevector <2 x half> %src1, <2 x half> undef, <2 x i32> <i32 0, i32 1>
				%src2.shuf = shufflevector <2 x half> %src2, <2 x half> undef, <2 x i32> <i32 1, i32 1>
				%src0.ext = fpext <2 x half> %src0.shuf to <2 x float>
				%src1.ext = fpext <2 x half> %src1.shuf to <2 x float>
				%src2.ext = fpext <2 x half> %src2.shuf to <2 x float>
				%result = tail call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %src0.ext, <2 x float> %src1.ext, <2 x float> %src2.ext)
				ret <2 x float> %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_negf16lo_f16lo_f16lo:
				; GFX900: s_waitcnt
				; GFX900-NEXT: v_mad_mix_f32 v0, -v0, v1, v2 op_sel_hi:[1,1,1] ; encoding
				; GFX900-NEXT: s_setpc_b64

				; GFX906: s_waitcnt
				; GFX906-NEXT: v_fma_mix_f32 v0, -v0, v1, v2 op_sel_hi:[1,1,1] ; encoding
				; GFX906-NEXT: s_setpc_b64

				; FIXME: Should be using v_mad
				; CIVI: v_mac_f32_e32
				define float @v_mad_mix_f32_negf16lo_f16lo_f16lo(half %src0, half %src1, half %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%src0.ext.neg = fneg float %src0.ext
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext.neg, float %src1.ext, float %src2.ext)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_absf16lo_f16lo_f16lo:
				; GFX900: v_mad_mix_f32 v0, |v0|, v1, v2 op_sel_hi:[1,1,1]
				; GFX906: v_fma_mix_f32 v0, |v0|, v1, v2 op_sel_hi:[1,1,1]

				; CIVI: v_mad_f32
				define float @v_mad_mix_f32_absf16lo_f16lo_f16lo(half %src0, half %src1, half %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%src0.ext.abs = call float @llvm.fabs.f32(float %src0.ext)
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext.abs, float %src1.ext, float %src2.ext)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_negabsf16lo_f16lo_f16lo:
				; GFX900: s_waitcnt
				; GFX900-NEXT: v_mad_mix_f32 v0, -|v0|, v1, v2 op_sel_hi:[1,1,1]
				; GFX900-NEXT: s_setpc_b64

				; GFX906: s_waitcnt
				; GFX906-NEXT: v_fma_mix_f32 v0, -|v0|, v1, v2 op_sel_hi:[1,1,1]
				; GFX906-NEXT: s_setpc_b64

				; CIVI: v_mad_f32
				define float @v_mad_mix_f32_negabsf16lo_f16lo_f16lo(half %src0, half %src1, half %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%src0.ext.abs = call float @llvm.fabs.f32(float %src0.ext)
				%src0.ext.neg.abs = fneg float %src0.ext.abs
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext.neg.abs, float %src1.ext, float %src2.ext)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_f16lo_f16lo_f32:
				; GCN: s_waitcnt
				; GFX900-NEXT: v_mad_mix_f32 v0, v0, v1, v2 op_sel_hi:[1,1,0] ; encoding
				; GFX906-NEXT: v_fma_mix_f32 v0, v0, v1, v2 op_sel_hi:[1,1,0] ; encoding
				; GFX9-NEXT: s_setpc_b64

				; CIVI: v_mad_f32
				define float @v_mad_mix_f32_f16lo_f16lo_f32(half %src0, half %src1, float %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_f16lo_f16lo_negf32:
				; GCN: s_waitcnt
				; GFX900-NEXT: v_mad_mix_f32 v0, v0, v1, -v2 op_sel_hi:[1,1,0] ; encoding
				; GFX906-NEXT: v_fma_mix_f32 v0, v0, v1, -v2 op_sel_hi:[1,1,0] ; encoding
				; GFX9-NEXT: s_setpc_b64

				; CIVI: v_mad_f32
				define float @v_mad_mix_f32_f16lo_f16lo_negf32(half %src0, half %src1, float %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.neg = fneg float %src2
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.neg)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_f16lo_f16lo_absf32:
				; GCN: s_waitcnt
				; GFX900-NEXT: v_mad_mix_f32 v0, v0, v1, |v2| op_sel_hi:[1,1,0] ; encoding
				; GFX906-NEXT: v_fma_mix_f32 v0, v0, v1, |v2| op_sel_hi:[1,1,0] ; encoding
				; GFX9-NEXT: s_setpc_b64

				; CIVI: v_mad_f32
				define float @v_mad_mix_f32_f16lo_f16lo_absf32(half %src0, half %src1, float %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.abs = call float @llvm.fabs.f32(float %src2)
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.abs)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_f16lo_f16lo_negabsf32:
				; GCN: s_waitcnt
				; GFX900-NEXT: v_mad_mix_f32 v0, v0, v1, -|v2| op_sel_hi:[1,1,0] ; encoding
				; GFX906-NEXT: v_fma_mix_f32 v0, v0, v1, -|v2| op_sel_hi:[1,1,0] ; encoding
				; GFX9-NEXT: s_setpc_b64

				; CIVI: v_mad_f32
				define float @v_mad_mix_f32_f16lo_f16lo_negabsf32(half %src0, half %src1, float %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.abs = call float @llvm.fabs.f32(float %src2)
				%src2.neg.abs = fneg float %src2.abs
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.neg.abs)
				ret float %result
				}

				; TODO: Fold inline immediates. Need to be careful because it is an
				; f16 inline immediate that may be converted to f32, not an actual f32
				; inline immediate.

				; GCN-LABEL: {{^}}v_mad_mix_f32_f16lo_f16lo_f32imm1:
				; GCN: s_waitcnt
				; GFX9: v_mov_b32_e32 [[VREG:v[0-9]+]], 1.0
				; GFX900-NEXT: v_mad_mix_f32 v0, v0, v1, [[VREG]] op_sel_hi:[1,1,0] ; encoding
				; GFX906-NEXT: v_fma_mix_f32 v0, v0, v1, [[VREG]] op_sel_hi:[1,1,0] ; encoding

				; CIVI: v_mad_f32 v0, v0, v1, 1.0
				; GCN-NEXT: s_setpc_b64
				define float @v_mad_mix_f32_f16lo_f16lo_f32imm1(half %src0, half %src1) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float 1.0)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_f16lo_f16lo_f32imminv2pi:
				; GCN: s_waitcnt
				; GFX9: v_mov_b32_e32 [[VREG:v[0-9]+]], 0.15915494
				; GFX900: v_mad_mix_f32 v0, v0, v1, [[VREG]] op_sel_hi:[1,1,0] ; encoding
				; GFX906: v_fma_mix_f32 v0, v0, v1, [[VREG]] op_sel_hi:[1,1,0] ; encoding
				; VI: v_mad_f32 v0, v0, v1, 0.15915494
				define float @v_mad_mix_f32_f16lo_f16lo_f32imminv2pi(half %src0, half %src1) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float 0x3FC45F3060000000)
				ret float %result
				}

				; Attempt to break inline immediate folding. If the operand is
				; interpreted as f32, the inline immediate is really the f16 inline
				; imm value converted to f32.
				; fpext f16 1/2pi = 0x3e230000
				; f32 1/2pi = 0x3e22f983
				; GCN-LABEL: {{^}}v_mad_mix_f32_f16lo_f16lo_cvtf16imminv2pi:
				; GFX9: v_mov_b32_e32 [[VREG:v[0-9]+]], 0x3e230000
				; GFX900: v_mad_mix_f32 v0, v0, v1, [[VREG]] op_sel_hi:[1,1,0] ; encoding
				; GFX906: v_fma_mix_f32 v0, v0, v1, [[VREG]] op_sel_hi:[1,1,0] ; encoding

				; FIXME: Should be using v_madak_f32?
				; CIVI: v_mov_b32_e32 v0, 0x3e230000
				; CIVI-NEXT: v_mac_f32_e32 v0, v2, v1
				define float @v_mad_mix_f32_f16lo_f16lo_cvtf16imminv2pi(half %src0, half %src1) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2 = fpext half 0xH3118 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_f16lo_f16lo_cvtf16imm63:
				; GFX9: v_mov_b32_e32 [[VREG:v[0-9]+]], 0x367c0000
				; GFX900: v_mad_mix_f32 v0, v0, v1, [[VREG]] op_sel_hi:[1,1,0] ; encoding
				; GFX906: v_fma_mix_f32 v0, v0, v1, [[VREG]] op_sel_hi:[1,1,0] ; encoding

				; FIXME: Should be using v_madak_f32
				; CIVI: v_mov_b32_e32 v0, 0x367c0000
				; CIVI-NEXT: v_mac_f32_e32 v0, v2, v1
				define float @v_mad_mix_f32_f16lo_f16lo_cvtf16imm63(half %src0, half %src1) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2 = fpext half 0xH003F to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_v2f32_f32imm1:
				; GFX9: s_mov_b32 [[SREG:s[0-9]+]], 1.0
				; GFX900: v_mad_mix_f32 v2, v0, v1, [[SREG]] op_sel_hi:[1,1,0] ; encoding
				; GFX900: v_mad_mix_f32 v1, v0, v1, [[SREG]] op_sel:[1,1,0] op_sel_hi:[1,1,0] ; encoding
				; GFX900: v_mov_b32_e32 v0, v2

				; GFX906: v_fma_mix_f32 v2, v0, v1, [[SREG]] op_sel_hi:[1,1,0] ; encoding
				; GFX906: v_fma_mix_f32 v1, v0, v1, [[SREG]] op_sel:[1,1,0] op_sel_hi:[1,1,0] ; encoding
				; GFX906: v_mov_b32_e32 v0, v2
				define <2 x float> @v_mad_mix_v2f32_f32imm1(<2 x half> %src0, <2 x half> %src1) #0 {
				%src0.ext = fpext <2 x half> %src0 to <2 x float>
				%src1.ext = fpext <2 x half> %src1 to <2 x float>
				%result = tail call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %src0.ext, <2 x float> %src1.ext, <2 x float> <float 1.0, float 1.0>)
				ret <2 x float> %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_v2f32_cvtf16imminv2pi:
				; GFX9: s_mov_b32 [[SREG:s[0-9]+]], 0x3e230000

				; GFX900: v_mad_mix_f32 v2, v0, v1, [[SREG]] op_sel_hi:[1,1,0] ; encoding
				; GFX900: v_mad_mix_f32 v1, v0, v1, [[SREG]] op_sel:[1,1,0] op_sel_hi:[1,1,0] ; encoding
				; GFX900: v_mov_b32_e32 v0, v2

				; GFX906: v_fma_mix_f32 v2, v0, v1, [[SREG]] op_sel_hi:[1,1,0] ; encoding
				; GFX906: v_fma_mix_f32 v1, v0, v1, [[SREG]] op_sel:[1,1,0] op_sel_hi:[1,1,0] ; encoding
				; GFX906: v_mov_b32_e32 v0, v2
				define <2 x float> @v_mad_mix_v2f32_cvtf16imminv2pi(<2 x half> %src0, <2 x half> %src1) #0 {
				%src0.ext = fpext <2 x half> %src0 to <2 x float>
				%src1.ext = fpext <2 x half> %src1 to <2 x float>
				%src2 = fpext <2 x half> <half 0xH3118, half 0xH3118> to <2 x float>
				%result = tail call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %src0.ext, <2 x float> %src1.ext, <2 x float> %src2)
				ret <2 x float> %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_v2f32_f32imminv2pi:
				; GFX9: s_mov_b32 [[SREG:s[0-9]+]], 0.15915494

				; GFX900: v_mad_mix_f32 v2, v0, v1, [[SREG]] op_sel_hi:[1,1,0] ; encoding
				; GFX900: v_mad_mix_f32 v1, v0, v1, [[SREG]] op_sel:[1,1,0] op_sel_hi:[1,1,0] ; encoding
				; GFX900: v_mov_b32_e32 v0, v2

				; GFX906: v_fma_mix_f32 v2, v0, v1, [[SREG]] op_sel_hi:[1,1,0] ; encoding
				; GFX906: v_fma_mix_f32 v1, v0, v1, [[SREG]] op_sel:[1,1,0] op_sel_hi:[1,1,0] ; encoding
				; GFX906: v_mov_b32_e32 v0, v2
				define <2 x float> @v_mad_mix_v2f32_f32imminv2pi(<2 x half> %src0, <2 x half> %src1) #0 {
				%src0.ext = fpext <2 x half> %src0 to <2 x float>
				%src1.ext = fpext <2 x half> %src1 to <2 x float>
				%src2 = fpext <2 x half> <half 0xH3118, half 0xH3118> to <2 x float>
				%result = tail call <2 x float> @llvm.fmuladd.v2f32(<2 x float> %src0.ext, <2 x float> %src1.ext, <2 x float> <float 0x3FC45F3060000000, float 0x3FC45F3060000000>)
				ret <2 x float> %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_clamp_f32_f16hi_f16hi_f16hi_elt:
				; GFX900: v_mad_mix_f32 v0, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp ; encoding
				; GFX906: v_fma_mix_f32 v0, v0, v1, v2 op_sel:[1,1,1] op_sel_hi:[1,1,1] clamp ; encoding

				; FIXME: Should be using v_mad
				; CIVI: v_mac_f32_e32 v{{[0-9]}}, v{{[0-9]}}, v{{[0-9]}}
				; CIVI-NEXT: v_mul_f32_e64 v{{[0-9]}}, 1.0, v{{[0-9]}} clamp
				define float @v_mad_mix_clamp_f32_f16hi_f16hi_f16hi_elt(<2 x half> %src0, <2 x half> %src1, <2 x half> %src2) #0 {
				%src0.hi = extractelement <2 x half> %src0, i32 1
				%src1.hi = extractelement <2 x half> %src1, i32 1
				%src2.hi = extractelement <2 x half> %src2, i32 1
				%src0.ext = fpext half %src0.hi to float
				%src1.ext = fpext half %src1.hi to float
				%src2.ext = fpext half %src2.hi to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				%max = call float @llvm.maxnum.f32(float %result, float 0.0)
				%clamp = call float @llvm.minnum.f32(float %max, float 1.0)
				ret float %clamp
				}

				; GCN-LABEL: no_mix_simple:
				; GCN: s_waitcnt
				; GCN-NEXT: v_{{mad|fma}}_f32 v0, v0, v1, v2
				; GCN-NEXT: s_setpc_b64
				define float @no_mix_simple(float %src0, float %src1, float %src2) #0 {
				%result = call float @llvm.fmuladd.f32(float %src0, float %src1, float %src2)
				ret float %result
				}

				; GCN-LABEL: no_mix_simple_fabs:
				; GCN: s_waitcnt
				; CIVI-NEXT: v_mad_f32 v0, |v0|, v1, v2
				; GFX900-NEXT: v_mad_f32 v0, |v0|, v1, v2
				; GFX906-NEXT: v_fma_f32 v0, |v0|, v1, v2
				; GCN-NEXT: s_setpc_b64
				define float @no_mix_simple_fabs(float %src0, float %src1, float %src2) #0 {
				%src0.fabs = call float @llvm.fabs.f32(float %src0)
				%result = call float @llvm.fmuladd.f32(float %src0.fabs, float %src1, float %src2)
				ret float %result
				}

				; FIXME: Should abe able to select in thits case
				; All sources are converted from f16, so it doesn't matter
				; v_mad_mix_f32 flushes.

				; GCN-LABEL: {{^}}v_mad_mix_f32_f16lo_f16lo_f16lo_f32_denormals:
				; GFX900: v_cvt_f32_f16
				; GFX900: v_cvt_f32_f16
				; GFX900: v_cvt_f32_f16
				; GFX900: v_fma_f32
				define float @v_mad_mix_f32_f16lo_f16lo_f16lo_f32_denormals(half %src0, half %src1, half %src2) #1 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				arsenm (inline comment, Not Done): Typo thits
				%src2.ext = fpext half %src2 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_f16lo_f16lo_f32_denormals:
				; GFX900: v_cvt_f32_f16
				; GFX900: v_cvt_f32_f16
				; GFX900: v_fma_f32

				; GFX906-NOT: v_cvt_f32_f16
				; GFX906: v_fma_mix_f32 v0, v0, v1, v2 op_sel_hi:[1,1,0]
				define float @v_mad_mix_f32_f16lo_f16lo_f32_denormals(half %src0, half %src1, float %src2) #1 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_f16lo_f16lo_f16lo_f32_denormals_fmulfadd:
				; GFX9: v_cvt_f32_f16
				; GFX9: v_cvt_f32_f16
				; GFX9: v_cvt_f32_f16
				; GFX9: v_mul_f32
				; GFX9: v_add_f32
				define float @v_mad_mix_f32_f16lo_f16lo_f16lo_f32_denormals_fmulfadd(half %src0, half %src1, half %src2) #1 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%mul = fmul float %src0.ext, %src1.ext
				%result = fadd float %mul, %src2.ext
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_f16lo_f16lo_f32_denormals_fmulfadd:
				; GFX9: v_cvt_f32_f16
				; GFX9: v_cvt_f32_f16
				; GFX9: v_mul_f32
				; GFX9: v_add_f32
				define float @v_mad_mix_f32_f16lo_f16lo_f32_denormals_fmulfadd(half %src0, half %src1, float %src2) #1 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%mul = fmul float %src0.ext, %src1.ext
				%result = fadd float %mul, %src2
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_f16lo_f16lo_f16lo_f32_flush_fmulfadd:
				; GCN: s_waitcnt
				; GFX900-NEXT: v_mad_mix_f32 v0, v0, v1, v2 op_sel_hi:[1,1,1] ; encoding
				; GFX906-NEXT: v_fma_mix_f32 v0, v0, v1, v2 op_sel_hi:[1,1,1] ; encoding
				; GFX9-NEXT: s_setpc_b64
				define float @v_mad_mix_f32_f16lo_f16lo_f16lo_f32_flush_fmulfadd(half %src0, half %src1, half %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%mul = fmul contract float %src0.ext, %src1.ext
				%result = fadd contract float %mul, %src2.ext
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_f16lo_f16lo_f32_flush_fmulfadd:
				; GCN: s_waitcnt
				; GFX900-NEXT: v_mad_mix_f32 v0, v0, v1, v2 op_sel_hi:[1,1,0] ; encoding
				; GFX906-NEXT: v_fma_mix_f32 v0, v0, v1, v2 op_sel_hi:[1,1,0] ; encoding
				; GFX9-NEXT: s_setpc_b64
				define float @v_mad_mix_f32_f16lo_f16lo_f32_flush_fmulfadd(half %src0, half %src1, float %src2) #0 {
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%mul = fmul contract float %src0.ext, %src1.ext
				%result = fadd contract float %mul, %src2
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_negprecvtf16lo_f16lo_f16lo:
				; GFX9: s_waitcnt
				; GFX900-NEXT: v_mad_mix_f32 v0, -v0, v1, v2 op_sel_hi:[1,1,1] ; encoding
				; GFX906-NEXT: v_fma_mix_f32 v0, -v0, v1, v2 op_sel_hi:[1,1,1] ; encoding
				; GFX9-NEXT: s_setpc_b64

				; FIXME: Should be v_mad?
				; CIVI: v_mac_f32_e32
				define float @v_mad_mix_f32_negprecvtf16lo_f16lo_f16lo(i32 %src0.arg, half %src1, half %src2) #0 {
				%src0.arg.bc = bitcast i32 %src0.arg to <2 x half>
				%src0 = extractelement <2 x half> %src0.arg.bc, i32 0
				%src0.neg = fsub half -0.0, %src0
				%src0.ext = fpext half %src0.neg to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				; %src0.ext.neg = fsub float -0.0, %src0.ext
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				ret float %result
				}

				; Make sure we don't fold pre-cvt fneg if we already have a fabs
				; GCN-LABEL: {{^}}v_mad_mix_f32_precvtnegf16hi_abs_f16lo_f16lo:
				; GFX900: s_waitcnt
				define float @v_mad_mix_f32_precvtnegf16hi_abs_f16lo_f16lo(i32 %src0.arg, half %src1, half %src2) #0 {
				%src0.arg.bc = bitcast i32 %src0.arg to <2 x half>
				%src0 = extractelement <2 x half> %src0.arg.bc, i32 1
				%src0.neg = fsub half -0.0, %src0
				%src0.ext = fpext half %src0.neg to float
				%src0.ext.abs = call float @llvm.fabs.f32(float %src0.ext)
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext.abs, float %src1.ext, float %src2.ext)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_precvtabsf16hi_f16lo_f16lo:
				; GFX9: s_waitcnt
				; GFX900-NEXT: v_mad_mix_f32 v0, |v0|, v1, v2 op_sel:[1,0,0] op_sel_hi:[1,1,1]
				; GFX906-NEXT: v_fma_mix_f32 v0, |v0|, v1, v2 op_sel:[1,0,0] op_sel_hi:[1,1,1]
				; GFX9-NEXT: s_setpc_b64
				define float @v_mad_mix_f32_precvtabsf16hi_f16lo_f16lo(i32 %src0.arg, half %src1, half %src2) #0 {
				%src0.arg.bc = bitcast i32 %src0.arg to <2 x half>
				%src0 = extractelement <2 x half> %src0.arg.bc, i32 1
				%src0.abs = call half @llvm.fabs.f16(half %src0)
				%src0.ext = fpext half %src0.abs to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				ret float %result
				}

				; FIXME: Should be -v0 and without v_mov/v_pk

				; GCN-LABEL: {{^}}v_mad_mix_f32_preextractfneg_f16hi_f16lo_f16lo:
				; GFX9: s_waitcnt
				; GFX9-NEXT: v_mov_b32_e32 v3, 0x80008000
				; GFX9-NEXT: v_pk_add_f16 v0, v3, v0 neg_lo:[0,1] neg_hi:[0,1]
				; GFX900-NEXT: v_mad_mix_f32 v0, v0, v1, v2 op_sel:[1,0,0] op_sel_hi:[1,1,1]
				; GFX906-NEXT: v_fma_mix_f32 v0, v0, v1, v2 op_sel:[1,0,0] op_sel_hi:[1,1,1]
				; GFX9-NEXT: s_setpc_b64
				define float @v_mad_mix_f32_preextractfneg_f16hi_f16lo_f16lo(i32 %src0.arg, half %src1, half %src2) #0 {
				%src0.arg.bc = bitcast i32 %src0.arg to <2 x half>
				%fneg = fsub <2 x half> <half -0.0, half -0.0>, %src0.arg.bc
				%src0 = extractelement <2 x half> %fneg, i32 1
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_preextractfabs_f16hi_f16lo_f16lo:
				; GFX9: s_waitcnt
				; GFX900-NEXT: v_mad_mix_f32 v0, |v0|, v1, v2 op_sel:[1,0,0] op_sel_hi:[1,1,1]
				; GFX906-NEXT: v_fma_mix_f32 v0, |v0|, v1, v2 op_sel:[1,0,0] op_sel_hi:[1,1,1]
				; GFX9-NEXT: s_setpc_b64
				define float @v_mad_mix_f32_preextractfabs_f16hi_f16lo_f16lo(i32 %src0.arg, half %src1, half %src2) #0 {
				%src0.arg.bc = bitcast i32 %src0.arg to <2 x half>
				%fabs = call <2 x half> @llvm.fabs.v2f16(<2 x half> %src0.arg.bc)
				%src0 = extractelement <2 x half> %fabs, i32 1
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				ret float %result
				}

				; GCN-LABEL: {{^}}v_mad_mix_f32_preextractfabsfneg_f16hi_f16lo_f16lo:
				; GFX9: s_waitcnt
				; GFX9-NEXT: v_and_b32_e32 v0, 0x7fff7fff, v0
				; GFX9-NEXT: v_mov_b32_e32 v3, 0x80008000
				; GFX9-NEXT: v_pk_add_f16 v0, v3, v0 neg_lo:[0,1] neg_hi:[0,1]
				; GFX900-NEXT: v_mad_mix_f32 v0, v0, v1, v2 op_sel:[1,0,0] op_sel_hi:[1,1,1]
				; GFX906-NEXT: v_fma_mix_f32 v0, v0, v1, v2 op_sel:[1,0,0] op_sel_hi:[1,1,1]
				; GFX9-NEXT: s_setpc_b64
				define float @v_mad_mix_f32_preextractfabsfneg_f16hi_f16lo_f16lo(i32 %src0.arg, half %src1, half %src2) #0 {
				%src0.arg.bc = bitcast i32 %src0.arg to <2 x half>
				%fabs = call <2 x half> @llvm.fabs.v2f16(<2 x half> %src0.arg.bc)
				%fneg.fabs = fsub <2 x half> <half -0.0, half -0.0>, %fabs
				%src0 = extractelement <2 x half> %fneg.fabs, i32 1
				%src0.ext = fpext half %src0 to float
				%src1.ext = fpext half %src1 to float
				%src2.ext = fpext half %src2 to float
				%result = tail call float @llvm.fmuladd.f32(float %src0.ext, float %src1.ext, float %src2.ext)
				ret float %result
				}

				declare half @llvm.fabs.f16(half) #2
				declare <2 x half> @llvm.fabs.v2f16(<2 x half>) #2
				declare float @llvm.fabs.f32(float) #2
				declare float @llvm.minnum.f32(float, float) #2
				declare float @llvm.maxnum.f32(float, float) #2
				declare float @llvm.fmuladd.f32(float, float, float) #2
				declare <2 x float> @llvm.fmuladd.v2f32(<2 x float>, <2 x float>, <2 x float>) #2

				attributes #0 = { nounwind "denormal-fp-math-f32"="preserve-sign,preserve-sign" }
				attributes #1 = { nounwind "denormal-fp-math-f32"="ieee,ieee" }
				attributes #2 = { nounwind readnone speculatable }
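
Note on the cvtf16imm checks in the test above: the expected 0x3e230000 and 0x367c0000 constants are what one gets by widening the f16 bit patterns 0xH3118 and 0xH003F to f32. The following is a minimal sketch of that widening, not code from the patch; the helper name halfBitsToFloatBits is made up for illustration, and it assumes plain IEEE-754 binary16/binary32 layouts.

```cpp
#include <cstdint>

// Hypothetical helper (not part of LLVM): widen an IEEE-754 binary16 bit
// pattern to the binary32 bit pattern of the same value. The conversion is
// exact, so no rounding is involved.
constexpr uint32_t halfBitsToFloatBits(uint16_t H) {
  uint32_t Sign = static_cast<uint32_t>(H & 0x8000u) << 16;
  uint32_t Exp = (H >> 10) & 0x1Fu;
  uint32_t Mant = H & 0x3FFu;

  if (Exp == 0x1Fu) // Inf or NaN: all-ones exponent, payload shifted up.
    return Sign | 0x7F800000u | (Mant << 13);

  if (Exp == 0) {
    if (Mant == 0) // Signed zero.
      return Sign;
    // Subnormal f16: the value is Mant * 2^-24. Shift until bit 10 holds
    // the leading one; every f16 subnormal is a normal f32.
    uint32_t Shift = 0;
    while ((Mant & 0x400u) == 0) {
      Mant <<= 1;
      ++Shift;
    }
    Mant &= 0x3FFu;
    // Unbiased exponent is -14 - Shift; rebias it for f32 (bias 127).
    return Sign | ((127u - 14u - Shift) << 23) | (Mant << 13);
  }

  // Normal number: rebias the exponent from 15 to 127 (add 112).
  return Sign | ((Exp + 112u) << 23) | (Mant << 13);
}

// The two f16 constants used by the cvtf16imm tests.
static_assert(halfBitsToFloatBits(0x3118) == 0x3E230000u, "f16 1/2pi");
static_assert(halfBitsToFloatBits(0x003F) == 0x367C0000u, "f16 denormal 63");
```

The first result differs from the f32 1/2pi pattern 0x3e22f983 quoted in the test comment, which is why the converted f16 value is checked as a 32-bit literal rather than as the inline 0.15915494 constant.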

llvm/utils/TableGen/GlobalISelEmitter.cpp

[2,516 lines not shown]
		if (B.OperandPredicateMatcher::isHigherPriorityThan(*this))
		return false;

		if (const InstructionOperandMatcher *BP =
		dyn_cast<InstructionOperandMatcher>(&B))
		if (InsnMatcher->isHigherPriorityThan(*BP->InsnMatcher))
		return true;
		return false;
		}

		/// Report the maximum number of temporary operands needed by the predicate
		/// matcher.
		unsigned countRendererFns() const override {
		return InsnMatcher->countRendererFns();
		}
		arsenm (inline comment, Done): This looks unrelated
		Pierre-vh (author, inline comment, Done): Actually it isn't, it looks like the FMA/MAD patterns expose a bug in GISel. Without that there's a crash (segfault) in `executeMatchTable` because the number of renderer fns is incorrectly reported and it doesn't allocate enough entries in the vector that holds them. It seems like we rarely go above 2 renderers but here there's 4 IIRC
		};

		void InstructionMatcher::optimize() {
		SmallVector<std::unique_ptr<PredicateMatcher>, 8> Stash;
		const auto &OpcMatcher = getOpcodeMatcher();

		Stash.push_back(predicates_pop_front());
		if (Stash.back().get() == &OpcMatcher) {
[3,773 lines not shown]
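
Note on the inline discussion above: the reply argues that the renderer count has to be forwarded through the nested instruction matcher, otherwise too few slots are reserved. The following is a rough, self-contained model of that argument. All type and function names here (PredicateModel, ComplexPatternModel, InstructionModel, NestedInstructionModel, reportedRendererFns) are hypothetical stand-ins, not the real GlobalISelEmitter classes, and the pattern shape (one renderer at the root plus three in a nested instruction) is only an assumption chosen to reach the count of 4 mentioned in the comment.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical model of a predicate; by default it needs no renderer slots.
struct PredicateModel {
  virtual ~PredicateModel() = default;
  virtual unsigned countRendererFns() const { return 0; }
};

// Stand-in for an operand matched by a complex pattern: one renderer slot.
struct ComplexPatternModel : PredicateModel {
  unsigned countRendererFns() const override { return 1; }
};

// Stand-in for an instruction matcher: sums the needs of its operands.
struct InstructionModel {
  std::vector<std::unique_ptr<PredicateModel>> OperandPredicates;
  unsigned countRendererFns() const {
    unsigned N = 0;
    for (const auto &P : OperandPredicates)
      N += P->countRendererFns();
    return N;
  }
};

// Stand-in for an operand that is itself defined by a nested instruction.
// ForwardCount == false models the assumed behaviour without the override.
struct NestedInstructionModel : PredicateModel {
  InstructionModel Insn;
  bool ForwardCount;
  explicit NestedInstructionModel(bool Forward) : ForwardCount(Forward) {}
  unsigned countRendererFns() const override {
    return ForwardCount ? Insn.countRendererFns() : 0;
  }
};

// Builds root(complex, nested(complex, complex, complex)) and returns the
// number of renderer slots the root reports.
inline unsigned reportedRendererFns(bool ForwardCount) {
  InstructionModel Root;
  Root.OperandPredicates.push_back(std::make_unique<ComplexPatternModel>());
  auto Nested = std::make_unique<NestedInstructionModel>(ForwardCount);
  for (int I = 0; I < 3; ++I)
    Nested->Insn.OperandPredicates.push_back(
        std::make_unique<ComplexPatternModel>());
  Root.OperandPredicates.push_back(std::move(Nested));
  assert(Root.OperandPredicates.size() == 2 && "root has two operands");
  return Root.countRendererFns();
}
```

In this model reportedRendererFns(false) under-reports the slots needed by the nested operands, while reportedRendererFns(true) counts all of them. A consumer that sizes its storage from the smaller number and then writes every slot would run past the end, which matches the kind of crash described in the comment.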


[AMDGPU][GlobalISel] Support mad/fma_mix selection (Closed, Public)

Details

Diff Detail

Unit Tests: Failed

Event Timeline

Revision Contents

Diff 462903

llvm/lib/Target/AMDGPU/AMDGPUGISel.td

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.h

llvm/lib/Target/AMDGPU/AMDGPUInstructionSelector.cpp

llvm/lib/Target/AMDGPU/AMDGPURegBankCombiner.cpp

llvm/lib/Target/AMDGPU/VOP3PInstructions.td

llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-ext-fma.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-add-ext-mul.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-sub-ext-mul.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/combine-fma-sub-ext-neg-mul.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/fmed3.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/mad-mix-hi.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/mad-mix-lo.ll

llvm/test/CodeGen/AMDGPU/GlobalISel/mad-mix.ll

llvm/utils/TableGen/GlobalISelEmitter.cpp
