This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU/SDAG: Refine the fold to v_mad_[iu]64_[iu]32
ClosedPublic

Authored by nhaehnle on Apr 14 2022, 10:39 PM.

Details

Summary

Only fold for uniform values on pre-GFX9 chips. GFX9+ allow us
to keep the calculation entirely on the SALU.

We never fold if the mul has multiple uses, since that effectively
duplicates the (expensive) multiply.

Finally, we expand 64x32 and 64x64 multiplies here as well, if they
feed into an addition. This results in better code generation than
the generic expansion for such multiplies because we end up using
the accumulator of the MAD instructions.
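
As a rough illustration of the 64x32 case (not the code in this patch; the helper names below are made up), the multiply-add decomposes into two MAD-shaped operations whose accumulators absorb both the addend and the carry between halves:

#include <cstdint>

// 32x32 -> 64 multiply plus a 64-bit accumulator, the basic v_mad_u64_u32 shape.
static uint64_t mad_u64_u32(uint32_t a, uint32_t b, uint64_t acc) {
  return (uint64_t)a * b + acc;
}

// (64-bit a) * (32-bit b) + (64-bit c), written so the addend c and the high
// partial product both flow through the accumulator instead of separate adds.
static uint64_t mul64x32_add(uint64_t a, uint32_t b, uint64_t c) {
  uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
  uint64_t lo = mad_u64_u32(a_lo, b, c);         // low product + addend
  uint64_t hi = mad_u64_u32(a_hi, b, lo >> 32);  // high product + carried high half
  return ((uint64_t)(uint32_t)hi << 32) | (uint32_t)lo;
}

The 64x64 case follows the same pattern with one more cross product.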

Diff Detail

Event Timeline

nhaehnle created this revision. Apr 14 2022, 10:39 PM
Herald added a project: Restricted Project. · View Herald Transcript · Apr 14 2022, 10:39 PM
nhaehnle requested review of this revision. Apr 14 2022, 10:39 PM
Herald added a subscriber: wdng. · View Herald Transcript

I would have expected this to put this back together after the generic multiply expansion though.

Also it would be nice to get this one ported to GlobalISel

llvm/test/CodeGen/AMDGPU/mad_64_32.ll
535–536

This is a regression? It looks to be the same cycle count for more code size

rampitec added inline comments. Apr 15 2022, 10:20 AM
llvm/test/CodeGen/AMDGPU/mad_64_32.ll
535–536

Actually, since gfx90a, v_mad_u64/i64 is full rate, so it is even more cycles in that case.

I would have expected this to put this back together after the generic multiply expansion though.

So the original code is (add (mul x, y), z). The generic multiply expansion turns that into (add M, z), where M is expanded into something along the lines of:

lo,hi = umul_lohi (trunc x), (trunc y)
a = mul (trunc x), (shr y)
b = mul (shr x), (trunc y)
M = or (shl (zext (add3 hi, a, b)), 32), (zext lo)

The umul_lohi becomes v_mad_u64_u32 but it's used only as an addition, as you can see in the pre-change lit tests. Reliably reassociating the code sequence above into a form where the pre-change performAddCombine would trigger seems pretty complicated.
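
For concreteness, a plain C++ equivalent of what that sequence computes for the low 64 bits (variable names chosen to mirror the DAG values above, not taken from the compiler):

#include <cstdint>

// Low 64 bits of a 64x64 multiply, mirroring the umul_lohi/mul/add3 sequence.
static uint64_t generic_mul64_lo(uint64_t x, uint64_t y) {
  uint32_t xl = (uint32_t)x, xh = (uint32_t)(x >> 32);
  uint32_t yl = (uint32_t)y, yh = (uint32_t)(y >> 32);
  uint64_t p = (uint64_t)xl * yl;              // lo,hi = umul_lohi (trunc x), (trunc y)
  uint32_t lo = (uint32_t)p, hi = (uint32_t)(p >> 32);
  uint32_t a = xl * yh;                        // a = mul (trunc x), (shr y)
  uint32_t b = xh * yl;                        // b = mul (shr x), (trunc y)
  return ((uint64_t)(hi + a + b) << 32) | lo;  // M = or (shl (zext (add3 hi, a, b)), 32), (zext lo)
}

The z addend is then added to M as a whole, so without reassociating across this entire sequence it never reaches the MAD's accumulator input.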

Also it would be nice to get this one ported to GlobalISel

Agreed, but it looks like GlobalISel doesn't generate mad_64_32 at all at the moment. So it's more than just porting this particular tweak.

llvm/test/CodeGen/AMDGPU/mad_64_32.ll
535–536

Good to know, I'm going to rework that part.

Agreed, but it looks like GlobalISel doesn't generate mad_64_32 at all at the moment. So it's more than just porting this particular tweak.

Right, that's the point. Might as well start porting combines as they're looked at again

nhaehnle updated this revision to Diff 423758. Apr 19 2022, 3:59 PM
  • use hasSMulHi()
  • tweak the heuristics for when not to fold based on multiple uses:
    • always fold when multiplication is fast
    • otherwise, fold unless there is a non-foldable use or at least three foldable uses; this is aimed at reducing cycle counts with code density as a tie-breaker (a small sketch of this rule follows below)
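
A rough sketch of that rule in isolation (illustrative only; the function and parameter names are not taken from the patch):

// Decide whether to fold (add (mul x, y), z) into a MAD when the multiply has
// several users. "Foldable" means the user is itself an add that could absorb
// the multiply.
static bool shouldFoldMulIntoMad(bool MulIsFullRate, unsigned FoldableUses,
                                 unsigned NonFoldableUses) {
  if (MulIsFullRate)
    return true;            // duplicating the multiply is cheap anyway
  if (NonFoldableUses > 0)
    return false;           // the multiply has to be emitted regardless
  return FoldableUses < 3;  // past that, code growth outweighs the cycles saved
}
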
arsenm added inline comments. Apr 25 2022, 1:30 PM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
10750

s/number of multiply/number of multiplies/

10756

I don't understand why you're checking this if you bail on anything that is not ISD::ADD. I guess it would make sense if you were handling the carry-out adds in a separate patch?

10803

Could just hardcode this to i32 instead of going through getShiftAmountConstant

10808–10813

A comment with the DAG formed here would be helpful

10822

Ditto

foad added a comment. Apr 26 2022, 3:06 AM

Only fold for uniform values on pre-GFX9 chips.

This made my brain hurt. I think it might be clearer as "For uniform values, only fold on pre-GFX9 chips." English is really imprecise.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
10756

LHS is the MUL here, not an ADD, so there's really no need to check ResNo.

10804

I don't know if it makes any practical difference, but other code like AMDGPUTargetLowering::LowerUDIVREM64 uses EXTRACT_ELEMENT to split an i64 into a pair of i32s, and BITCAST(BUILD_VECTOR ...) to reassemble them.
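
For reference, that idiom looks roughly like this (a sketch only; the helper names are invented, and only the EXTRACT_ELEMENT / getBuildVector / getBitcast calls mirror what LowerUDIVREM64 does):

// Split an i64 SDValue into its two i32 halves without shifts.
static std::pair<SDValue, SDValue> splitI64(SelectionDAG &DAG, const SDLoc &DL,
                                            SDValue V) {
  SDValue Zero = DAG.getConstant(0, DL, MVT::i32);
  SDValue One = DAG.getConstant(1, DL, MVT::i32);
  SDValue Lo = DAG.getNode(ISD::EXTRACT_ELEMENT, DL, MVT::i32, V, Zero);
  SDValue Hi = DAG.getNode(ISD::EXTRACT_ELEMENT, DL, MVT::i32, V, One);
  return {Lo, Hi};
}

// Reassemble an i64 from two i32 halves via BITCAST(BUILD_VECTOR Lo, Hi).
static SDValue joinI64(SelectionDAG &DAG, const SDLoc &DL,
                       SDValue Lo, SDValue Hi) {
  return DAG.getBitcast(MVT::i64,
                        DAG.getBuildVector(MVT::v2i32, DL, {Lo, Hi}));
}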

arsenm added inline comments. Apr 26 2022, 6:25 AM
llvm/lib/Target/AMDGPU/SIISelLowering.cpp
10804

Using the shift adds extra steps. The combine on 64-bit shifts will turn this into the vector build.

nhaehnle updated this revision to Diff 426533. May 2 2022, 3:38 PM
nhaehnle marked 8 inline comments as done.

Address review comments

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
10756

Makes sense, thanks.

10804

I'm rearranging the code slightly to use EXTRACT_ELEMENT and BUILD_VECTOR more directly. However, I'm keeping some TRUNCATEs around (instead of extracting the low part) because that turns out to result in better code generation in some tests. I also noticed a crash while doing so and fixed it.

10808–10813

Ok.

foad accepted this revision. May 4 2022, 2:13 AM

LGTM.

llvm/lib/Target/AMDGPU/SIISelLowering.cpp
10734

Nit: use a range-based for loop over uses()?
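
That is, something along these lines (illustrative fragment; N stands for the node whose users are being inspected, and the ISD::ADD check is just an example):

// Range-based iteration over the users of the node.
for (SDNode *User : N->uses()) {
  // Bail if any user is not an add that could absorb the multiply.
  if (User->getOpcode() != ISD::ADD)
    return SDValue();
}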

This revision is now accepted and ready to land. May 4 2022, 2:13 AM
This revision was landed with ongoing or failed builds. May 10 2022, 7:16 AM
This revision was automatically updated to reflect the committed changes.