This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Pre-allocate WWM registers to reduce VGPR pressure.
ClosedPublic

Authored by sheredom on Mar 13 2019, 6:27 AM.

Details

Summary

This change incorporates an effort by Connor Abbott to change how we deal with WWM operations potentially trashing valid values in inactive lanes.

Previously, the SIFixWWMLiveness pass would work out which registers were being trashed within WWM regions, and ensure that the register allocator did not have any values it was depending on resident in those registers if the WWM section would trash them. This worked perfectly well, but would sometimes cause severe register pressure when the WWM section resided before divergent control flow (or at least that is where I mostly observed it).

This fix instead runs through the WWM sections and pre-allocates some registers for WWM. It then reserves these registers so that the register allocator cannot use them. This results in a significant register saving on some WWM shaders I'm working with (130 -> 104 VGPRs, with just this change!).
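For illustration only, here is a minimal C++ sketch of the core step, assuming the LiveIntervals/LiveRegMatrix machinery that is available around register allocation; the helper name preAssignWWMReg and its bookkeeping are hypothetical stand-ins, not the code in SIPreAllocateWWMRegs.cpp:

  // Hypothetical sketch: hand a virtual register defined in a WWM section a
  // physical VGPR before normal register allocation runs.
  #include "llvm/CodeGen/LiveIntervals.h"
  #include "llvm/CodeGen/LiveRegMatrix.h"
  #include "llvm/CodeGen/MachineRegisterInfo.h"

  using namespace llvm;

  // Returns the chosen physical register, or 0 if no free VGPR exists.
  static unsigned preAssignWWMReg(unsigned VirtReg, MachineRegisterInfo &MRI,
                                  LiveIntervals &LIS, LiveRegMatrix &Matrix) {
    LiveInterval &LI = LIS.getInterval(VirtReg);
    for (unsigned PhysReg : *MRI.getRegClass(VirtReg)) {
      if (Matrix.checkInterference(LI, PhysReg) == LiveRegMatrix::IK_Free) {
        // Pin the WWM value to this register now; the pass then records the
        // register in the machine function info so it stays reserved for the
        // rest of register allocation.
        Matrix.assign(LI, PhysReg);
        return PhysReg;
      }
    }
    return 0; // no free VGPR; leave the value to the normal allocator
  }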

This change was entirely thought up by @cwabbott (I claim no credit for the original idea!), but I had some time to look at it and so we agreed that I could give it a final polish for submission.

Diff Detail

Event Timeline

sheredom created this revision.Mar 13 2019, 6:27 AM
Herald added a project: Restricted Project.Mar 13 2019, 6:27 AM

I really don't like introducing new, dynamically reserved registers for this. It's going to introduce hell for dealing with any kind of ABI, and reserved registers are generally a bad idea. There's also nothing guaranteeing there are any free registers available to reserve, since you are just grabbing totally unused ones. This is going to just hit some variant of the problem I've been working on solving for handling SGPR->VGPR spills. Can WWM code be moved into a bundle or something?

include/llvm/IR/IntrinsicsAMDGPU.td
1363–1365

This is a separate fix that can be split into its own patch

lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
53

You can remove this

223–228

I don't like this hardcoded opcode check. Why is S_OR_SAVEEXEC_B64 special?

I really don't like introducing new, dynamically reserved registers for this. It's going to introduce hell for dealing with any kind of ABI, and reserved registers are generally a bad idea. There's also nothing guaranteeing there are any free registers available to reserve, since you are just grabbing totally unused ones. This is going to just hit some variant of the problem I've been working on solving for handling SGPR->VGPR spills. Can WWM code be moved into a bundle or something?

No, since the problem the current pass and this new pass are trying to solve affects register allocation for code that is arbitrarily far away from the original WWM sequence. For a more detailed explanation, I can't do any better than the comment at the start of SIFixWWMLiveness.cpp. (There will also be problems if RA decides to split a live interval inside a WWM sequence, which can be fixed by bundling it, but that's a completely different problem).

While it might seem dangerous, in practice this works out, since WWM sequences that the frontends currently emit only require a few registers. Hence this pass is guaranteed to succeed, even if there's very high register pressure. If allocating the registers needed for the WWM sequence fails and RA decides to spill something inside there, then you're pretty much toast anyways since the same invalid-lane-clobbering concerns would reappear.

One idea would be to add a way to tell RA that a certain live range absolutely cannot be split (and probably boost its priority as well, lest we fail to allocate it), pre-allocate one or more of these unsplittable registers for WWM, make every definition in the WWM sequence a partial definition, and add fake definitions of the WWM registers in the closest block with uniform control flow that dominates the WWM sequence in order to prevent definitions whose invalid lanes could be clobbered from using the WWM registers. This gives RA a little more flexibility and means that potentially some other operations could use the WWM registers, but you still basically wind up preallocating them.

One idea would be to add a way to tell RA that a certain live range absolutely cannot be split (and probably boost its priority as well, lest we fail to allocate it), pre-allocate one or more of these unsplittable registers for WWM, make every definition in the WWM sequence a partial definition, and add fake definitions of the WWM registers in the closest block with uniform control flow that dominates the WWM sequence in order to prevent definitions whose invalid lanes could be clobbered from using the WWM registers. This gives RA a little more flexibility and means that potentially some other operations could use the WWM registers, but you still basically wind up preallocating them.

Declaring a certain live range unsplittable is impossible. LiveIntervals sort of supports it, but not all the allocators use it. In particular, FastRegAlloc doesn't really track liveness and spills all values live out of a block. It's important that anything works correctly without LiveIntervals. If you can express the constraints with some series of uses and defs, that would be preferable.

My main concerns are making sure this works with:

  1. -O0/fastregalloc
  2. Presence of calls
  3. Inline asm or any other physical register constraints
test/CodeGen/AMDGPU/wwm-reserved.ll
3

Needs a -O0 run line

58

Needs some cases with control flow

If you can express the constraints with some series of uses and defs, that would be preferable.

Unfortunately that's just not possible. The best thing we could come up with that does this was the current SIFixWWMLiveness pass, and even with subsequent modifications, it's still pretty terrible in practice. We just can't get decent performance without doing something like this hack.

Actually, now that I think about it, I believe we realized that SIFixWWMLiveness has a giant hole in that if any of the extra live ranges we insert are split, it'll fall over. I don't think anyone has come up with a way to express the constraints only with extra defs and uses in a way that always works, and I'm not sure it's possible. The issue is that we're lying to LLVM RA by pretending that vector instructions always fully clobber their destinations; before, we were careful to never write to any inactive channels in order to keep up the charade, but WWM instructions force us to deal with it somehow. Fully informing LLVM of what's going on would involve marking every vector instruction as partially clobbering its destination, even the move instructions and load/store instructions LLVM emits during RA, which of course would tank performance unless LLVM is taught about predicated liveness -- but of course that's a whole lot of work that opens another can of worms (register pressure is suddenly not that meaningful anymore...).

The best thing we could come up with that does this was the current SIFixWWMLiveness pass, and even with subsequent modifications, it's still pretty terrible in practice.

cwabbott added inline comments.Mar 13 2019, 12:17 PM
lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
223–228

IIRC I wrote this since WholeQuadMode currently just emits an S_OR_SAVEEXEC_B64 foo, -1 in order to start WWM, so you have to check it manually in order to know when WWM starts. Maybe it's better to go back and add an ENTER_WWM pseudo-instruction like the pre-existing EXIT_WWM.
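For reference, a hedged sketch of what that manual check might look like; the helper name and the explicit-operand index are assumptions based on the usual dst, src layout of the save-exec instructions, not code from this patch:

  // Hypothetical sketch; assumes it lives in the AMDGPU target directory so
  // the generated opcode enum is visible through SIInstrInfo.h.
  #include "SIInstrInfo.h"
  #include "llvm/CodeGen/MachineInstr.h"

  using namespace llvm;

  // WholeQuadMode begins a WWM region with "S_OR_SAVEEXEC_B64 <save>, -1", so
  // without an ENTER_WWM pseudo-instruction a later pass has to pattern-match
  // that instruction to know where WWM starts.
  static bool looksLikeWWMEntry(const MachineInstr &MI) {
    return MI.getOpcode() == AMDGPU::S_OR_SAVEEXEC_B64 &&
           MI.getNumExplicitOperands() > 1 && MI.getOperand(1).isImm() &&
           MI.getOperand(1).getImm() == -1;
  }

A dedicated ENTER_WWM pseudo, as suggested, would remove the need for this kind of opcode sniffing.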

sheredom updated this revision to Diff 191074.Mar 18 2019, 6:23 AM

Addressed all review comments.

sheredom marked 6 inline comments as done.Mar 18 2019, 6:43 AM
sheredom added inline comments.
lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
223–228

I've added an ENTER_WWM as suggested!

test/CodeGen/AMDGPU/wwm-reserved.ll
58

Added a CFG and a called function case.

sheredom updated this revision to Diff 191077.Mar 18 2019, 6:44 AM
sheredom marked 2 inline comments as done.

Remove the explicit pass name.

sheredom marked an inline comment as done.Mar 18 2019, 6:44 AM
arsenm added inline comments.Mar 18 2019, 7:33 AM
test/CodeGen/AMDGPU/wwm-reserved.ll
14

Do these tests really need all of this? Can you just have a few sample instructions?

sheredom marked an inline comment as done.Mar 18 2019, 7:36 AM
sheredom added inline comments.
test/CodeGen/AMDGPU/wwm-reserved.ll
14

I can probably cut it down to just a single DPP inst in each WWM section - good idea!

sheredom updated this revision to Diff 191099.Mar 18 2019, 8:37 AM

Reduce the number of DPP calls in the test for cleanliness, and reintroduce convergent on WWM. The CFG test contains the bug that was exposed by the lack of convergent on WWM: LLVM will sink the WWM statement out of the branch, which totally messes up all the calculations.

sheredom marked an inline comment as done.Mar 18 2019, 8:38 AM
sheredom marked 2 inline comments as done.
sheredom added inline comments.
include/llvm/IR/IntrinsicsAMDGPU.td
1363–1365

So I tried to remove this (forgetting why I needed it), and LLVM sinks the WWM out of the branch, which totally messes up the WWM calculation. So this is actually a requirement for the patch, not a separate thing.

sheredom marked an inline comment as done.Mar 18 2019, 9:18 AM
arsenm added inline comments.Mar 18 2019, 9:46 AM
include/llvm/IR/IntrinsicsAMDGPU.td
1363–1365

You can commit that first then

sheredom marked 2 inline comments as done.Mar 19 2019, 1:45 AM
sheredom added inline comments.
include/llvm/IR/IntrinsicsAMDGPU.td
1363–1365
sheredom marked 2 inline comments as done.Mar 19 2019, 10:14 AM
sheredom added inline comments.
include/llvm/IR/IntrinsicsAMDGPU.td
1363–1365

I've submitted that other change, so this should be good to go once I get a sign-off!

arsenm added inline comments.Mar 20 2019, 11:34 AM
lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
154

Can't you just hardcode this to v_mov_b32?

156

MRI->reg_instructions()? It would also save the check that the operand matches later

160

.isCopy()

183–184

I know we have a helper to do this somewhere

arsenm added inline comments.Mar 20 2019, 11:48 AM
lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
154

I'm confused about the constraints for the WWM intrinsic, or lack thereof. The WWM instruction just uses "unknown" and the intrinsic allows any type. Can this be a 64-bit register or greater?

sheredom marked 3 inline comments as done.Mar 21 2019, 2:11 AM
sheredom added inline comments.
lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
154

We only use WWM with 32-bit & 64-bit types in our stack - nothing else.

183–184

Couldn't find it myself in the code.

sheredom updated this revision to Diff 191646.Mar 21 2019, 2:11 AM
sheredom marked an inline comment as done.

Fix review comments.

sheredom marked 3 inline comments as done.Mar 21 2019, 2:13 AM
sheredom updated this revision to Diff 191865.Mar 22 2019, 5:48 AM

Update for two reasons:

  • was missing V_SET_INACTIVE_B64 in the set-inactive check, causing a miscompile with double/int64
  • changed where the movs are injected to tie them explicitly to WWM statements rather than all copies within a WWM section (which produces better codegen and more closely matches the intent of the WWM intrinsic in the first place).
sheredom marked an inline comment as done.Mar 22 2019, 5:49 AM
sheredom updated this revision to Diff 192842.Mar 29 2019, 8:59 AM

Found a fun little bug whereby the physical VGPRs were being coalesced onto previous instructions, and then shouldClusterMemOps was assuming only virtual registers. Added a workaround for that.

tpr accepted this revision.Mar 29 2019, 1:51 PM

LGTM modulo the wrong license on the new file.

lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
6

Wrong license.

This revision is now accepted and ready to land.Mar 29 2019, 1:51 PM
arsenm requested changes to this revision.Apr 1 2019, 8:15 AM

Can you add a test where allocation is impossible? e.g. use

call void asm sideeffect "",
  "~{v0},~{v1},~{v2},~{v3},~{v4},~{v5},~{v6},~{v7},~{v8},~{v9}
  ,~{v10},~{v11},~{v12},~{v13},~{v14},~{v15},~{v16},~{v17},~{v18},~{v19}
  ,~{v20},~{v21},~{v22},~{v23},~{v24},~{v25},~{v26},~{v27},~{v28},~{v29}
  ,~{v30},~{v31},~{v32},~{v33},~{v34},~{v35},~{v36},~{v37},~{v38},~{v39}
  ,~{v40},~{v41},~{v42},~{v43},~{v44},~{v45},~{v46},~{v47},~{v48},~{v49}
  ,~{v50},~{v51},~{v52},~{v53},~{v54},~{v55},~{v56},~{v57},~{v58},~{v59}
  ,~{v60},~{v61},~{v62},~{v63},~{v64},~{v65},~{v66},~{v67},~{v68},~{v69}
  ,~{v70},~{v71},~{v72},~{v73},~{v74},~{v75},~{v76},~{v77},~{v78},~{v79}
  ,~{v80},~{v81},~{v82},~{v83},~{v84},~{v85},~{v86},~{v87},~{v88},~{v89}
  ,~{v90},~{v91},~{v92},~{v93},~{v94},~{v95},~{v96},~{v97},~{v98},~{v99}
  ,~{v100},~{v101},~{v102},~{v103},~{v104},~{v105},~{v106},~{v107},~{v108},~{v109}
  ,~{v110},~{v111},~{v112},~{v113},~{v114},~{v115},~{v116},~{v117},~{v118},~{v119}
  ,~{v120},~{v121},~{v122},~{v123},~{v124},~{v125},~{v126},~{v127},~{v128},~{v129}
  ,~{v130},~{v131},~{v132},~{v133},~{v134},~{v135},~{v136},~{v137},~{v138},~{v139}
  ,~{v140},~{v141},~{v142},~{v143},~{v144},~{v145},~{v146},~{v147},~{v148},~{v149}
  ,~{v150},~{v151},~{v152},~{v153},~{v154},~{v155},~{v156},~{v157},~{v158},~{v159}
  ,~{v160},~{v161},~{v162},~{v163},~{v164},~{v165},~{v166},~{v167},~{v168},~{v169}
  ,~{v170},~{v171},~{v172},~{v173},~{v174},~{v175},~{v176},~{v177},~{v178},~{v179}
  ,~{v180},~{v181},~{v182},~{v183},~{v184},~{v185},~{v186},~{v187},~{v188},~{v189}
  ,~{v190},~{v191},~{v192},~{v193},~{v194},~{v195},~{v196},~{v197},~{v198},~{v199}
  ,~{v200},~{v201},~{v202},~{v203},~{v204},~{v205},~{v206},~{v207},~{v208},~{v209}
  ,~{v210},~{v211},~{v212},~{v213},~{v214},~{v215},~{v216},~{v217},~{v218},~{v219}
  ,~{v220},~{v221},~{v222},~{v223},~{v224},~{v225},~{v226},~{v227},~{v228},~{v229}
  ,~{v230},~{v231},~{v232},~{v233},~{v234},~{v235},~{v236},~{v237},~{v238},~{v239}
  ,~{v240},~{v241},~{v242},~{v243},~{v244},~{v245},~{v246},~{v247},~{v248},~{v249}
  ,~{v250},~{v251},~{v252},~{v253},~{v254},~{v255}"() #1
This revision now requires changes to proceed.Apr 1 2019, 8:15 AM
This revision was not accepted when it landed; it landed in state Needs Revision.Apr 1 2019, 8:19 AM
This revision was automatically updated to reflect the committed changes.
arsenm added inline comments.Apr 1 2019, 8:19 AM
lib/Target/AMDGPU/SIInstrInfo.cpp
454–458

This is a separate change

lib/Target/AMDGPU/SIMachineFunctionInfo.h
260

This should be serialized into YAML. Maybe this should also be in terms of RegUnits?

lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
104

Is the isPhysRegUsed check really necessary? It's going to break if you have multiple WWM sections

test/CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll