This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Pre-allocate WWM registers to reduce VGPR pressure.
ClosedPublic

Authored by sheredom on Mar 13 2019, 6:27 AM.

Details

Summary

This change incorporates an effort by Connor Abbott to change how we deal with WWM operations that can potentially trash valid values in inactive lanes.

Previously, the SIFixWWMLiveness pass would work out which registers were being trashed within WWM regions, and ensure that the register allocator did not keep any values it depended on resident in those registers if the WWM section would trash them. This worked perfectly well, but could sometimes cause severe register pressure when the WWM section resided before divergent control flow (or at least that is where I mostly observed it).

This fix instead runs through the WWM sections and pre-allocates some registers for WWM. It then reserves these registers so that the register allocator cannot use them. This results in a significant register saving on some WWM shaders I'm working with (130 -> 104 VGPRs, with just this change!).
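For context, the kind of IR sequence this pass is concerned with looks roughly like the following (a made-up snippet for illustration, not taken from the patch or its tests):

  ; set.inactive fills the inactive lanes with an identity value, the DPP op
  ; then runs across the whole wave, and wwm marks where the whole-wave result
  ; is read back in exact mode.
  declare i32 @llvm.amdgcn.set.inactive.i32(i32, i32)
  declare i32 @llvm.amdgcn.update.dpp.i32(i32, i32, i32, i32, i32, i1)
  declare i32 @llvm.amdgcn.wwm.i32(i32)

  define amdgpu_cs void @wwm_example(i32 %x, i32 addrspace(1)* %out) {
    %inactive = call i32 @llvm.amdgcn.set.inactive.i32(i32 %x, i32 0)
    %dpp = call i32 @llvm.amdgcn.update.dpp.i32(i32 0, i32 %inactive, i32 273, i32 15, i32 15, i1 false)
    %wwm = call i32 @llvm.amdgcn.wwm.i32(i32 %dpp)
    store i32 %wwm, i32 addrspace(1)* %out
    ret void
  }

Any VGPR written inside such a section ends up with junk in the lanes that were inactive before entering WWM, which is why the allocator must not place unrelated values that live across the section into those registers.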

This change was entirely thought up by @cwabbott (I claim no credit for the original idea!), but I had some time to look at it and so we agreed that I could give it a final polish for submission.

Diff Detail

Event Timeline

sheredom created this revision.Mar 13 2019, 6:27 AM
Herald added a project: Restricted Project.Mar 13 2019, 6:27 AM

I really don't like introducing new, dynamically reserved registers for this. It's going to introduce hell for dealing with any kind of ABI, and reserved registers are generally a bad idea. There's also nothing guaranteeing there are any free registers available to reserve, since you are just grabbing totally unused ones. This is going to just hit some variant of the problem I've been working on solving for handling SGPR->VGPR spills. Can WWM code be moved into a bundle or something?

include/llvm/IR/IntrinsicsAMDGPU.td
1363–1365 ↗(On Diff #190402)

This is a separate fix that can be split into its own patch

lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
53

You can remove this

223–228

I don't like this hardcoded opcode check. Why is S_OR_SAVEEXEC_B64 special?

I really don't like introducing new, dynamically reserved registers for this. It's going to introduce hell for dealing with any kind of ABI, and reserved registers are generally a bad idea. There's also nothing guaranteeing there are any free registers available to reserve, since you are just grabbing totally unused ones. This is going to just hit some variant of the problem I've been working on solving for handling SGPR->VGPR spills. Can WWM code be moved into a bundle or something?

No, since the problem the current pass and this new pass are trying to solve affects register allocation for code that is arbitrarily far away from the original WWM sequence. For a more detailed explanation, I can't do any better than the comment at the start of SIFixWWMLiveness.cpp. (There will also be problems if RA decides to split a live interval inside a WWM sequence, which can be fixed by bundling it, but that's a completely different problem).

While it might seem dangerous, in practice this works out, since WWM sequences that the frontends currently emit only require a few registers. Hence this pass is guaranteed to succeed, even if there's very high register pressure. If allocating the registers needed for the WWM sequence fails and RA decides to spill something inside there, then you're pretty much toast anyways since the same invalid-lane-clobbering concerns would reappear.

One idea would be to add a way to tell RA that a certain live range absolutely cannot be split (and probably boost its priority as well, lest we fail to allocate it), pre-allocate one or more of these unsplittable registers for WWM, make every definition in the WWM sequence a partial definition, and add fake definitions of the WWM registers in the closest block with uniform control flow that dominates the WWM sequence in order to prevent definitions whose invalid lanes could be clobbered from using the WWM registers. This gives RA a little more flexibility and means that potentially some other operations could use the WWM registers, but you still basically wind up preallocating them.

One idea would be to add a way to tell RA that a certain live range absolutely cannot be split (and probably boost its priority as well, lest we fail to allocate it), pre-allocate one or more of these unsplittable registers for WWM, make every definition in the WWM sequence a partial definition, and add fake definitions of the WWM registers in the closest block with uniform control flow that dominates the WWM sequence in order to prevent definitions whose invalid lanes could be clobbered from using the WWM registers. This gives RA a little more flexibility and means that potentially some other operations could use the WWM registers, but you still basically wind up preallocating them.

Declaring a certain live range unsplittable is impossible. LiveIntervals sort of supports it, but not all the allocators use it. In particular, FastRegAlloc doesn't really track liveness and spills all values live out of a block. It's important that this works correctly without LiveIntervals. If you can express the constraints with some series of uses and defs, that would be preferable.

My main concerns are making sure this works with:

  1. -O0/fastregalloc
  2. Presence of calls
  3. Inline asm or any other physical register constraints
test/CodeGen/AMDGPU/wwm-reserved.ll
3

Needs a -O0 run line
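For instance (the triple, CPU, and check prefix here are placeholders, not necessarily what the test uses):

  ; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx900 -O0 -verify-machineinstrs < %s | FileCheck -check-prefix=GFX9-O0 %s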

58

Needs some cases with control flow
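A rough shape for such a case (a hypothetical function, not the test that was eventually added) would put the WWM sequence inside a divergent branch, so that values live across the branch have to stay out of the registers the WWM code clobbers:

  declare i32 @llvm.amdgcn.workitem.id.x()
  declare i32 @llvm.amdgcn.set.inactive.i32(i32, i32)
  declare i32 @llvm.amdgcn.wwm.i32(i32)

  define amdgpu_cs void @wwm_in_branch(i32 %x, i32 addrspace(1)* %out) {
  entry:
    %tid = call i32 @llvm.amdgcn.workitem.id.x()
    %cond = icmp eq i32 %tid, 0
    br i1 %cond, label %if, label %merge

  if:
    %inactive = call i32 @llvm.amdgcn.set.inactive.i32(i32 %x, i32 0)
    %wwm = call i32 @llvm.amdgcn.wwm.i32(i32 %inactive)
    br label %merge

  merge:
    %r = phi i32 [ %wwm, %if ], [ %x, %entry ]
    store i32 %r, i32 addrspace(1)* %out
    ret void
  }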

If you can express the constraints with some series of uses and defs, that would be preferable.

Unfortunately that's just not possible. The best thing we could come up with that does this was the current SIFixWWMLiveness pass, and even with subsequent modifications, it's still pretty terrible in practice. We just can't get decent performance without doing something like this hack.

Actually, now that I think about it, I believe we realized that SIFixWWMLiveness has a giant hole: if any of the extra live ranges we insert are split, it'll fall over. I don't think anyone has come up with a way to express the constraints only with extra defs and uses in a way that always works, and I'm not sure it's possible. The issue is that we're lying to LLVM RA by pretending that vector instructions always fully clobber their destinations. Before, we were careful never to write to any inactive channels in order to keep up the charade, but WWM instructions force us to deal with it somehow. Fully informing LLVM of what's going on would involve marking every vector instruction as partially clobbering its destination, even the move and load/store instructions LLVM emits during RA, which would tank performance unless LLVM is taught about predicated liveness -- and that's a whole lot of work that opens another can of worms (register pressure is suddenly not that meaningful anymore...).


cwabbott added inline comments.Mar 13 2019, 12:17 PM
lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
223–228

IIRC I wrote this since WholeQuadMode currently just emits an S_OR_SAVEEXEC_B64 foo, -1 in order to start WWM, so you have to check for it manually in order to know when WWM starts. Maybe it's better to go back and add an ENTER_WWM pseudoinstruction like the pre-existing EXIT_WWM.

sheredom updated this revision to Diff 191074.Mar 18 2019, 6:23 AM

Addressed all review comments.

sheredom marked 6 inline comments as done.Mar 18 2019, 6:43 AM
sheredom added inline comments.
lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
223–228

I've added an ENTER_WWM pseudo, as suggested!

test/CodeGen/AMDGPU/wwm-reserved.ll
58

Added a CFG and a called function case.

sheredom updated this revision to Diff 191077.Mar 18 2019, 6:44 AM
sheredom marked 2 inline comments as done.

Remove the explicit pass name.

sheredom marked an inline comment as done.Mar 18 2019, 6:44 AM
arsenm added inline comments.Mar 18 2019, 7:33 AM
test/CodeGen/AMDGPU/wwm-reserved.ll
13

Do these tests really need all of this? Can you just have a few sample instructions?

sheredom marked an inline comment as done.Mar 18 2019, 7:36 AM
sheredom added inline comments.
test/CodeGen/AMDGPU/wwm-reserved.ll
13

I can probably cut it down to just a single DPP inst in each WWM section - good idea!

sheredom updated this revision to Diff 191099.Mar 18 2019, 8:37 AM

Reduce the number of DPP calls in the test for cleanliness, and reintroduce convergent on WWM. The CFG test exposes the bug caused by the lack of convergent on WWM: without it, LLVM will sink the WWM statement out of the branch, which totally messes up all the calculations.
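For reference, in IR terms this just means the intrinsic carries the convergent attribute, which is what forbids moving the call across the divergent branch (the exact attribute set below is approximate):

  declare i32 @llvm.amdgcn.wwm.i32(i32) #0
  attributes #0 = { convergent nounwind readnone }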

sheredom marked an inline comment as done.Mar 18 2019, 8:38 AM
sheredom marked 2 inline comments as done.
sheredom added inline comments.
include/llvm/IR/IntrinsicsAMDGPU.td
1363–1365 ↗(On Diff #190402)

So I tried to remove this (forgetting why I needed it), and LLVM sinks the WWM out of the branch, which totally messes up the WWM calculation. So this is actually a requirement for the patch, not a separate thing.

sheredom marked an inline comment as done.Mar 18 2019, 9:18 AM
arsenm added inline comments.Mar 18 2019, 9:46 AM
include/llvm/IR/IntrinsicsAMDGPU.td
1363–1365 ↗(On Diff #190402)

You can commit that first then

sheredom marked 2 inline comments as done.Mar 19 2019, 1:45 AM
sheredom added inline comments.
include/llvm/IR/IntrinsicsAMDGPU.td
1363–1365 ↗(On Diff #190402)
sheredom marked 2 inline comments as done.Mar 19 2019, 10:14 AM
sheredom added inline comments.
include/llvm/IR/IntrinsicsAMDGPU.td
1363–1365 ↗(On Diff #190402)

I've submitted that other change, so this should be good to go once I get a sign off!

arsenm added inline comments.Mar 20 2019, 11:34 AM
lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
155

Can't you just hardcode this to v_mov_b32?

157

MRI->reg_instructions()? It would also save the check that the operand matches later

161

.isCopy()

184–185

I know we have a helper to do this somewhere

arsenm added inline comments.Mar 20 2019, 11:48 AM
lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
155

I'm confused about the constraints for the WWM intrinsic, or the lack thereof. The WWM instruction just uses "unknown" and the intrinsic allows any type. Can this be a 64-bit register or greater?

sheredom marked 3 inline comments as done.Mar 21 2019, 2:11 AM
sheredom added inline comments.
lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
155

We only use WWM with 32-bit & 64-bit types in our stack - nothing else.

184–185

Couldn't find it myself in the code.

sheredom updated this revision to Diff 191646.Mar 21 2019, 2:11 AM
sheredom marked an inline comment as done.

Fix review comments.

sheredom marked 3 inline comments as done.Mar 21 2019, 2:13 AM
sheredom updated this revision to Diff 191865.Mar 22 2019, 5:48 AM

Update for two reasons:

  • The set-inactive check was missing V_SET_INACTIVE_B64, causing miscompiles with double/int64 values (see the sketch below).
  • Changed where the movs are injected to tie them explicitly to WWM statements rather than to all copies within a WWM section (this produces better codegen and more closely matches the intent of the WWM intrinsic in the first place).
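As a rough illustration of the first point (a made-up kernel, not the actual regression test), 64-bit values select to V_SET_INACTIVE_B64, so the scan has to recognize both widths:

  declare i64 @llvm.amdgcn.set.inactive.i64(i64, i64)
  declare i64 @llvm.amdgcn.wwm.i64(i64)

  define amdgpu_cs void @wwm_i64(i64 %x, i64 addrspace(1)* %out) {
    %inactive = call i64 @llvm.amdgcn.set.inactive.i64(i64 %x, i64 0)
    %wwm = call i64 @llvm.amdgcn.wwm.i64(i64 %inactive)
    store i64 %wwm, i64 addrspace(1)* %out
    ret void
  }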
sheredom marked an inline comment as done.Mar 22 2019, 5:49 AM
sheredom updated this revision to Diff 192842.Mar 29 2019, 8:59 AM

Found a fun little bug whereby the physical VGPRs were being coalesced onto previous instructions, and then shouldClusterMemOps was assuming only virtual registers. Added a workaround for that.

tpr accepted this revision.Mar 29 2019, 1:51 PM

LGTM modulo the wrong license on the new file.

lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
6

Wrong license.

This revision is now accepted and ready to land.Mar 29 2019, 1:51 PM
arsenm requested changes to this revision.Apr 1 2019, 8:15 AM

Can you add a test where allocation is impossible? e.g. use

call void asm sideeffect "",
  "~{v0},~{v1},~{v2},~{v3},~{v4},~{v5},~{v6},~{v7},~{v8},~{v9}
  ,~{v10},~{v11},~{v12},~{v13},~{v14},~{v15},~{v16},~{v17},~{v18},~{v19}
  ,~{v20},~{v21},~{v22},~{v23},~{v24},~{v25},~{v26},~{v27},~{v28},~{v29}
  ,~{v30},~{v31},~{v32},~{v33},~{v34},~{v35},~{v36},~{v37},~{v38},~{v39}
  ,~{v40},~{v41},~{v42},~{v43},~{v44},~{v45},~{v46},~{v47},~{v48},~{v49}
  ,~{v50},~{v51},~{v52},~{v53},~{v54},~{v55},~{v56},~{v57},~{v58},~{v59}
  ,~{v60},~{v61},~{v62},~{v63},~{v64},~{v65},~{v66},~{v67},~{v68},~{v69}
  ,~{v70},~{v71},~{v72},~{v73},~{v74},~{v75},~{v76},~{v77},~{v78},~{v79}
  ,~{v80},~{v81},~{v82},~{v83},~{v84},~{v85},~{v86},~{v87},~{v88},~{v89}
  ,~{v90},~{v91},~{v92},~{v93},~{v94},~{v95},~{v96},~{v97},~{v98},~{v99}
  ,~{v100},~{v101},~{v102},~{v103},~{v104},~{v105},~{v106},~{v107},~{v108},~{v109}
  ,~{v110},~{v111},~{v112},~{v113},~{v114},~{v115},~{v116},~{v117},~{v118},~{v119}
  ,~{v120},~{v121},~{v122},~{v123},~{v124},~{v125},~{v126},~{v127},~{v128},~{v129}
  ,~{v130},~{v131},~{v132},~{v133},~{v134},~{v135},~{v136},~{v137},~{v138},~{v139}
  ,~{v140},~{v141},~{v142},~{v143},~{v144},~{v145},~{v146},~{v147},~{v148},~{v149}
  ,~{v150},~{v151},~{v152},~{v153},~{v154},~{v155},~{v156},~{v157},~{v158},~{v159}
  ,~{v160},~{v161},~{v162},~{v163},~{v164},~{v165},~{v166},~{v167},~{v168},~{v169}
  ,~{v170},~{v171},~{v172},~{v173},~{v174},~{v175},~{v176},~{v177},~{v178},~{v179}
  ,~{v180},~{v181},~{v182},~{v183},~{v184},~{v185},~{v186},~{v187},~{v188},~{v189}
  ,~{v190},~{v191},~{v192},~{v193},~{v194},~{v195},~{v196},~{v197},~{v198},~{v199}
  ,~{v200},~{v201},~{v202},~{v203},~{v204},~{v205},~{v206},~{v207},~{v208},~{v209}
  ,~{v210},~{v211},~{v212},~{v213},~{v214},~{v215},~{v216},~{v217},~{v218},~{v219}
  ,~{v220},~{v221},~{v222},~{v223},~{v224},~{v225},~{v226},~{v227},~{v228},~{v229}
  ,~{v230},~{v231},~{v232},~{v233},~{v234},~{v235},~{v236},~{v237},~{v238},~{v239}
  ,~{v240},~{v241},~{v242},~{v243},~{v244},~{v245},~{v246},~{v247},~{v248},~{v249}
  ,~{v250},~{v251},~{v252},~{v253},~{v254},~{v255}"() #1
This revision now requires changes to proceed.Apr 1 2019, 8:15 AM
This revision was not accepted when it landed; it landed in state Needs Revision.Apr 1 2019, 8:19 AM
This revision was automatically updated to reflect the committed changes.
arsenm added inline comments.Apr 1 2019, 8:19 AM
lib/Target/AMDGPU/SIInstrInfo.cpp
454–458

This is a separate change

lib/Target/AMDGPU/SIMachineFunctionInfo.h
210

This should be serialized into YAML. Maybe this should also be in terms of RegUnits?

lib/Target/AMDGPU/SIPreAllocateWWMRegs.cpp
104

Is the isPhysRegUsed check really necessary? It's going to break if you have multiple WWM sections

test/CodeGen/AMDGPU/atomic_optimizations_struct_buffer.ll