This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Legalize VGPR Rsrc operands for MUBUF instructions
ClosedPublic

Authored by scott.linder on Aug 20 2018, 11:12 AM.

Details

Summary

Emit a waterfall loop in the general case for a potentially-divergent Rsrc operand. When possible, avoid this by using Addr64 instructions.

The old legalization assumed Addr64 was available, and only handled the _ADDR64 and _OFFSET variants of MUBUF instructions.

I am not sure what the best way to identify Addr64 MUBUF opcodes is, but since we need the InstrMapping in order to convert _OFFSET to _ADDR64 anyway I tried to reuse that mapping.

Diff Detail

Repository
rL LLVM

Event Timeline

scott.linder created this revision.Aug 20 2018, 11:12 AM

In this case I think we should have some pure-IR tests since we'll need to worry about this in GlobalISel as well.

You are missing some tests involving control flow that are likely to be broken. Some cases I would worry about:

  • Expansion outside of the entry block, with values live across block boundaries
  • Expansion of two adjacent operations that introduce control flow
  • Expansion in multiple blocks in the function

I think there were a few other problematic cases I ran into for the vector indexing waterfall

lib/Target/AMDGPU/SIInstrInfo.cpp
3481–3486 ↗(On Diff #161517)

I was hoping for this to a be a more general function that could be re-used for other operations in the future, and probably sharable with the vector indexing code.

I was thinking about something like a list of operands that need to be SGPRs, and iterating over all of the subregisters rather than assuming the SGPR128 + sometimes SGPR32 the buffer operations use

3490 ↗(On Diff #161517)

This is a bool

3546–3548 ↗(On Diff #161517)

Can you post a separate patch to fix the vector indexing loop not using this?

test/CodeGen/AMDGPU/mubuf-legalize-operands.mir
1 ↗(On Diff #161517)

I think this needs a -verify-machineinstrs

16–17 ↗(On Diff #161517)

If you're not going to write out the variable checks for these, you're probably better off just using update_mir_test_checks

I think with this we can also fix the broken definition for const-32bit address space of always giving a uniform result

scott.linder marked an inline comment as done.Aug 21 2018, 10:08 AM

I will add some IR tests for at least those cases.

I don't understand what you mean about the const-32bit addrspace; can you explain more?

lib/Target/AMDGPU/SIInstrInfo.cpp
3481–3486 ↗(On Diff #161517)

I can work on this; would you prefer I do that in this patch?

3490 ↗(On Diff #161517)

I thought isUndef was a bool and getUndefRegState returned flags.

3546–3548 ↗(On Diff #161517)

Yes, I was planning on doing that. Alternatively if I unify the two in this patch that will also handle it.

test/CodeGen/AMDGPU/mubuf-legalize-operands.mir
16–17 ↗(On Diff #161517)

I will add the variable checks here.

I will add some IR tests for at least those cases.

I don't understand what you mean about the const-32bit addrspace; can you explain more?

I think part of the shader assume uniform hack is we claim that loads from the 32-bit const address space are uniform, such that it makes it "ok" to use the single readlane. This was always a broken concept though. See isSDNodeAlwaysUniform handling of load, and the documentation

lib/Target/AMDGPU/SIInstrInfo.cpp
3481–3486 ↗(On Diff #161517)

That can be a second patch

3490 ↗(On Diff #161517)

Yes, you're right

Use variables for registers in MIR tests. Add IR tests, trying to cover problem cases around introducing control flow. There is another case in the indirect-addressing tests with an SGPR def between the two operations, but I'm not sure what it is testing. Are the tests I added useful, and are there still more you can think of to add?

I will start working on a patch with a generic waterfall emitter, and I will talk to the graphics devs about how to fix the readfirstlane shortcut and the addrspace hack.

I find it a bit concerning that we seem to have different semantics for these instructions depending on the environment.

Graphics really wants to legalize the instructions using a single v_readfirstlane (without the loop overhead), because that's what the relevant semantics are there. This patch doesn't change that, so it's not too bad, but please keep it in mind.

We should probably have some real executing tests that stress this in piglit or somewhere else.

A solution to the graphics semantics problem might be to use a readfirstlane intrinsic call as an input as an assertion the input is known uniform. We can then optimize out patterns that look like V_READFIRSTLANE (COPY SGPR to VGPR)

arsenm added inline comments.Aug 23 2018, 2:09 AM
test/CodeGen/AMDGPU/mubuf-legalize-operands.ll
2 ↗(On Diff #161793)

Probably should have an -O0 run line to make sure spills work correctly. Full check lines for every test for this is probably too exhausting for doing it manually, so maybe just a case with a cross block use

12 ↗(On Diff #161793)

What's this?

I think even in the simplest case my code is wrong. The $soffset operand to the BUFFER_LOAD remains killed but the def is not moved into the LoopBB. I am essentially transforming:

%soffset:sreg_32_xm0 = S_MOV_B32 0
BUFFER_LOAD ..., killed %soffset:sreg_32_xm0, ...

Into:

%soffset:sreg_32_xm0 = S_MOV_B32 0
bb1:
BUFFER_LOAD ..., killed %soffset:sreg_32_xm0, ...
S_CBRANCH bb1

Is it correct to just clear the killed flag on the operand? Should I do the same for the $vindex/$voffset/$vaddr operands?

test/CodeGen/AMDGPU/mubuf-legalize-operands.ll
12 ↗(On Diff #161793)

This is me trying to be sure the uses occurs after the defs, even though the order within each doesn't matter. My understanding is that without the divider the cmp's could appear before the readfirstlanes and because they are adjacent -DAG it would still match. The chances of it seem low, and maybe there is a better way than a dummy CHECK-NOT to prevent it?

You can just unconditionally remove kill flags. They are more or less deprecated, and fully recomputed later before RA. We do make use of them for a few things. Pre-RA they are only really useful for a few places that care about physical registers

arsenm added inline comments.Aug 24 2018, 3:10 AM
test/CodeGen/AMDGPU/mubuf-legalize-operands.ll
12 ↗(On Diff #161793)

I'm not sure that's how CHECK-NOT works, although there were some recent changes to try to clarify the behavior of CHECK-DAG I haven't looked into

Clear killed flag on uses, add test at -O0

arsenm added inline comments.Aug 28 2018, 5:18 AM
lib/Target/AMDGPU/SIInstrInfo.cpp
3571–3575 ↗(On Diff #162475)

Use MRI.clearKillFlags

Use MRI.clearKillFlags

arsenm accepted this revision.Sep 4 2018, 11:09 AM

LGTM

This revision is now accepted and ready to land.Sep 4 2018, 11:09 AM
This revision was automatically updated to reflect the committed changes.