This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] added writelane intrinsic
ClosedPublic

Authored by tpr on Feb 2 2018, 2:47 AM.

Details

Summary

For use by LLPC SPV_AMD_shader_ballot extension.

The v_writelane instruction was already implemented for use by SGPR
spilling, but I had to add an extra dummy operand tied to the
destination, to represent that all lanes except the selected one keep
the old value of the destination register.

.ll test changes were due to schedule changes caused by that new
operand.

Change-Id: I26fcb56acb60cf3a9cce63cc673f704fad24cde2

Diff Detail

Repository
rL LLVM

Event Timeline

tpr created this revision.Feb 2 2018, 2:47 AM
dstuttard added inline comments.Feb 2 2018, 4:04 AM
include/llvm/IR/IntrinsicsAMDGPU.td
748 ↗(On Diff #132555)

It is by no means consistent, but the usual way to document the intrinsic arguments is different.
See e.g. int_amdgcn_exp for an example of this

lib/Target/AMDGPU/SIRegisterInfo.cpp
748 ↗(On Diff #132555)

I notice that you've left this as the pseudo whereas the original lowered to an actual MCOpcode - I guess it's more consistent to leave that final lowering until later.

arsenm added inline comments.Feb 2 2018, 8:44 AM
include/llvm/IR/IntrinsicsAMDGPU.td
751 ↗(On Diff #132555)

It would make senses for this to be a mangled type so you could use f32/f16/i16/v2i16/v2f16, however this was already a mistake with readlane so maybe it should be consistent.

lib/Target/AMDGPU/SIInstrInfo.cpp
3155–3178 ↗(On Diff #132555)

All of this code should be unnecessary. The instruction is already broken if it gets here. VGPRs are not allowed in src0/src1, and required for src2. The selection pattern should already be respecting these constraints. Since you're using a straightforward tablegen pattern, there shouldn't be an issue. The main purpose of the legalizeOperands* functions is to handle legalizing operand constraints that require looking at the entire instruction, and not just a single operand (i.e. the constant bus restriction)

lib/Target/AMDGPU/SIRegisterInfo.cpp
742–746 ↗(On Diff #132555)

This is broken. You shouldn't need to do this. The kill flags need to be correct. The VGPRs used for SGPR spilling are hackily reserved. Depending on where this is called from it's possible that the way that is done is broken. I think bad things could still happen from this undef if the verifier is unhappy.

748 ↗(On Diff #132555)

There is a reason for this, but I don't remember what it was exactly. I think there was some related change between SI and VI. I think you should leave this as a separate change. I know I've tried to fix this before and hit some issue.

lib/Target/AMDGPU/VOP2Instructions.td
328 ↗(On Diff #132555)

This isn't a src2, so should be named otherwise to indicate it's tied (like vdst_in or something)

tpr added inline comments.Feb 5 2018, 1:15 AM
lib/Target/AMDGPU/SIRegisterInfo.cpp
742–746 ↗(On Diff #132555)

So what do you think I should do? What bad things could happen? Post-RA scheduling going wrong?

748 ↗(On Diff #132555)

Yes; gfx6 and gfx7 have vop2 and gfx8 has vop3. But I saw that my new use of writelane does not lower the pseudo until later, and my testing with this change (with llpc only, not other users of the backend) passed. But good idea to make it a separate change.

tpr added inline comments.Feb 6 2018, 5:50 AM
lib/Target/AMDGPU/SIInstrInfo.cpp
3155–3178 ↗(On Diff #132555)

Hi Matt. I tried removing this code, and the verifier falls over due to src1 being a vgpr_32 in the test I am trying. Also I see that readlane has analogous special case code here in legalizeOperandsVOP2.

tpr added inline comments.Feb 6 2018, 5:57 AM
lib/Target/AMDGPU/SIRegisterInfo.cpp
742–746 ↗(On Diff #132555)

Also it's not any more broken than it was before my change, when the destination register is not mentioned as an input at all. Fixing any brokenness of sgpr spilling should not be part of this change. I'll change the comment to reflect suspected brokenness.

arsenm added inline comments.Feb 6 2018, 7:49 AM
lib/Target/AMDGPU/SIInstrInfo.cpp
3155–3178 ↗(On Diff #132555)

The src2 verification shouldn't be necessary.

The readlane handling is surprising to me, I would expect we would need to handle this case by waterfalling

lib/Target/AMDGPU/SIRegisterInfo.cpp
742–746 ↗(On Diff #132555)

My main concern would be the register scavenger finding this as an available register to use

tpr updated this revision to Diff 133017.Feb 6 2018, 8:32 AM

V2: Addressed some review comments.

tpr updated this revision to Diff 133019.Feb 6 2018, 8:36 AM
tpr marked an inline comment as done.

V3: Addressed review comment about not needing to check src2 in
legalizeOperandsVop2.

tpr marked 4 inline comments as done.Feb 6 2018, 8:37 AM
tpr added a comment.Feb 13 2018, 3:50 AM

Ping. Can I land this on the basis that vgpr allocation for sgpr spilling is no more broken than it was before?

I think there should be two more tests: one where the vdst_in comes from a uniform value in an SGPR (e.g. a shader where it comes from an inreg) and one where it's a constant.

lib/Target/AMDGPU/SIInstrInfo.cpp
3155–3178 ↗(On Diff #132555)

For context, the GLSL readlane intrinsic is defined to require a uniform lane selection. So, like other intrinsics which have the same requirement, not doing automatic waterfalling is the right thing to do here.

lib/Target/AMDGPU/SIRegisterInfo.cpp
748 ↗(On Diff #132555)

What is actually breaking without the RegState::Undef?

When we allocate a VGPR for spilling, we add it as LiveIn to all basic blocks, and the V_READLANE_B32 generated for restoring a spilled SGPR should never set the VGPR to killed. So I don't see where the machine verifier would complain. An example would be good.

tpr added inline comments.Feb 26 2018, 4:25 AM
lib/Target/AMDGPU/SIRegisterInfo.cpp
748 ↗(On Diff #132555)

Just re-checked, and I think the problem (shown by test/CodeGen/AMDGPU/byval-frame-setup.ll) is when it happens to pick a callee saves register for spilling sgprs into. After inserting the writelanes for sgpr spills, it runs the prologue/epilogue insertion pass, which adds a store of the callee saves register to stack frame, with a kill.

Any ideas on a good way to fix that? Anything less hacky than scanning the first basic block to find the first writelane and mark just that one as undef?

tpr updated this revision to Diff 135922.Feb 26 2018, 10:11 AM

V4: Mark the input undef only on the first sgpr spill in the first basic

block for that vgpr. Added the suggested two tests.

Thank you for taking another look. I understand now, and I agree that there doesn't seem to be a super clean solution. Not to mention that the function prolog/epilog code here is clearly incorrect, because it is saving/restoring the VGPR with a potentially insufficient EXEC mask.

Anyway, those issues are unrelated, and I think your approach to solving it is reasonable now. I'd just ask you to inline that one helper function as mentioned in the comment, apart from that the change looks good to me.

lib/Target/AMDGPU/SIMachineFunctionInfo.h
656–664 ↗(On Diff #135922)

I don't like this function. The name is misleading: it suggests a pure function ("is XXX"), but in reality there is a side effect; and the list of parameters is pretty surprising given the name as well.

Please just inline the function (including its comments) in the one single location where it is used, I think that will be much cleaner.

tpr marked an inline comment as done.Feb 28 2018, 9:17 AM
tpr added inline comments.
lib/Target/AMDGPU/SIMachineFunctionInfo.h
656–664 ↗(On Diff #135922)

Will inline before landing.

This revision was not accepted when it landed; it landed in state Needs Review.Feb 28 2018, 11:14 AM
This revision was automatically updated to reflect the committed changes.
tpr marked an inline comment as done.