This is an archive of the discontinued LLVM Phabricator instance.

[AMDGPU] Fix promote alloca with double use in a same insn
ClosedPublic

Authored by rampitec on Feb 9 2021, 4:54 PM.

Details

Summary

If we have an instruction where more than one pointer operands
are derived from the same promoted alloca, we are fixing it for
one argument and do not fix a second use considering this user
done.

Fix this by deferring processing of memory intrinsics until all
potential operands are replaced.

Fixes: SWDEV-271358

Diff Detail

Event Timeline

rampitec created this revision.Feb 9 2021, 4:54 PM
rampitec requested review of this revision.Feb 9 2021, 4:54 PM
Herald added a project: Restricted Project. · View Herald TranscriptFeb 9 2021, 4:54 PM
Herald added a subscriber: wdng. · View Herald Transcript
arsenm added inline comments.Feb 9 2021, 5:57 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
603–608

I don't see how this sorts it, the ordering is still determined by the arbitrary ordering in users. Can't the replacement just check all candidate operands

rampitec added inline comments.Feb 9 2021, 6:08 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
603–608

The function is recursive and does DFS. This code pushes any later use to the end.

Checking all uses at replacement is practically impossible because it may be a long use-def chain. You would need to do DFS again at every replacement.

arsenm added inline comments.Feb 9 2021, 6:38 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
603–608

This is a DFS on the user list, not the function. I do not think this will give you the guarantee that the first user dominates the second. But you also just need to ensure the transitive users are enqueued regardless. Would it be easier to just use a SetVector instead of is_contained and a regular worklist

llvm/test/CodeGen/AMDGPU/promote-alloca-mem-intrinsics.ll
68

I noticed this dropped dereferenceable, should maybe fix that

79

Can you also add tests with select and phi both derived from the same

rampitec planned changes to this revision.Feb 9 2021, 9:40 PM

I actually think this creates topological problem. If I move a use forward I also need to move the whole chain after it forward. So it doesn't seem to work unfortunately. Uses create trees and somewhere in these trees we might need a sort of barriers. Essentially all use chains have to come to an instruction useing them.

rampitec added inline comments.Feb 10 2021, 11:08 AM
llvm/test/CodeGen/AMDGPU/promote-alloca-mem-intrinsics.ll
79

We have these tests. One in promote-alloca-to-lds-select.ll @lds_promote_alloca_select_two_derived_pointers, one in the promote-alloca-to-lds-phi.ll @branch_ptr_var_same_alloca. This works because we only update operands not replacing the instruction. Problem with the memcpy is that we actually create a new call which is not going to be hit when we touch second operand.

One possible solution I am exploring is to postpone patching memory intrinsics until the end. Another is to patch the call in place.

rampitec updated this revision to Diff 322778.Feb 10 2021, 12:01 PM
rampitec marked 3 inline comments as done.

Changed to just defer processing memcpy and memove.
Restored dereferenceable attributes.

rampitec edited the summary of this revision. (Show Details)Feb 10 2021, 12:03 PM
arsenm added inline comments.Feb 10 2021, 12:39 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
1011

In principle this could happen for all multi-operand instructions, like select and icmp

1014

I'm not sure I understand why it really needs to be deferred. If we tracked the specific use, you could just replace the one operand and then encounter the instruction again for the second?

1025–1027

This is a separate patch (also more attributes still apply)

rampitec added inline comments.Feb 10 2021, 12:55 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
1011

We have tests for this. It works fine. Operands are just updated with replaceAllUsesWith(). With memcpy and memmove there are two issues:

  1. We need to replace called function with different mangling. We are only replacing it for one of the arguments. Second argument is also updated but we do not visit this intrinsic again.
  2. Even if we have these calls in the worklist twice that still would not work. Other instructions just get their operands updated, but here we are replacing call instruction itself with a new one. No pointer in the worklist would point to it.
1014

No, because instruction is dropped.

rampitec updated this revision to Diff 322804.Feb 10 2021, 12:57 PM
rampitec marked an inline comment as done.

Dropped memset attribute parts.

rampitec added a comment.EditedFeb 10 2021, 1:02 PM

Here is what happens w/o the patch:

call void @llvm.memcpy.p0i8.p3i8.i64(i8 addrspace(3)* align 8 %i, i8 addrspace(3)* align 8 %i1, i64 16, i1 false)

As you may see both operands are correctly updated. That is mangling of the memcpy is wrong (p0i8 for the first operand).

arsenm accepted this revision.Feb 11 2021, 11:23 AM
This revision is now accepted and ready to land.Feb 11 2021, 11:23 AM