This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Promote array alloca if used by memmove/memcpy
ClosedPublic

Authored by ruiling on Dec 22 2022, 6:16 PM.

Diff Detail

Event Timeline

ruiling created this revision.Dec 22 2022, 6:16 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 22 2022, 6:16 PM
ruiling requested review of this revision.Dec 22 2022, 6:16 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 22 2022, 6:16 PM

Shouldn't these have been expanded into loads and stores already?

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
508

Extra ;

Shouldn't these have been expanded into loads and stores already?

We only split mem transfer intrinsics using a loop if their length is > 1024.

Shouldn't these have been expanded into loads and stores already?

We only split mem transfer intrinsics using a loop if their length is > 1024.

I think InstCombine and/or SROA do some too. Also that threshold is arbitrary, and we could start expanding the small ones in IR too

Shouldn't these have been expanded into loads and stores already?

We only split mem transfer intrinsics using a loop if their length is > 1024.

I think InstCombine and/or SROA do some too. Also that threshold is arbitrary, and we could start expanding the small ones in IR too

If I understand correctly, InstCombine focuses on simplifying mem transfer intrinsics rather than expanding them, and only works in very limited cases.

In fact, the change here shares a similar idea with SROA: it analyzes whether the alloca involved in a memmove/memcpy can be promoted to vector/scalar operations. If it is promotable, then we expand the mem transfer intrinsic as part of the promotion process. I would say the important thing here is to get optimal code generation for small allocas; expanding small mem transfers is not that important. Does this sound reasonable to you? Or could you share how you would like us to expand small mem transfer intrinsics?

In the long term, I would hope we can enhance SROA to cover the optimization we have done in tryPromoteAllocaToVector(), but that needs much more time/effort.
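The element-wise expansion ruiling describes (rewriting a small, constant-length mem transfer as individual loads/stores rather than a call or byte loop) can be sketched, outside of LLVM, roughly as follows. The function name and shape are illustrative only, not the actual pass code:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: a small, constant-length memcpy over a promoted
// alloca rewritten as per-element loads/stores. For a small constant
// NumElts the pass would fully unroll this into individual (vector)
// load/store operations on the promoted value.
void expandSmallCopy(uint32_t *Dst, const uint32_t *Src, size_t NumElts) {
  for (size_t I = 0; I != NumElts; ++I)
    Dst[I] = Src[I]; // one load + one store per element
}
```

The actual pass works on the promoted vector value rather than on raw memory, but the unrolling idea is the same.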

ruiling added inline comments.Jan 8 2023, 5:46 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
508

I guess you might have misread here? I think we always put a semicolon at the end of a lambda expression.

arsenm added inline comments.Jan 8 2023, 5:54 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
502

[=]?

594–596

probably should dyn_cast to MemIntrinsicInst here to defend against these becoming out of sync

612

No reason to care about typed pointers anymore

llvm/test/CodeGen/AMDGPU/promote-alloca-array-aggregate.ll
219

Needs tests for the different address space cases you mentioned in the todo

219

also memcpy_inline

ruiling updated this revision to Diff 487253.Jan 8 2023, 5:59 PM

rebase and ping.

ruiling added inline comments.Jan 8 2023, 6:32 PM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
612

Our graphics compiler has not fully switched to opaque pointers yet. @foad am I right?

ruiling updated this revision to Diff 487298.Jan 8 2023, 11:26 PM

address review comments.

Thanks for the careful review @arsenm, I have fixed most of them. Please help take a second look. Thanks!

llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
502

We are capturing GEPVectorIdx and Alloca, so I think capture by reference is preferred over capture by value here. Please correct me if I am wrong.
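The `[&]` vs `[=]` distinction being discussed can be shown with a minimal standalone example (names hypothetical, not the pass code): a by-reference capture sees later updates to the enclosing variable, while a by-value capture snapshots it at lambda creation time, and copying a large structure like GEPVectorIdx by value would also be needlessly expensive:

```cpp
// Contrast [&] (by reference) with [=] (by value) capture.
int demoCapture() {
  int Counter = 0;
  auto ByRef = [&]() { return ++Counter; };   // sees and mutates Counter
  auto ByVal = [=]() { return Counter + 1; }; // snapshot: Counter == 0 here
  ByRef();                 // Counter becomes 1
  int FromRef = ByRef();   // Counter becomes 2, returns 2
  int FromVal = ByVal();   // still based on the snapshot, returns 1
  return FromRef * 10 + FromVal; // 21
}
```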

594–596

Agreed, I have changed it to cast to catch the possibly broken case.
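The difference between the two LLVM casting idioms under discussion can be mimicked with a toy hierarchy (this is plain C++ using `dynamic_cast`, not LLVM's casting machinery): `dyn_cast<>` returns null on a type mismatch, so callers must check it, while `cast<>` asserts the type is correct, documenting an invariant and catching it if the surrounding switch ever goes out of sync:

```cpp
#include <cassert>

// Toy stand-ins for IR classes; names are illustrative only.
struct Inst { virtual ~Inst() = default; };
struct MemTransfer : Inst { int Length = 0; };

// dyn_cast-style: returns nullptr on mismatch, caller must check.
MemTransfer *dynCastMemTransfer(Inst *I) {
  return dynamic_cast<MemTransfer *>(I);
}

// cast-style: assumes the type is already known, asserts in debug builds.
MemTransfer *castMemTransfer(Inst *I) {
  MemTransfer *MT = dynamic_cast<MemTransfer *>(I);
  assert(MT && "castMemTransfer used on the wrong instruction kind");
  return MT;
}
```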

llvm/test/CodeGen/AMDGPU/promote-alloca-array-aggregate.ll
219

Good idea, I have added all of them.

foad added inline comments.Jan 9 2023, 2:32 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
544

This can be cast instead of dyn_cast.

612

Yes.

ruiling updated this revision to Diff 487604.Jan 9 2023, 4:48 PM

use cast instead of dyn_cast.

ruiling updated this revision to Diff 487687.Jan 9 2023, 10:30 PM

Use alignment from alloca.

arsenm accepted this revision.Jan 10 2023, 5:56 AM
arsenm added inline comments.
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
594–596

This still has the switch over specific IDs instead of the matching cast

612

Well, that needs to be fixed soon since code is going to start getting ripped out very shortly

llvm/test/CodeGen/AMDGPU/promote-alloca-array-aggregate.ll
307

Another edge case to maybe test is the identity copy of the full alloca to itself

This revision is now accepted and ready to land.Jan 10 2023, 5:56 AM
foad added inline comments.Jan 10 2023, 6:00 AM
llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
612

We have top engineers working on it! https://github.com/GPUOpen-Drivers/llpc/pull/2210

ruiling updated this revision to Diff 487783.Jan 10 2023, 6:33 AM

address review comments.

arsenm accepted this revision.Jan 10 2023, 6:38 AM
ruiling marked 2 inline comments as done.Jan 10 2023, 6:39 AM
ruiling added inline comments.
llvm/test/CodeGen/AMDGPU/promote-alloca-array-aggregate.ll
307

I have added a case for it, but I would not bother optimizing for it, as such code patterns are trivially optimized away early.

This revision was automatically updated to reflect the committed changes.
ruiling marked an inline comment as done.