This is an archive of the discontinued LLVM Phabricator instance.

[OpenMP] Replace GPU globalization calls with shared memory in the middle-end
ClosedPublic

Authored by jhuber6 on Mar 2 2021, 4:34 PM.

Details

Summary

The changes introduced in D97680 create a simpler interface to code that needs to be globalized. This interface is used to simplify the globalization calls in the middle end. We can check any globalization call that is only called by a single thread in the team and replace it with a static shared memory buffer.

Depends on D102700 D97680

Diff Detail

Event Timeline

jhuber6 created this revision.Mar 2 2021, 4:34 PM
jhuber6 requested review of this revision.Mar 2 2021, 4:34 PM

some initial comments.

llvm/lib/Transforms/IPO/OpenMPOpt.cpp
1033
1036

No cast needed

1041

prefer early exit, if (!...) return false;

1044

No need for it to be a constant

1048

Don't assume an order. Check all users, one should be a free, others can be whatever. If you find bitcast users, remember the type, if they all agree, use that for the alloca.

1733

no need to go over the free calls. they need to be users of the alloc and we remove them with the alloc.

1740

No need, use *CI below.

We can reuse Attributor HeapToStack for this :)

jhuber6 updated this revision to Diff 330353.Mar 12 2021, 1:24 PM
jhuber6 retitled this revision from [OpenMP] Remove GPU globalization calls in the middle-end to [OpenMP] Replace GPU globalization calls with shared memory in the middle-end.
jhuber6 edited the summary of this revision. (Show Details)

Changing this optimization to replace the globalization calls with shared memory. Removing them will be done by the attributor using HeapToStack once we add the allocation calls and improve the attributor.

jdoerfert added inline comments.Mar 17 2021, 9:05 AM
llvm/lib/Transforms/IPO/OpenMPOpt.cpp
68

I don't think we need this after all.

1010

For now, check isKernel(F) and only do this for kernel functions. Later we can be more aggressive but for now that should limit it properly, also with regards to the lifetime of those allocations.

1068

You cannot recollect while looping.
Move the recollect for free after the loop, delete the other one. Return true from this function when it deleted the use, that will update the alloc use list.

1072

return true;

jhuber6 updated this revision to Diff 331375.Mar 17 2021, 2:21 PM
jhuber6 edited the summary of this revision. (Show Details)

Changing the test to simply check if we are in a non-SPMD kernel function. A more advanced approach can be used in the future.

jhuber6 updated this revision to Diff 331676.Mar 18 2021, 1:30 PM
jdoerfert accepted this revision.Mar 18 2021, 4:20 PM

LGTM, one nit.

llvm/lib/Transforms/IPO/OpenMPOpt.cpp
1073

Can you use a enum here or variable instead of "3".

This revision is now accepted and ready to land.Mar 18 2021, 4:20 PM
jhuber6 updated this revision to Diff 332837.Mar 23 2021, 5:48 PM

Making changes.

jhuber6 updated this revision to Diff 332842.Mar 23 2021, 6:27 PM

Adding enum for address space constant.

jhuber6 updated this revision to Diff 342721.May 4 2021, 7:10 AM

Changing the method for determining valid allocation calls to use AAExecutionDomain introduced in D101578. Currently, this doesn't work with the code Clang generates but will soon.

jhuber6 updated this revision to Diff 346208.May 18 2021, 10:14 AM
jhuber6 edited the summary of this revision. (Show Details)

Adding a pattern match check so this picks up on the code Clang currently generates. This is a temporary solution to improve performance until a full runtime rewrite can be landed.

jdoerfert added inline comments.May 18 2021, 11:29 AM
llvm/lib/Transforms/IPO/OpenMPOpt.cpp
1036

Nit: hoist this out of the lambda.

1061

You have to check that a unique free call was found or give up otherwise.

1078

You might be able to replace these two with just: ConstantExpr::getPointerCast

2467

Check the Pred as well. Maybe try to split this in multiple matches or somehow to make it easier to read.

jhuber6 updated this revision to Diff 346264.May 18 2021, 1:48 PM

Addressing comments.

I looked over this again because I think we will later move the logic into an AbstractAttribute. The reason is that we want to reuse some of the analysis in AAKernelInfo to determine if we can use SPMD mode. Basically, if the shared memory is only used in a nice way and fees into an actual parallel51, it is compatible with SPMD mode. If we go to SPMD mode we have (or better should) eliminate the shared memory indirection. Maybe we move all of this into AAHeapToStack and make a more general AAAllocationInfo out of it. Let's see.

That said, I noticed a problem below. Please add a test that misses the free, and/or that has a phi before the free, to make sure we do not globalize.

llvm/lib/Transforms/IPO/OpenMPOpt.cpp
1065

if (!FC) return false;

Also move the debug below the exists.

jhuber6 updated this revision to Diff 350293.Jun 7 2021, 7:57 AM
jhuber6 edited the summary of this revision. (Show Details)

Change the implementation to use the Attributor. This is because the attributor after registering HeapToStack in D102197 would change the call sites. Putting this pass inside an attributor instance allows both to run and now globalization will only be replaced if HeapToStack doesn't remove it first.

jhuber6 updated this revision to Diff 352533.Jun 16 2021, 1:18 PM

Changing alignment of new global variable, alignment of 8 caused errors in some applications.

This revision was landed with ongoing or failed builds.Jun 22 2021, 9:10 AM
This revision was automatically updated to reflect the committed changes.