This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Add pass to lower kernel arguments to loads
ClosedPublic

Authored by arsenm on Jun 25 2018, 2:16 AM.

Details

Summary

This replaces most argument uses with loads, but for
now not all.

The code in SelectionDAG for calling convention lowering
is actively harmful for amdgpu_kernel. It attempts to
split the argument types into register legal types, which
results in low quality code for arbitary types. Since
all kernel arguments are passed in memory, we just want the
raw types.

I've tried a couple of methods of mitigating this in SelectionDAG,
but it's easier to just bypass this problem alltogether. It's
possible to hack around the problem in the initial lowering,
but the real problem is the DAG then expects to be able to use
CopyToReg/CopyFromReg for uses of the arguments outside the block.

Exposing the argument loads in the IR also has the advantage
that the LoadStoreVectorizer can merge them.

I'm not sure the best approach to dealing with the IR
argument list is. The patch as-is just leaves the IR arguments
in place, so all the existing code will still compute the same
kernarg size and pointlessly lowers the arguments.

Arguably the frontend should emit kernels with an empty argument
list in the first place. Alternatively a dummy array could be
inserted as a single argument just to reserve space.

This does have some disadvantages. Local pointer kernel arguments can
no longer have AssertZext placed on them as the equivalent !range
metadata is not valid on pointer typed loads. This is mostly bad
for SI which needs to know about the known bits in order to use the
DS instruction offset, so in this case this is not done.

More importantly, this skips noalias arguments since this pass
does not yet convert this to the equivalent !alias.scope and !noalias
metadata. Producing this metadata correctly seems to be tricky,
although this logically is the same as inlining into a function which
doesn't exist. Additionally, exposing these loads to the vectorizer
may result in degraded aliasing information if a pointer load is
merged with another argument load.

I'm also not entirely sure this is preserving the current clover
ABI, although I would greatly prefer if it would stop widening
arguments and match the HSA ABI. As-is I think it is extending
< 4-byte arguments to 4-bytes but doesn't align them to 4-bytes.

Diff Detail

Event Timeline

arsenm created this revision.Jun 25 2018, 2:16 AM

In general I believe having distinct loads for kernel arguments is a right thing to do. Of course that would be preferable to keep only one mechanism and have no dual support like for noalis and SI LDS.
Couple notes though:

  1. I think we need to have a distinct addresspace for kernarg, not just constant.
  2. This would be handy to set names on the argument loads derived from either original argument names if present, or from an argument number. This patch introduces geps with immediate offsets which would render IR unreadable.
test/CodeGen/AMDGPU/extract_vector_elt-i16.ll
117

Why do you need all of that explicit padding in many tests?

In general I believe having distinct loads for kernel arguments is a right thing to do. Of course that would be preferable to keep only one mechanism and have no dual support like for noalis and SI LDS.
Couple notes though:

  1. I think we need to have a distinct addresspace for kernarg, not just constant.
  2. This would be handy to set names on the argument loads derived from either original argument names if present, or from an argument number. This patch introduces geps with immediate offsets which would render IR unreadable.
  1. I think there's no need for a distinct address space, and would just add unnecessary complexity for all the places that would now need to consider yet another address space. Ideally I would like to eventually eliminate even constant address space. We were already using just constant for the intrinsic
  2. I'm not sure what you mean. For debugging purposes the replaced value does use a derived name

Most of the benefit of noalias is probably attainable with the existing metadata, except for the vectorization problem (in which case something is probably needed to annotate inttoptr). The range problem also requires some kind of new metadata mechanism for marking the pointer ranges that may or may not be worth the effort.

Also, if this were to split argument loads into 4-byte pieces it could workaround the limitation of the vectorizer where it can't merge different sized types

test/CodeGen/AMDGPU/extract_vector_elt-i16.ll
117

It's enough that they can't be merged with unused gaps if the vectorizer supported that. Most tests are looking for loads of specific values, and it's harder / impossible to easily check the correct value with the merged load.

rampitec added inline comments.Jun 25 2018, 1:14 PM
lib/Target/AMDGPU/AMDGPULowerKernelArguments.cpp
165

I mean to give a name derived from the argument to this gep.

169

.. also this GEP.

arsenm updated this revision to Diff 152847.Jun 26 2018, 12:33 AM

Add more value names

rampitec accepted this revision.Jun 26 2018, 7:52 AM

LGTM.

This revision is now accepted and ready to land.Jun 26 2018, 7:52 AM
arsenm closed this revision.Jun 26 2018, 12:14 PM

r335650