Page MenuHomePhabricator

[CUDA, NVPTX] Pass byval aggregates directly
Changes PlannedPublic

Authored by tra on Jan 24 2022, 3:37 PM.

Details

Reviewers
jdoerfert
yaxunl
Summary

Changes the NVPTX ABI to pass aggregates directly. Only clang-generated IR is
affected. The change does not affect ABI on thechange function signatures in the
generated PTX

Discussion: https://llvm.discourse.group/t/nvptx-calling-convention-for-aggregate-arguments-passed-by-value

Currently NVPTX ABI passes aggregate values indirectly as a byval pointer. When
we need to pass a *value*, LLVM has to store it in an alloca, so it can have a
pointer to pass on. This is a double whammy for NVPTX. LLVM often fails to
eliminate that alloca (usually SROA considers such pointer as escaped and gives
up) and that is noticeable hit on performance. When we lower IR to PTX, the
argument is actually passed by copy, so we end up having to do more work just to
get the value loaded back from the alloca. So, we do more work for less
performance. Switching to passing aggregates directly allows us to generate
better code.

Diff Detail

Event Timeline

tra created this revision.Jan 24 2022, 3:37 PM
tra requested review of this revision.Jan 24 2022, 3:37 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 24 2022, 3:37 PM

The RFC discussion concluded this should be fine wrt. the interoperability use cases we want to support.
Code change looks good but I have one question.

clang/lib/CodeGen/TargetInfo.cpp
7183

Nit: Maybe a note that this effectively disables passing values via byval.

7193

When is this ever hit and should we not disable byval here too while we are at it?

tra planned changes to this revision.Jan 25 2022, 4:20 PM

Getting rid of byval helps getting rid of locals in quite a few places, but runs into a new problem. 😕

Looks like this change does have unexpected side-effects.
When we need to dynamically index into a struct passed directly, there's no easy way to do it as extractvalue only supports constant indices.
With byval aggregates LLVM uses GEP which does allow using dynamic indexing.
Alloca would only show up after nvptx-lower-args pass and that by that time IR would often be simple enough to eliminate that alloca.
Now, clang generates a local copy early on and, indexes into it dynamically with GEP... and then LLVM fails to eliminate the local copy because SROA fails to deal with dynamic indices and that in turn prevents IR optimizations that were possible without alloca.
https://github.com/llvm/llvm-project/issues/51734

That's rather unfortunate. This regression is serious enough to be a showstopper for my use case.

clang/lib/CodeGen/TargetInfo.cpp
7193

Basically it's saying "pass as byval pointer if it's an int that's larger than what we can lower".
Yes, I think passing such integer directly would make sense.

We may hit this if clang wants to pass __i128 (do larger int types exist in clang?).
I think (some of) this may be a leftover from the days when we didn't support i128 in CUDA/NVPTX. I think we do now.

@lebedev.ri wanted to teach SROA how to deal with dynamic indices before, IIRC. It seems to be generally useful. This patch can wait till then?

clang/lib/CodeGen/TargetInfo.cpp
7193

We have larger types, I somewhat doubt using them will work properly everywhere though.

tra added a comment.Jan 25 2022, 4:36 PM

@lebedev.ri wanted to teach SROA how to deal with dynamic indices before, IIRC. It seems to be generally useful.

Interesting. I'd like to hear more.

This patch can wait till then?

Yes.

@lebedev.ri wanted to teach SROA how to deal with dynamic indices before, IIRC. It seems to be generally useful.

Interesting. I'd like to hear more.

I guess i, too, would like to hear more about the problem.
My last idea was about allowing splitting

struct {
  int a;
  int b[2];
} a;

into

// not in a struct anymore!
int a;
int b[2]

But given just the int b[2]; i'm not sure what can be done.

This patch can wait till then?

Yes.

tra added a comment.EditedJan 26 2022, 11:34 AM

My last idea was about allowing splitting

struct {
  int a;
  int b[2];
} a;

into

// not in a struct anymore!
int a;
int b[2]

This looks like it's a somewhat different problem.

In my case this is what bites me: https://godbolt.org/z/417fMMn6c
It's a variant of this issue: https://github.com/llvm/llvm-project/issues/51734

I have a WIP patch that converts a GEP with a dynamic index with a known range of values into a series of comparisons and fixed-index GEPs. I guess I'll need to get it sorted out first.

gehre added a subscriber: gehre.Feb 2 2022, 11:47 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 14 2022, 4:11 AM