This is an archive of the discontinued LLVM Phabricator instance.

AMDGPU: Support non-entry block static sized allocas
ClosedPublic

Authored by arsenm on May 27 2020, 8:41 AM.

Details

Summary

OpenMP emits these for some reason, so handle them. Assume these use
4096 bytes by default, with a flag to override this. Also change the
related stack assumption for calls to have a flag.

Diff Detail

Event Timeline

arsenm created this revision.May 27 2020, 8:41 AM
Herald added a project: Restricted Project. · View Herald Transcript

Do you happen to have the input for which OpenMP emitted them?

Can you add a tests showing a kernel along with resulting ScratchSize please?

arsenm updated this revision to Diff 266588.May 27 2020, 10:05 AM

Fix stack growth direction and scale amount by wavefront size

Can you add a tests showing a kernel along with resulting ScratchSize please?

I wasn't sure what to report for the size, so this just misses it entirely. The old code object has is_dynamic_callstack = 1, which I'm not sure actually did anything. I guess we could just pick a big number here like is already done for the external call case? I guess I could pick a smaller, large number?

Can you add a tests showing a kernel along with resulting ScratchSize please?

I wasn't sure what to report for the size, so this just misses it entirely. The old code object has is_dynamic_callstack = 1, which I'm not sure actually did anything. I guess we could just pick a big number here like is already done for the external call case? I guess I could pick a smaller, large number?

Probably yes. We need to allocate it somehow. A large number does not seem unreasonable unless until we have something better.

arsenm updated this revision to Diff 266620.May 27 2020, 12:22 PM
arsenm edited the summary of this revision. (Show Details)

Assume 4096 bytes for dynamic sized objects

This revision is now accepted and ready to land.May 27 2020, 12:50 PM
ronlieb added a subscriber: ronlieb.Jun 2 2020, 8:46 PM

Johannes: here is a reduce source test case, let me know what else you might need?

#include <stdio.h>

int main (void)
{

int ng =12;
int nxyz = 5000;
#pragma omp target teams distribute 
for (int gid = 0; gid < nxyz; gid++) {
  #pragma omp parallel for
  for (unsigned int g = 0; g < ng; g++) {
      int a = 0;
  }  
}
return 0;

}

Johannes: here is a reduce source test case, let me know what else you might need?

#include <stdio.h>

int main (void)
{

int ng =12;
int nxyz = 5000;
#pragma omp target teams distribute 
for (int gid = 0; gid < nxyz; gid++) {
  #pragma omp parallel for
  for (unsigned int g = 0; g < ng; g++) {
      int a = 0;
  }  
}
return 0;

}

@ronlieb Thx. This is a "bug" in OpenMPOpt we need to address there. I'll put it on my (mental) to do list.

Is this still happening, I might have fixed the IRBuilder.

arsenm added a comment.Aug 6 2020, 9:41 AM

Is this still happening, I might have fixed the IRBuilder.

I think so, plus I also just actually fixed the actual codegen in ec8c172d01eb14eba890f36205da0613dda7f742