This is an archive of the discontinued LLVM Phabricator instance.

[Polly] [PPCGCodeGeneration] [WIP] Do not consider writes to killable scalars as scalar stores
AbandonedPublic

Authored by bollu on Jul 17 2017, 8:22 AM.

Details

Summary
  • If a scalar can be killed, then the order of writes into it does not matter, since we don't read it later anyway. So, do not track these as scalar stores
  • TODO 1. How do I test this? there's no direct way to see if this works or not. Perhaps I could observe that there is no br i1 false at the RTC check. However, that is too indirect for my taste.
  • TODO 2. Refactor the SAI check into a separate function, since it is used in two places now.

Event Timeline

bollu created this revision.Jul 17 2017, 8:22 AM
bollu retitled this revision from [Polly] [PPCGCodeGeneration] [WIP] Do not consider writes to killable scalars as scalar stores - If a scalar can be killed, then the order of writes into it does not matter, since we don't read it later anyway. So, do not track these as scalar... to [Polly] [PPCGCodeGeneration] [WIP] Do not consider writes to killable scalars as scalar stores.Jul 17 2017, 8:49 AM
bollu edited the summary of this revision. (Show Details)

@grosser - does the change make sense to you? I'll clean it once I know that this is a sane change.

grosser edited edge metadata.Jul 17 2017, 8:57 AM

Thanks for pushing forward here. One thing I would like to check if the scalars that are private are passed into the kernel or not. AFAIU ppcg is not passing such "private" scalars to the kernel, but Polly-ACC still does this today. Can you check if/what is going on here?

Best,
Tobias

bollu added a comment.Jul 18 2017, 1:46 AM

From what I can tell, PPCG does seem to pass "private" scalars:

test.c
void checkPrivatization(int A[], int B[], int C[], const int control) {
int x = 0;
#pragma scop
    for(int i = 0; i < 1000; i++) {

ifcond:        if (control != 0)
              xinc: x = C[i];

xuse:        B[i] = x * A[i];
    }
#pragma endscop
}
testbed_kernel.cu
/* PPCG output */
#include "testbed_kernel.hu"
__global__ void kernel0(int *A, int *B, int *C, int *x, int control)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    __shared__ int shared_x;

    {
      if (control == 0 && t0 == 0)
        shared_x = *x;
      __syncthreads();
      if ((control <= -1 && 32 * b0 + t0 <= 999) || (control >= 1 && 32 * b0 + t0 <= 999))
        shared_x = C[32 * b0 + t0];
      if (32 * b0 + t0 <= 999)
        B[32 * b0 + t0] = (shared_x * A[32 * b0 + t0]);
      __syncthreads();
    }
}

I would have expected it to not take x as a parameter, but it seems to do so.
Are you referring to some other behaviour I am unaware of?

bollu updated this revision to Diff 107035.Jul 18 2017, 2:32 AM
  • Cleanup code.
bollu added a comment.Jul 18 2017, 3:49 AM

@grosser - with this change, we break the test test/GPGPU/invariant-load-hoisting-with-failing-scop.ll which gets triggered when we have a BuildSuccessful = 0 with invariant load hoisting. However, the only way to trigger this is either

  1. screw up verifyModule

or

  1. have a test case where storedScalar is true.

I'm not able to generate an example after this patch such that storedScalar = true.

Any ideas for this?

Seems inconsistent, I get for the following input:

float X;                                                                         
void f(float A[]) {                                                              
#pragma scop                                                                     
  for (long i = 0; i < 1024; i++) {                                              
    X = i * i;                                                                   
    A[i] = X;                                                                    
  }                                                                              
 X = 100;                                                                        
#pragma endscop                                                                  
}

the following output

__global__ void kernel0(float *A)
{
    int b0 = blockIdx.x;
    int t0 = threadIdx.x;
    float private_X;

    {
      private_X = ((32 * b0 + t0) * (32 * b0 + t0));
      A[32 * b0 + t0] = private_X;
    }
}
__global__ void kernel1(float *X)
{

    X[0] = 100;
}
bollu updated this revision to Diff 107055.Jul 18 2017, 4:12 AM
  • [WIP] previous commmit makes no sense, that checks if we care about the value at the *end* of a scop, not in between loop iterations
bollu abandoned this revision.Aug 21 2017, 5:14 AM

Abandoned because rL311259 contains changes that this patch was supposed to perform.