This is an archive of the discontinued LLVM Phabricator instance.

[WIP] [Polly] [PPCGCodeGeneration + PPCG] [3/3] Collect changes to PPCGCodeGen because of PPCG upgrade.
ClosedPublic

Authored by bollu on Jul 20 2017, 4:37 AM.

Details

Summary
  • PPCG changed parts of it's API, so update PPCGCodeGeneration to adapt

to the changes.

  1. PPCG now uses isl_multi_pw_aff instead of an array of pw_aff. This needs us to adjust how we index array bounds and how we construct array bounds.
  1. PPCG introduces two new kinds of nodes: init_device and clear_device. We should investigate what the correct way to handle these are.
  1. PPCG has gotten smarter with its use of live range reordering, so some of the tests have a qualitative improvement.
  1. PPCG changed its output style, so many test cases need to be updated to fit the new style for polly-acc-dump-code checks.

Event Timeline

bollu created this revision.Jul 20 2017, 4:37 AM
bollu retitled this revision from [Polly] [PPCGCodeGeneration + PPCG] [2/3] Collect changes to PPCGCodeGen because of PPCG upgrade. to [Polly] [PPCGCodeGeneration + PPCG] [3/3] Collect changes to PPCGCodeGen because of PPCG upgrade..Jul 20 2017, 4:37 AM

@singam-sanjay: We should move the discussion about init / clear device here.

bollu retitled this revision from [Polly] [PPCGCodeGeneration + PPCG] [3/3] Collect changes to PPCGCodeGen because of PPCG upgrade. to [WIP] [Polly] [PPCGCodeGeneration + PPCG] [3/3] Collect changes to PPCGCodeGen because of PPCG upgrade..Jul 20 2017, 4:38 AM

marked as WIP because the test cases need another look, and the init/clear_device issue is still being discussed.

bollu updated this revision to Diff 107488.Jul 20 2017, 5:23 AM
  • Use init_device and clear_device nodes in the schedule tree.
bollu added a comment.EditedJul 20 2017, 5:45 AM

I compiled the program from non-read-only-scalars.ll and ran in on daint (Piz-daint).

program.c
#include <stdio.h>

float foo(float A[]) {
    float sum = 0;

    for (long i = 0; i < 32; i++)
        SetA:    A[i] = i;


    for (long i = 0; i < 32; i++)
        IncA:    A[i] += i;

    for (long i = 0; i < 32; i++)
        IncSum:    sum += A[i];

RetSum:  return sum;
}


int main() {
    float A[32];
    float sum = foo(A);
    printf("=== PROGRAM ===\n");
    printf("sum: %f\n", sum);
    printf("=== END PROGRAM ===\n");
    return 0;
}

The output looks correct. It should be:

output-calc
(sum_i=0^32 i) * 2 =
32 * 31 / 2 * 2 =
992

The output from the run on daint:

daint-run
POLLY_DEBUG=1 srun -n 1 -Cgpu --partition=debug nvprof ./program.out
srun: job 2452889 queued and waiting for resources

srun: job 2452889 has been allocated resources
-> polly_initContext
-> initContextCUDA
==28137== NVPROF is profiling process 28137, command: ./program.out
> Running on GPU device 0 : Tesla P100-PCIE-16GB.
-> polly_allocateMemoryForDevice
-> allocateMemoryForDeviceCUDA
-> polly_allocateMemoryForDevice
-> allocateMemoryForDeviceCUDA
-> polly_allocateMemoryForDevice
-> allocateMemoryForDeviceCUDA
-> polly_copyFromHostToDevice
-> copyFromHostToDeviceCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getKernel
-> getKernelCUDA
CUDA Link Completed in 0.000000ms. Linker Output:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'FUNC_foo_SCOP_0_KERNEL_0' for 'sm_60'
ptxas info    : Function properties for FUNC_foo_SCOP_0_KERNEL_0
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 5 registers, 328 bytes cmem[0]
info    : 0 bytes gmem
info    : Function properties for 'FUNC_foo_SCOP_0_KERNEL_0':
info    : used 5 registers, 0 stack, 0 bytes smem, 328 bytes cmem[0], 0 bytes lmem
-> polly_launchKernel
-> launchKernelCUDA
-> polly_freeKernel
-> freeKernelCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getKernel
-> getKernelCUDA
CUDA Link Completed in 0.000000ms. Linker Output:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'FUNC_foo_SCOP_0_KERNEL_1' for 'sm_60'
ptxas info    : Function properties for FUNC_foo_SCOP_0_KERNEL_1
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 38 registers, 344 bytes cmem[0]
info    : 0 bytes gmem
info    : Function properties for 'FUNC_foo_SCOP_0_KERNEL_1':
info    : used 38 registers, 0 stack, 0 bytes smem, 344 bytes cmem[0], 0 bytes lmem
-> polly_launchKernel
-> launchKernelCUDA
-> polly_freeKernel
-> freeKernelCUDA
-> polly_copyFromDeviceToHost
-> copyFromDeviceToHostCUDA
-> polly_copyFromDeviceToHost
-> copyFromDeviceToHostCUDA
-> polly_freeDeviceMemory
-> freeDeviceMemoryCUDA
-> polly_freeDeviceMemory
-> freeDeviceMemoryCUDA
-> polly_freeDeviceMemory
-> freeDeviceMemoryCUDA
-> polly_freeContext
=== PROGRAM ===
sum: 992.000000
=== END PROGRAM ===
==28137== Profiling application: ./program.out -o run-output.txt
==28137== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 31.62%  2.5600us         1  2.5600us  2.5600us  2.5600us  FUNC_foo_SCOP_0_KERNEL_0
 28.85%  2.3360us         1  2.3360us  2.3360us  2.3360us  FUNC_foo_SCOP_0_KERNEL_1
 23.32%  1.8880us         2     944ns     672ns  1.2160us  [CUDA memcpy DtoH]
 16.21%  1.3120us         1  1.3120us  1.3120us  1.3120us  [CUDA memcpy HtoD]

==28137== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 96.16%  269.63ms         1  269.63ms  269.63ms  269.63ms  cuCtxCreate
  3.44%  9.6323ms         2  4.8162ms  4.3765ms  5.2558ms  cuLinkAddData
  0.20%  566.56us         3  188.85us  4.1630us  557.11us  cuMemAlloc
  0.07%  194.61us         2  97.302us  94.214us  100.39us  cuLinkComplete
  0.05%  134.44us         2  67.218us  66.802us  67.634us  cuModuleLoadData
  0.04%  105.59us         3  35.196us  5.4070us  91.471us  cuMemFree
  0.01%  39.579us         2  19.789us  17.197us  22.382us  cuLinkCreate
  0.01%  35.306us         2  17.653us  12.669us  22.637us  cuLaunchKernel
  0.01%  30.260us         2  15.130us  11.406us  18.854us  cuMemcpyDtoH
  0.01%  17.904us         1  17.904us  17.904us  17.904us  cuDeviceGetName
  0.01%  15.630us         1  15.630us  15.630us  15.630us  cuMemcpyHtoD
  0.00%  2.7000us         2  1.3500us     338ns  2.3620us  cuLinkDestroy
  0.00%  2.2280us         3     742ns     170ns  1.6410us  cuDeviceGetCount
  0.00%  1.2820us         2     641ns     550ns     732ns  cuModuleGetFunction
  0.00%     915ns         4     228ns     161ns     303ns  cuDeviceGetAttribute
  0.00%     873ns         3     291ns     179ns     475ns  cuDeviceGet
  0.00%     289ns         1     289ns     289ns     289ns  cuDeviceComputeCapability

Specifically, notice:

=== PROGRAM ===
sum: 992.000000
=== END PROGRAM ===

clearly, we do call the GPU version of the code (I set polly-acc-mincompute to 0).

So, it appears that the test case works.


I've attached the input .ll and the output .ll here in case anyone wants to take a look.

bollu added a comment.Jul 20 2017, 6:31 AM

Just to be very sure, I took the .ll file from the testcase and compiled it with the exact commands that we run in the test case.

We detect 3 kernels, unlike when we compile from the C file where we detect 2 kernels. I suspect this is because the test file was not run through polly-canonicalize. In any case, we still generate the correct output.

Trace of the run pasted below (notice that we have 3 kernel launches now).

15:29 $ make build-ll && make run
rm *.optimised.ll
rm *.out
rm *.bench
rm: cannot remove '*.bench': No such file or directory
makefile:30: recipe for target 'clean' failed
make: [clean] Error 1 (ignored)
rm *.s
/users/siddhart/llvm-install/bin/opt -S  -polly-process-unprofitable  -polly-codegen-ppcg \
	-polly-acc-mincompute=0 program.ll -o program.optimised.ll -polly-acc-dump-code
Code
====
# host
{
#define cudaCheckReturn(ret) \
  do { \
    cudaError_t cudaCheckReturn_e = (ret); \
    if (cudaCheckReturn_e != cudaSuccess) { \
      fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(cudaCheckReturn_e)); \
      fflush(stderr); \
    } \
    assert(cudaCheckReturn_e == cudaSuccess); \
  } while(0)
#define cudaCheckKernel() \
  do { \
    cudaCheckReturn(cudaGetLastError()); \
  } while(0)

  float *dev_MemRef0;
  float *dev_MemRef1__phi;
  float *dev_MemRef2;

  cudaCheckReturn(cudaMalloc((void **) &dev_MemRef0, (32) * sizeof(float)));
  cudaCheckReturn(cudaMalloc((void **) &dev_MemRef1__phi, sizeof(float)));
  cudaCheckReturn(cudaMalloc((void **) &dev_MemRef2, sizeof(float)));

  {
    dim3 k0_dimBlock(32);
    dim3 k0_dimGrid(1);
    kernel0 <<<k0_dimGrid, k0_dimBlock>>> (dev_MemRef0);
    cudaCheckKernel();
  }

  {
    dim3 k1_dimBlock;
    dim3 k1_dimGrid;
    kernel1 <<<k1_dimGrid, k1_dimBlock>>> (dev_MemRef1__phi);
    cudaCheckKernel();
  }

  {
    dim3 k2_dimBlock;
    dim3 k2_dimGrid;
    kernel2 <<<k2_dimGrid, k2_dimBlock>>> (dev_MemRef0, dev_MemRef1__phi, dev_MemRef2);
    cudaCheckKernel();
  }

  cudaCheckReturn(cudaMemcpy(MemRef0, dev_MemRef0, (32) * sizeof(float), cudaMemcpyDeviceToHost));
  cudaCheckReturn(cudaMemcpy(&MemRef2, dev_MemRef2, sizeof(float), cudaMemcpyDeviceToHost));
  cudaCheckReturn(cudaFree(dev_MemRef0));
  cudaCheckReturn(cudaFree(dev_MemRef1__phi));
  cudaCheckReturn(cudaFree(dev_MemRef2));
}

# kernel0
{
  Stmt1(t0);
  Stmt5(t0);
}

# kernel1
Stmt7();

# kernel2
for (int c0 = 0; c0 <= 32; c0 += 1) {
  Stmt8(c0);
  if (c0 <= 31)
    Stmt10(c0);
}

/users/siddhart/llvm-install/bin/llc program.optimised.ll -o program.s
/users/siddhart/llvm-install/bin/clang program.s  -lcudart -lGPURuntime -ldl -lOpenCL -lgfortran -lstdc++ -o program.out -L/opt/nvidia/cudatoolkit8.0/8.0.54_2.2.8_ga620558-2.1/lib64/
running program.out... on debug queue
POLLY_DEBUG=1 srun -n 1 -Cgpu --partition=debug nvprof ./program.out
srun: job 2454404 queued and waiting for resources
srun: job 2454404 has been allocated resources
-> polly_initContext
-> initContextCUDA
==23755== NVPROF is profiling process 23755, command: ./program.out
> Running on GPU device 0 : Tesla P100-PCIE-16GB.
-> polly_allocateMemoryForDevice
-> allocateMemoryForDeviceCUDA
-> polly_allocateMemoryForDevice
-> allocateMemoryForDeviceCUDA
-> polly_allocateMemoryForDevice
-> allocateMemoryForDeviceCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getKernel
-> getKernelCUDA
CUDA Link Completed in 0.000000ms. Linker Output:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'FUNC_foo_SCOP_0_KERNEL_0' for 'sm_60'
ptxas info    : Function properties for FUNC_foo_SCOP_0_KERNEL_0
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 5 registers, 328 bytes cmem[0]
info    : 0 bytes gmem
info    : Function properties for 'FUNC_foo_SCOP_0_KERNEL_0':
info    : used 5 registers, 0 stack, 0 bytes smem, 328 bytes cmem[0], 0 bytes lmem
-> polly_launchKernel
-> launchKernelCUDA
-> polly_freeKernel
-> freeKernelCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getKernel
-> getKernelCUDA
CUDA Link Completed in 0.000000ms. Linker Output:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'FUNC_foo_SCOP_0_KERNEL_1' for 'sm_60'
ptxas info    : Function properties for FUNC_foo_SCOP_0_KERNEL_1
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 4 registers, 328 bytes cmem[0]
info    : 0 bytes gmem
info    : Function properties for 'FUNC_foo_SCOP_0_KERNEL_1':
info    : used 4 registers, 0 stack, 0 bytes smem, 328 bytes cmem[0], 0 bytes lmem
-> polly_launchKernel
-> launchKernelCUDA
-> polly_freeKernel
-> freeKernelCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getKernel
-> getKernelCUDA
CUDA Link Completed in 0.000000ms. Linker Output:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'FUNC_foo_SCOP_0_KERNEL_2' for 'sm_60'
ptxas info    : Function properties for FUNC_foo_SCOP_0_KERNEL_2
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 38 registers, 344 bytes cmem[0]
info    : 0 bytes gmem
info    : Function properties for 'FUNC_foo_SCOP_0_KERNEL_2':
info    : used 38 registers, 0 stack, 0 bytes smem, 344 bytes cmem[0], 0 bytes lmem
-> polly_launchKernel
-> launchKernelCUDA
-> polly_freeKernel
-> freeKernelCUDA
-> polly_copyFromDeviceToHost
-> copyFromDeviceToHostCUDA
-> polly_copyFromDeviceToHost
-> copyFromDeviceToHostCUDA
-> polly_freeDeviceMemory
-> freeDeviceMemoryCUDA
-> polly_freeDeviceMemory
-> freeDeviceMemoryCUDA
-> polly_freeDeviceMemory
-> freeDeviceMemoryCUDA
-> polly_freeContext
992.000000
==23755== Profiling application: ./program.out
==23755== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 30.87%  3.0720us         1  3.0720us  3.0720us  3.0720us  FUNC_foo_SCOP_0_KERNEL_0
 27.01%  2.6880us         1  2.6880us  2.6880us  2.6880us  FUNC_foo_SCOP_0_KERNEL_2
 22.83%  2.2720us         1  2.2720us  2.2720us  2.2720us  FUNC_foo_SCOP_0_KERNEL_1
 19.29%  1.9200us         2     960ns     704ns  1.2160us  [CUDA memcpy DtoH]

==23755== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 94.59%  264.02ms         1  264.02ms  264.02ms  264.02ms  cuCtxCreate
  4.91%  13.699ms         3  4.5665ms  4.1479ms  5.1304ms  cuLinkAddData
  0.22%  604.48us         3  201.49us  4.1270us  594.81us  cuMemAlloc
  0.10%  284.06us         3  94.687us  92.083us  97.582us  cuLinkComplete
  0.09%  245.11us         3  81.703us  79.328us  83.090us  cuModuleLoadData
  0.04%  107.86us         3  35.952us  5.8870us  92.425us  cuMemFree
  0.02%  60.425us         3  20.141us  18.714us  22.737us  cuLinkCreate
  0.02%  51.373us         3  17.124us  13.505us  23.857us  cuLaunchKernel
  0.01%  32.797us         2  16.398us  11.642us  21.155us  cuMemcpyDtoH
  0.01%  17.782us         1  17.782us  17.782us  17.782us  cuDeviceGetName
  0.00%  3.2160us         3  1.0720us     426ns  2.2140us  cuLinkDestroy
  0.00%  2.4330us         3     811ns     233ns  1.7410us  cuDeviceGetCount
  0.00%  1.9920us         3     664ns     602ns     749ns  cuModuleGetFunction
  0.00%     926ns         3     308ns     168ns     535ns  cuDeviceGet
  0.00%     864ns         4     216ns     157ns     273ns  cuDeviceGetAttribute
  0.00%     282ns         1     282ns     282ns     282ns  cuDeviceComputeCapability

Specifically,

  1. Correct output:
992.000000
==23755== Profiling application: ./program.out
  1. Three kernel launches:
30.87%  3.0720us         1  3.0720us  3.0720us  3.0720us  FUNC_foo_SCOP_0_KERNEL_0
27.01%  2.6880us         1  2.6880us  2.6880us  2.6880us  FUNC_foo_SCOP_0_KERNEL_2
22.83%  2.2720us         1  2.2720us  2.2720us  2.2720us  FUNC_foo_SCOP_0_KERNEL_1

I'm now confident that the changes to non-read-only-scalars works.

grosser accepted this revision.Jul 20 2017, 7:15 AM
grosser added inline comments.
lib/CodeGen/PPCGCodeGeneration.cpp
62

Remove option. Put your name in contributors, if you want it in the source code.

1171

Please leave the preloading outside of the PPCG ast printing. The invariant loads she be initialized even before the runtime check is built.

2312

OK.

2329

Why do we add empty lines.

2552

Start with uppercase letter.

2637

??

2790

Drop that!

3112

??

test/GPGPU/mostly-sequential.ll
24

Drop that.

This revision is now accepted and ready to land.Jul 20 2017, 7:15 AM
bollu updated this revision to Diff 107517.Jul 20 2017, 8:08 AM
  • update testcase of non-read-only-scalars
  • move preloadInvariantLoads to before the RTC is generated.
  • [NFC] style fixes
bollu updated this revision to Diff 107529.Jul 20 2017, 8:34 AM
  • Diff against latest version of [2/3] of these changes.
bollu updated this revision to Diff 107530.Jul 20 2017, 8:39 AM
  • Fix nits.
bollu closed this revision.Jul 20 2017, 8:50 AM

closed by r308625