This is an archive of the discontinued LLVM Phabricator instance.

[WIP] [Polly] [PPCGCodeGeneration + PPCG] [3/3] Collect changes to PPCGCodeGen because of PPCG upgrade.
ClosedPublic

Authored by bollu on Jul 20 2017, 4:37 AM.

Details

Summary
  • PPCG changed parts of its API, so update PPCGCodeGeneration to adapt to the changes.

  1. PPCG now uses isl_multi_pw_aff instead of an array of pw_aff. This requires us to adjust how we index and how we construct array bounds.
  2. PPCG introduces two new kinds of nodes: init_device and clear_device. We should investigate what the correct way to handle these is.
  3. PPCG has gotten smarter with its use of live range reordering, so some of the tests have a qualitative improvement.
  4. PPCG changed its output style, so many test cases need to be updated to fit the new style for polly-acc-dump-code checks.

Event Timeline

bollu created this revision.Jul 20 2017, 4:37 AM
bollu retitled this revision from [Polly] [PPCGCodeGeneration + PPCG] [2/3] Collect changes to PPCGCodeGen because of PPCG upgrade. to [Polly] [PPCGCodeGeneration + PPCG] [3/3] Collect changes to PPCGCodeGen because of PPCG upgrade..Jul 20 2017, 4:37 AM

@singam-sanjay: We should move the discussion about init / clear device here.

bollu retitled this revision from [Polly] [PPCGCodeGeneration + PPCG] [3/3] Collect changes to PPCGCodeGen because of PPCG upgrade. to [WIP] [Polly] [PPCGCodeGeneration + PPCG] [3/3] Collect changes to PPCGCodeGen because of PPCG upgrade..Jul 20 2017, 4:38 AM

Marked as WIP because the test cases need another look, and the init_device/clear_device issue is still being discussed.

bollu updated this revision to Diff 107488.Jul 20 2017, 5:23 AM
  • Use init_device and clear_device nodes in the schedule tree.
bollu added a comment.EditedJul 20 2017, 5:45 AM

I compiled the program from non-read-only-scalars.ll and ran it on daint (Piz Daint).

program.c
#include <stdio.h>

float foo(float A[]) {
    float sum = 0;

    for (long i = 0; i < 32; i++)
        SetA: A[i] = i;

    for (long i = 0; i < 32; i++)
        IncA: A[i] += i;

    for (long i = 0; i < 32; i++)
        IncSum: sum += A[i];

RetSum:
    return sum;
}


int main() {
    float A[32];
    float sum = foo(A);
    printf("=== PROGRAM ===\n");
    printf("sum: %f\n", sum);
    printf("=== END PROGRAM ===\n");
    return 0;
}
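
As a quick sanity check of the expected value (a hypothetical Python sketch, not part of the revision), the three labeled loops can be mirrored directly; after SetA and IncA each element holds 2*i, so the accumulated sum is 2 * (31 * 32 / 2) = 992:

```python
# Mirror the loops in foo(): SetA stores i, IncA adds i, IncSum accumulates.
A = [0.0] * 32

for i in range(32):       # SetA
    A[i] = float(i)

for i in range(32):       # IncA
    A[i] += i

total = sum(A)            # IncSum
print(f"sum: {total:f}")  # prints "sum: 992.000000"
```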

The output looks correct. It should be:

output-calc
(sum_{i=0}^{31} i) * 2 =
(31 * 32 / 2) * 2 =
496 * 2 =
992

The output from the run on daint:

daint-run
POLLY_DEBUG=1 srun -n 1 -Cgpu --partition=debug nvprof ./program.out
srun: job 2452889 queued and waiting for resources

srun: job 2452889 has been allocated resources
-> polly_initContext
-> initContextCUDA
==28137== NVPROF is profiling process 28137, command: ./program.out
> Running on GPU device 0 : Tesla P100-PCIE-16GB.
-> polly_allocateMemoryForDevice
-> allocateMemoryForDeviceCUDA
-> polly_allocateMemoryForDevice
-> allocateMemoryForDeviceCUDA
-> polly_allocateMemoryForDevice
-> allocateMemoryForDeviceCUDA
-> polly_copyFromHostToDevice
-> copyFromHostToDeviceCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getKernel
-> getKernelCUDA
CUDA Link Completed in 0.000000ms. Linker Output:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'FUNC_foo_SCOP_0_KERNEL_0' for 'sm_60'
ptxas info    : Function properties for FUNC_foo_SCOP_0_KERNEL_0
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 5 registers, 328 bytes cmem[0]
info    : 0 bytes gmem
info    : Function properties for 'FUNC_foo_SCOP_0_KERNEL_0':
info    : used 5 registers, 0 stack, 0 bytes smem, 328 bytes cmem[0], 0 bytes lmem
-> polly_launchKernel
-> launchKernelCUDA
-> polly_freeKernel
-> freeKernelCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getKernel
-> getKernelCUDA
CUDA Link Completed in 0.000000ms. Linker Output:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'FUNC_foo_SCOP_0_KERNEL_1' for 'sm_60'
ptxas info    : Function properties for FUNC_foo_SCOP_0_KERNEL_1
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 38 registers, 344 bytes cmem[0]
info    : 0 bytes gmem
info    : Function properties for 'FUNC_foo_SCOP_0_KERNEL_1':
info    : used 38 registers, 0 stack, 0 bytes smem, 344 bytes cmem[0], 0 bytes lmem
-> polly_launchKernel
-> launchKernelCUDA
-> polly_freeKernel
-> freeKernelCUDA
-> polly_copyFromDeviceToHost
-> copyFromDeviceToHostCUDA
-> polly_copyFromDeviceToHost
-> copyFromDeviceToHostCUDA
-> polly_freeDeviceMemory
-> freeDeviceMemoryCUDA
-> polly_freeDeviceMemory
-> freeDeviceMemoryCUDA
-> polly_freeDeviceMemory
-> freeDeviceMemoryCUDA
-> polly_freeContext
=== PROGRAM ===
sum: 992.000000
=== END PROGRAM ===
==28137== Profiling application: ./program.out -o run-output.txt
==28137== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 31.62%  2.5600us         1  2.5600us  2.5600us  2.5600us  FUNC_foo_SCOP_0_KERNEL_0
 28.85%  2.3360us         1  2.3360us  2.3360us  2.3360us  FUNC_foo_SCOP_0_KERNEL_1
 23.32%  1.8880us         2     944ns     672ns  1.2160us  [CUDA memcpy DtoH]
 16.21%  1.3120us         1  1.3120us  1.3120us  1.3120us  [CUDA memcpy HtoD]

==28137== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 96.16%  269.63ms         1  269.63ms  269.63ms  269.63ms  cuCtxCreate
  3.44%  9.6323ms         2  4.8162ms  4.3765ms  5.2558ms  cuLinkAddData
  0.20%  566.56us         3  188.85us  4.1630us  557.11us  cuMemAlloc
  0.07%  194.61us         2  97.302us  94.214us  100.39us  cuLinkComplete
  0.05%  134.44us         2  67.218us  66.802us  67.634us  cuModuleLoadData
  0.04%  105.59us         3  35.196us  5.4070us  91.471us  cuMemFree
  0.01%  39.579us         2  19.789us  17.197us  22.382us  cuLinkCreate
  0.01%  35.306us         2  17.653us  12.669us  22.637us  cuLaunchKernel
  0.01%  30.260us         2  15.130us  11.406us  18.854us  cuMemcpyDtoH
  0.01%  17.904us         1  17.904us  17.904us  17.904us  cuDeviceGetName
  0.01%  15.630us         1  15.630us  15.630us  15.630us  cuMemcpyHtoD
  0.00%  2.7000us         2  1.3500us     338ns  2.3620us  cuLinkDestroy
  0.00%  2.2280us         3     742ns     170ns  1.6410us  cuDeviceGetCount
  0.00%  1.2820us         2     641ns     550ns     732ns  cuModuleGetFunction
  0.00%     915ns         4     228ns     161ns     303ns  cuDeviceGetAttribute
  0.00%     873ns         3     291ns     179ns     475ns  cuDeviceGet
  0.00%     289ns         1     289ns     289ns     289ns  cuDeviceComputeCapability

Specifically, notice:

=== PROGRAM ===
sum: 992.000000
=== END PROGRAM ===

Clearly, we do call the GPU version of the code (I set polly-acc-mincompute to 0).

So, it appears that the test case works.


I've attached the input .ll and the output .ll here in case anyone wants to take a look.

bollu added a comment.Jul 20 2017, 6:31 AM

Just to be very sure, I took the .ll file from the test case and compiled it with the exact commands that we run in the test case.

We detect 3 kernels, unlike when we compile from the C file where we detect 2 kernels. I suspect this is because the test file was not run through polly-canonicalize. In any case, we still generate the correct output.

Trace of the run pasted below (notice that we have 3 kernel launches now).

15:29 $ make build-ll && make run
rm *.optimised.ll
rm *.out
rm *.bench
rm: cannot remove '*.bench': No such file or directory
makefile:30: recipe for target 'clean' failed
make: [clean] Error 1 (ignored)
rm *.s
/users/siddhart/llvm-install/bin/opt -S  -polly-process-unprofitable  -polly-codegen-ppcg \
	-polly-acc-mincompute=0 program.ll -o program.optimised.ll -polly-acc-dump-code
Code
====
# host
{
#define cudaCheckReturn(ret) \
  do { \
    cudaError_t cudaCheckReturn_e = (ret); \
    if (cudaCheckReturn_e != cudaSuccess) { \
      fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(cudaCheckReturn_e)); \
      fflush(stderr); \
    } \
    assert(cudaCheckReturn_e == cudaSuccess); \
  } while(0)
#define cudaCheckKernel() \
  do { \
    cudaCheckReturn(cudaGetLastError()); \
  } while(0)

  float *dev_MemRef0;
  float *dev_MemRef1__phi;
  float *dev_MemRef2;

  cudaCheckReturn(cudaMalloc((void **) &dev_MemRef0, (32) * sizeof(float)));
  cudaCheckReturn(cudaMalloc((void **) &dev_MemRef1__phi, sizeof(float)));
  cudaCheckReturn(cudaMalloc((void **) &dev_MemRef2, sizeof(float)));

  {
    dim3 k0_dimBlock(32);
    dim3 k0_dimGrid(1);
    kernel0 <<<k0_dimGrid, k0_dimBlock>>> (dev_MemRef0);
    cudaCheckKernel();
  }

  {
    dim3 k1_dimBlock;
    dim3 k1_dimGrid;
    kernel1 <<<k1_dimGrid, k1_dimBlock>>> (dev_MemRef1__phi);
    cudaCheckKernel();
  }

  {
    dim3 k2_dimBlock;
    dim3 k2_dimGrid;
    kernel2 <<<k2_dimGrid, k2_dimBlock>>> (dev_MemRef0, dev_MemRef1__phi, dev_MemRef2);
    cudaCheckKernel();
  }

  cudaCheckReturn(cudaMemcpy(MemRef0, dev_MemRef0, (32) * sizeof(float), cudaMemcpyDeviceToHost));
  cudaCheckReturn(cudaMemcpy(&MemRef2, dev_MemRef2, sizeof(float), cudaMemcpyDeviceToHost));
  cudaCheckReturn(cudaFree(dev_MemRef0));
  cudaCheckReturn(cudaFree(dev_MemRef1__phi));
  cudaCheckReturn(cudaFree(dev_MemRef2));
}

# kernel0
{
  Stmt1(t0);
  Stmt5(t0);
}

# kernel1
Stmt7();

# kernel2
for (int c0 = 0; c0 <= 32; c0 += 1) {
  Stmt8(c0);
  if (c0 <= 31)
    Stmt10(c0);
}

/users/siddhart/llvm-install/bin/llc program.optimised.ll -o program.s
/users/siddhart/llvm-install/bin/clang program.s  -lcudart -lGPURuntime -ldl -lOpenCL -lgfortran -lstdc++ -o program.out -L/opt/nvidia/cudatoolkit8.0/8.0.54_2.2.8_ga620558-2.1/lib64/
running program.out... on debug queue
POLLY_DEBUG=1 srun -n 1 -Cgpu --partition=debug nvprof ./program.out
srun: job 2454404 queued and waiting for resources
srun: job 2454404 has been allocated resources
-> polly_initContext
-> initContextCUDA
==23755== NVPROF is profiling process 23755, command: ./program.out
> Running on GPU device 0 : Tesla P100-PCIE-16GB.
-> polly_allocateMemoryForDevice
-> allocateMemoryForDeviceCUDA
-> polly_allocateMemoryForDevice
-> allocateMemoryForDeviceCUDA
-> polly_allocateMemoryForDevice
-> allocateMemoryForDeviceCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getKernel
-> getKernelCUDA
CUDA Link Completed in 0.000000ms. Linker Output:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'FUNC_foo_SCOP_0_KERNEL_0' for 'sm_60'
ptxas info    : Function properties for FUNC_foo_SCOP_0_KERNEL_0
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 5 registers, 328 bytes cmem[0]
info    : 0 bytes gmem
info    : Function properties for 'FUNC_foo_SCOP_0_KERNEL_0':
info    : used 5 registers, 0 stack, 0 bytes smem, 328 bytes cmem[0], 0 bytes lmem
-> polly_launchKernel
-> launchKernelCUDA
-> polly_freeKernel
-> freeKernelCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getKernel
-> getKernelCUDA
CUDA Link Completed in 0.000000ms. Linker Output:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'FUNC_foo_SCOP_0_KERNEL_1' for 'sm_60'
ptxas info    : Function properties for FUNC_foo_SCOP_0_KERNEL_1
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 4 registers, 328 bytes cmem[0]
info    : 0 bytes gmem
info    : Function properties for 'FUNC_foo_SCOP_0_KERNEL_1':
info    : used 4 registers, 0 stack, 0 bytes smem, 328 bytes cmem[0], 0 bytes lmem
-> polly_launchKernel
-> launchKernelCUDA
-> polly_freeKernel
-> freeKernelCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getDevicePtr
-> getDevicePtrCUDA
-> polly_getKernel
-> getKernelCUDA
CUDA Link Completed in 0.000000ms. Linker Output:
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function 'FUNC_foo_SCOP_0_KERNEL_2' for 'sm_60'
ptxas info    : Function properties for FUNC_foo_SCOP_0_KERNEL_2
ptxas         .     0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info    : Used 38 registers, 344 bytes cmem[0]
info    : 0 bytes gmem
info    : Function properties for 'FUNC_foo_SCOP_0_KERNEL_2':
info    : used 38 registers, 0 stack, 0 bytes smem, 344 bytes cmem[0], 0 bytes lmem
-> polly_launchKernel
-> launchKernelCUDA
-> polly_freeKernel
-> freeKernelCUDA
-> polly_copyFromDeviceToHost
-> copyFromDeviceToHostCUDA
-> polly_copyFromDeviceToHost
-> copyFromDeviceToHostCUDA
-> polly_freeDeviceMemory
-> freeDeviceMemoryCUDA
-> polly_freeDeviceMemory
-> freeDeviceMemoryCUDA
-> polly_freeDeviceMemory
-> freeDeviceMemoryCUDA
-> polly_freeContext
992.000000
==23755== Profiling application: ./program.out
==23755== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 30.87%  3.0720us         1  3.0720us  3.0720us  3.0720us  FUNC_foo_SCOP_0_KERNEL_0
 27.01%  2.6880us         1  2.6880us  2.6880us  2.6880us  FUNC_foo_SCOP_0_KERNEL_2
 22.83%  2.2720us         1  2.2720us  2.2720us  2.2720us  FUNC_foo_SCOP_0_KERNEL_1
 19.29%  1.9200us         2     960ns     704ns  1.2160us  [CUDA memcpy DtoH]

==23755== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 94.59%  264.02ms         1  264.02ms  264.02ms  264.02ms  cuCtxCreate
  4.91%  13.699ms         3  4.5665ms  4.1479ms  5.1304ms  cuLinkAddData
  0.22%  604.48us         3  201.49us  4.1270us  594.81us  cuMemAlloc
  0.10%  284.06us         3  94.687us  92.083us  97.582us  cuLinkComplete
  0.09%  245.11us         3  81.703us  79.328us  83.090us  cuModuleLoadData
  0.04%  107.86us         3  35.952us  5.8870us  92.425us  cuMemFree
  0.02%  60.425us         3  20.141us  18.714us  22.737us  cuLinkCreate
  0.02%  51.373us         3  17.124us  13.505us  23.857us  cuLaunchKernel
  0.01%  32.797us         2  16.398us  11.642us  21.155us  cuMemcpyDtoH
  0.01%  17.782us         1  17.782us  17.782us  17.782us  cuDeviceGetName
  0.00%  3.2160us         3  1.0720us     426ns  2.2140us  cuLinkDestroy
  0.00%  2.4330us         3     811ns     233ns  1.7410us  cuDeviceGetCount
  0.00%  1.9920us         3     664ns     602ns     749ns  cuModuleGetFunction
  0.00%     926ns         3     308ns     168ns     535ns  cuDeviceGet
  0.00%     864ns         4     216ns     157ns     273ns  cuDeviceGetAttribute
  0.00%     282ns         1     282ns     282ns     282ns  cuDeviceComputeCapability

Specifically,

  1. Correct output:
992.000000
==23755== Profiling application: ./program.out
  2. Three kernel launches:
30.87%  3.0720us         1  3.0720us  3.0720us  3.0720us  FUNC_foo_SCOP_0_KERNEL_0
27.01%  2.6880us         1  2.6880us  2.6880us  2.6880us  FUNC_foo_SCOP_0_KERNEL_2
22.83%  2.2720us         1  2.2720us  2.2720us  2.2720us  FUNC_foo_SCOP_0_KERNEL_1

I'm now confident that the changes to non-read-only-scalars work.

grosser accepted this revision.Jul 20 2017, 7:15 AM
grosser added inline comments.
lib/CodeGen/PPCGCodeGeneration.cpp
62

Remove option. Put your name in contributors, if you want it in the source code.

1165

Please leave the preloading outside of the PPCG AST printing. The invariant loads should be initialized even before the runtime check is built.

2305–2306

OK.

2321

Why do we add empty lines?

2543

Start with an uppercase letter.

2628

??

2781

Drop that!

3104

??

test/GPGPU/mostly-sequential.ll
24

Drop that.

This revision is now accepted and ready to land.Jul 20 2017, 7:15 AM
bollu updated this revision to Diff 107517.Jul 20 2017, 8:08 AM
  • Update test case of non-read-only-scalars.
  • Move preloadInvariantLoads to before the RTC is generated.
  • [NFC] Style fixes.
bollu updated this revision to Diff 107529.Jul 20 2017, 8:34 AM
  • Diff against latest version of [2/3] of these changes.
bollu updated this revision to Diff 107530.Jul 20 2017, 8:39 AM
  • Fix nits.
bollu closed this revision.Jul 20 2017, 8:50 AM

Closed by commit r308625.