This is an archive of the discontinued LLVM Phabricator instance.

[WIP] [Polly] [ManagedMemoryRewrite] Rewrite global arrays with global pointers that are polly_mallocManage'd
ClosedPublic

Authored by bollu on Aug 9 2017, 6:06 AM.

Event Timeline

bollu created this revision.Aug 9 2017, 6:06 AM
grosser accepted this revision.Aug 9 2017, 7:16 AM

LGTM.

This revision is now accepted and ready to land.Aug 9 2017, 7:16 AM
bollu updated this revision to Diff 110434.Aug 9 2017, 10:42 AM

[WIP] [Polly] [ManagedMemoryRewrite] Rewrite global arrays to pointers that are allocated with managed memory.

grosser accepted this revision.Aug 9 2017, 10:44 AM
grosser added inline comments.
lib/CodeGen/ManagedMemoryRewrite.cpp
125

Format? Is this more than 80 lines? In general this does not seem to be run through clang format?

164

Unrelated change?

bollu added a comment.Aug 9 2017, 10:46 AM

This is slightly harder than I initially thought, because,

  1. mutateType on the original global is entirely the wrong way to go about this. It leaves the IR in some inconsistent state.
  1. The right solution (I believe) is to create a replacement global (say A.repl), set it up correctly, visit each use site of the original (say A), and change it to the replacement (A.repl)

The problem is that we can't insert an IRBuilder at a Use to call bitcast. We need an Instruction. But the underlying Value could be a Constant or something, and I can't see a way to extract a location for a builder from a Value. I'm probably missing something in the API, any help appreciated.

Right now, this doesn't actually replace, though it does create a separate A.repl and initialises it (I can see the call to cudaMallocManaged with nvprof).

bollu updated this revision to Diff 110436.Aug 9 2017, 10:49 AM

[WIP] [Polly] [ManagedMemoryRewrite] Rewrite global arrays to pointers that are allocated with managed memory.

bollu added a comment.Aug 9 2017, 10:52 AM

@grosser I'm sorry, I seem to be hitting some Phab bug that updates D36516 rather than creating a new revision. The original D36516 was an NFC change. I tried running another arc patch with the same effect (you can see the bollu updated this revision to Diff 110436.). Something super weird is happening.

Anyway, let's discuss the proposed change here. I'll move the [NFC] patch to another revision.

lib/CodeGen/ManagedMemoryRewrite.cpp
125

Yes, hence WIP. I wished to discuss how to do this correctly. Could you request changes, because this patch should not be LGTMd.

Please see my comment on the patch about the discussion I wish to have.

bollu retitled this revision from [NFC] [ManagedMemoryRewrite] [Polly] Erase original malloc and free to [WIP] [Polly] [ManagedMemoryRewrite] Rewrite global arrays with global pointers that are polly_mallocManage'd.Aug 9 2017, 10:52 AM
bollu edited the summary of this revision. (Show Details)

Do you have some other uncommitted phabricator patches in your branch?

bollu added a comment.Aug 10 2017, 1:40 AM

@grosser: could you change your status to "Requesting changes"? That way, you'll see when I push

grosser requested changes to this revision.Aug 10 2017, 1:40 AM
This revision now requires changes to proceed.Aug 10 2017, 1:40 AM
bollu updated this revision to Diff 110543.Aug 10 2017, 3:02 AM
bollu edited edge metadata.
  • [WIP] not sure why this doesn't work, code seems reasonable
bollu added a comment.Aug 10 2017, 3:08 AM

Example of changes induced by this patch on a .ll file

source.c
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cuda.h>
#include <cuda_runtime.h>
int A[5];
void f(int *arr) {
    for(int i = 0; i < 5; i++) arr[i] = 10;
}


#define CHECK(r) {_check((r), __LINE__);}

void _check(cudaError_t r, int line) {
    if (r != cudaSuccess) {
        printf("CUDA error on line %d: %s\n", line, cudaGetErrorString(r));
        exit(0);
    }
}

void print_arr(int *A) {
    for(int i = 0; i < 5; i++)
        printf("%d := %d\n", i, A[i]);
}


int main() {
    for(int i = 0; i < 5; i++) { A[i] = -42; }
    printf("A before kernel:\n");
    print_arr(A);


    printf("launching kernel...\n");
    f(A);

    printf("printing A...\n");
    print_arr(A);
    return 0;
}
unoptimised.ll
; ModuleID = 'program.ll'
source_filename = "program.c"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

@.str = private unnamed_addr constant [27 x i8] c"CUDA error on line %d: %s\0A\00", align 1
@.str.1 = private unnamed_addr constant [10 x i8] c"%d := %d\0A\00", align 1
@A = common global [5 x i32] zeroinitializer, align 16
@.str.2 = private unnamed_addr constant [18 x i8] c"A before kernel:\0A\00", align 1
@.str.3 = private unnamed_addr constant [21 x i8] c"launching kernel...\0A\00", align 1
@.str.4 = private unnamed_addr constant [15 x i8] c"printing A...\0A\00", align 1
@str = private unnamed_addr constant [17 x i8] c"A before kernel:\00"
@str.1 = private unnamed_addr constant [20 x i8] c"launching kernel...\00"
@str.2 = private unnamed_addr constant [14 x i8] c"printing A...\00"

define void @f(i32* %arr) {
entry:
  br label %entry.split

entry.split:                                      ; preds = %entry
  br label %for.body

for.body:                                         ; preds = %entry.split, %for.body
  %indvars.iv = phi i64 [ 0, %entry.split ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds i32, i32* %arr, i64 %indvars.iv
  store i32 10, i32* %arrayidx, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp ne i64 %indvars.iv.next, 5
  br i1 %exitcond, label %for.body, label %for.end

for.end:                                          ; preds = %for.body
  ret void
}

define void @_check(i32 %r, i32 %line) {
entry:
  br label %entry.split

entry.split:                                      ; preds = %entry
  %cmp = icmp eq i32 %r, 0
  br i1 %cmp, label %if.end, label %if.then

if.then:                                          ; preds = %entry.split
  %call = tail call i8* @cudaGetErrorString(i32 %r)
  %call1 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([27 x i8], [27 x i8]* @.str, i64 0, i64 0), i32 %line, i8* %call)
  tail call void @exit(i32 0)
  unreachable

if.end:                                           ; preds = %entry.split
  ret void
}

declare i32 @printf(i8*, ...)

declare i8* @cudaGetErrorString(i32)

declare void @exit(i32)

define void @print_arr(i32* %A) {
entry:
  br label %entry.split

entry.split:                                      ; preds = %entry
  br label %for.body

for.body:                                         ; preds = %entry.split, %for.body
  %indvars.iv = phi i64 [ 0, %entry.split ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
  %tmp4 = load i32, i32* %arrayidx, align 4
  %0 = trunc i64 %indvars.iv to i32
  %call = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([10 x i8], [10 x i8]* @.str.1, i64 0, i64 0), i32 %0, i32 %tmp4)
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp ne i64 %indvars.iv.next, 5
  br i1 %exitcond, label %for.body, label %for.end

for.end:                                          ; preds = %for.body
  ret void
}

define i32 @main() {
entry:
  br label %entry.split

entry.split:                                      ; preds = %entry
  br label %for.body

for.body:                                         ; preds = %entry.split, %for.body
  %indvars.iv = phi i64 [ 0, %entry.split ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds [5 x i32], [5 x i32]* @A, i64 0, i64 %indvars.iv
  store i32 -42, i32* %arrayidx, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp ne i64 %indvars.iv.next, 5
  br i1 %exitcond, label %for.body, label %for.end

for.end:                                          ; preds = %for.body
  %puts = tail call i32 @puts(i8* getelementptr inbounds ([17 x i8], [17 x i8]* @str, i64 0, i64 0))
  tail call void @print_arr(i32* getelementptr inbounds ([5 x i32], [5 x i32]* @A, i64 0, i64 0))
  %puts1 = tail call i32 @puts(i8* getelementptr inbounds ([20 x i8], [20 x i8]* @str.1, i64 0, i64 0))
  tail call void @f(i32* getelementptr inbounds ([5 x i32], [5 x i32]* @A, i64 0, i64 0))
  %puts2 = tail call i32 @puts(i8* getelementptr inbounds ([14 x i8], [14 x i8]* @str.2, i64 0, i64 0))
  tail call void @print_arr(i32* getelementptr inbounds ([5 x i32], [5 x i32]* @A, i64 0, i64 0))
  ret i32 0
}

; Function Attrs: nounwind
declare i32 @puts(i8* nocapture readonly) #0

attributes #0 = { nounwind }

Optimised .ll file

  • Note that all uses of @A have now become @A.toptr
  • Also notice that wherever we use @A.toptr, we bitcast it from i32 ** to [i32 x 5]*, which is a legal bitcast (?)

You can run the output program through opt, it passes verify module so there's nothing grossly wrong with it.
Is the bitcast incorrect? Am I missing something obvious?

optimised.ll
; ModuleID = 'program.canonical.ll'
source_filename = "program.c"
target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

@.str = private unnamed_addr constant [27 x i8] c"CUDA error on line %d: %s\0A\00", align 1
@.str.1 = private unnamed_addr constant [10 x i8] c"%d := %d\0A\00", align 1
@A = common global [5 x i32] zeroinitializer, align 16
@.str.2 = private unnamed_addr constant [18 x i8] c"A before kernel:\0A\00", align 1
@.str.3 = private unnamed_addr constant [21 x i8] c"launching kernel...\0A\00", align 1
@.str.4 = private unnamed_addr constant [15 x i8] c"printing A...\0A\00", align 1
@str = private unnamed_addr constant [17 x i8] c"A before kernel:\00"
@str.1 = private unnamed_addr constant [20 x i8] c"launching kernel...\00"
@str.2 = private unnamed_addr constant [14 x i8] c"printing A...\00"
@FUNC_f_SCOP_0_KERNEL_0 = private unnamed_addr constant [462 x i8] c"//\0A// Generated by LLVM NVPTX Back-End\0A//\0A\0A.version 3.2\0A.target sm_30\0A.address_size 64\0A\0A\09// .globl\09FUNC_f_SCOP_0_KERNEL_0\0A\0A.visible .entry FUNC_f_SCOP_0_KERNEL_0(\0A\09.param .u64 FUNC_f_SCOP_0_KERNEL_0_param_0\0A)\0A.maxntid 5, 1, 1\0A{\0A\09.reg .b32 \09%r<3>;\0A\09.reg .b64 \09%rd<4>;\0A\0A\09ld.param.u64 \09%rd1, [FUNC_f_SCOP_0_KERNEL_0_param_0];\0A\09mov.u32 \09%r1, %tid.x;\0A\09mul.wide.u32 \09%rd2, %r1, 4;\0A\09add.s64 \09%rd3, %rd1, %rd2;\0A\09mov.u32 \09%r2, 10;\0A\09st.global.u32 \09[%rd3], %r2;\0A\09ret;\0A}\0A\0A\0A\00"
@FUNC_f_SCOP_0_KERNEL_0_name = private unnamed_addr constant [23 x i8] c"FUNC_f_SCOP_0_KERNEL_0\00"
@A.toptr = global i32* null
@llvm.global_ctors = appending global [1 x { i32, void ()*, i8* }] [{ i32, void ()*, i8* } { i32 0, void ()* @A.constructor, i8* bitcast (i32** @A.toptr to i8*) }]

define void @f(i32* %arr) {
entry:
  %polly_launch_0_params = alloca [2 x i8*]
  %polly_launch_0_param_0 = alloca i8*
  %polly_launch_0_param_size_0 = alloca i32
  %polly_launch_0_params_i8ptr = bitcast [2 x i8*]* %polly_launch_0_params to i8*
  br label %entry.split

entry.split:                                      ; preds = %entry
  br label %polly.split_new_and_old

polly.split_new_and_old:                          ; preds = %entry.split
  %0 = call { i64, i1 } @llvm.smul.with.overflow.i64(i64 1, i64 5)
  %.obit = extractvalue { i64, i1 } %0, 1
  %polly.overflow.state = or i1 false, %.obit
  %.res = extractvalue { i64, i1 } %0, 0
  %1 = call { i64, i1 } @llvm.smul.with.overflow.i64(i64 6, i64 %.res)
  %.obit1 = extractvalue { i64, i1 } %1, 1
  %polly.overflow.state2 = or i1 %polly.overflow.state, %.obit1
  %.res3 = extractvalue { i64, i1 } %1, 0
  %2 = call { i64, i1 } @llvm.sadd.with.overflow.i64(i64 0, i64 %.res3)
  %.obit4 = extractvalue { i64, i1 } %2, 1
  %polly.overflow.state5 = or i1 %polly.overflow.state2, %.obit4
  %.res6 = extractvalue { i64, i1 } %2, 0
  %3 = icmp sge i64 %.res6, 0
  %4 = and i1 true, %3
  %polly.rtc.overflown = xor i1 %polly.overflow.state5, true
  %polly.rtc.result = and i1 %4, %polly.rtc.overflown
  br i1 %polly.rtc.result, label %polly.start, label %for.body.pre_entry_bb

for.body.pre_entry_bb:                            ; preds = %polly.split_new_and_old
  br label %for.body

for.body:                                         ; preds = %for.body.pre_entry_bb, %for.body
  %indvars.iv = phi i64 [ %indvars.iv.next, %for.body ], [ 0, %for.body.pre_entry_bb ]
  %arrayidx = getelementptr inbounds i32, i32* %arr, i64 %indvars.iv
  store i32 10, i32* %arrayidx, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp ne i64 %indvars.iv.next, 5
  br i1 %exitcond, label %for.body, label %polly.merge_new_and_old

polly.merge_new_and_old:                          ; preds = %polly.exiting, %for.body
  br label %for.end

for.end:                                          ; preds = %polly.merge_new_and_old
  ret void

polly.start:                                      ; preds = %polly.split_new_and_old
  br label %polly.acc.initialize

polly.acc.initialize:                             ; preds = %polly.start
  %5 = call i8* @polly_initContextCUDA()
  %6 = bitcast i32* %arr to i8*
  %7 = getelementptr [2 x i8*], [2 x i8*]* %polly_launch_0_params, i64 0, i64 0
  store i8* %6, i8** %polly_launch_0_param_0
  %8 = bitcast i8** %polly_launch_0_param_0 to i8*
  store i8* %8, i8** %7
  store i32 4, i32* %polly_launch_0_param_size_0
  %9 = getelementptr [2 x i8*], [2 x i8*]* %polly_launch_0_params, i64 0, i64 1
  %10 = bitcast i32* %polly_launch_0_param_size_0 to i8*
  store i8* %10, i8** %9
  %11 = call i8* @polly_getKernel(i8* getelementptr inbounds ([462 x i8], [462 x i8]* @FUNC_f_SCOP_0_KERNEL_0, i32 0, i32 0), i8* getelementptr inbounds ([23 x i8], [23 x i8]* @FUNC_f_SCOP_0_KERNEL_0_name, i32 0, i32 0))
  call void @polly_launchKernel(i8* %11, i32 1, i32 1, i32 5, i32 1, i32 1, i8* %polly_launch_0_params_i8ptr)
  call void @polly_freeKernel(i8* %11)
  call void @polly_synchronizeDevice()
  call void @polly_freeContext(i8* %5)
  br label %polly.exiting

polly.exiting:                                    ; preds = %polly.acc.initialize
  br label %polly.merge_new_and_old
}

define void @_check(i32 %r, i32 %line) {
entry:
  br label %entry.split

entry.split:                                      ; preds = %entry
  %cmp = icmp eq i32 %r, 0
  br i1 %cmp, label %if.end, label %if.then

if.then:                                          ; preds = %entry.split
  %call = tail call i8* @cudaGetErrorString(i32 %r)
  %call1 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([27 x i8], [27 x i8]* @.str, i64 0, i64 0), i32 %line, i8* %call)
  tail call void @exit(i32 0)
  unreachable

if.end:                                           ; preds = %entry.split
  ret void
}

declare i32 @printf(i8*, ...)

declare i8* @cudaGetErrorString(i32)

declare void @exit(i32)

define void @print_arr(i32* %A) {
entry:
  br label %entry.split

entry.split:                                      ; preds = %entry
  br label %for.body

for.body:                                         ; preds = %for.body, %entry.split
  %indvars.iv = phi i64 [ 0, %entry.split ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds i32, i32* %A, i64 %indvars.iv
  %tmp4 = load i32, i32* %arrayidx, align 4
  %0 = trunc i64 %indvars.iv to i32
  %call = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([10 x i8], [10 x i8]* @.str.1, i64 0, i64 0), i32 %0, i32 %tmp4)
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp ne i64 %indvars.iv.next, 5
  br i1 %exitcond, label %for.body, label %for.end

for.end:                                          ; preds = %for.body
  ret void
}

define i32 @main() {
entry:
  br label %entry.split

entry.split:                                      ; preds = %entry
  br label %for.body

for.body:                                         ; preds = %for.body, %entry.split
  %indvars.iv = phi i64 [ 0, %entry.split ], [ %indvars.iv.next, %for.body ]
  %arrayidx = getelementptr inbounds [5 x i32], [5 x i32]* bitcast (i32** @A.toptr to [5 x i32]*), i64 0, i64 %indvars.iv
  store i32 -42, i32* %arrayidx, align 4
  %indvars.iv.next = add nuw nsw i64 %indvars.iv, 1
  %exitcond = icmp ne i64 %indvars.iv.next, 5
  br i1 %exitcond, label %for.body, label %for.end

for.end:                                          ; preds = %for.body
  %puts = tail call i32 @puts(i8* getelementptr inbounds ([17 x i8], [17 x i8]* @str, i64 0, i64 0))
  tail call void @print_arr(i32* getelementptr inbounds ([5 x i32], [5 x i32]* bitcast (i32** @A.toptr to [5 x i32]*), i64 0, i64 0))
  %puts1 = tail call i32 @puts(i8* getelementptr inbounds ([20 x i8], [20 x i8]* @str.1, i64 0, i64 0))
  tail call void @f(i32* getelementptr inbounds ([5 x i32], [5 x i32]* bitcast (i32** @A.toptr to [5 x i32]*), i64 0, i64 0))
  %puts2 = tail call i32 @puts(i8* getelementptr inbounds ([14 x i8], [14 x i8]* @str.2, i64 0, i64 0))
  tail call void @print_arr(i32* getelementptr inbounds ([5 x i32], [5 x i32]* bitcast (i32** @A.toptr to [5 x i32]*), i64 0, i64 0))
  ret i32 0
}

; Function Attrs: nounwind
declare i32 @puts(i8* nocapture readonly) #0

; Function Attrs: nounwind readnone speculatable
declare { i64, i1 } @llvm.smul.with.overflow.i64(i64, i64) #1

; Function Attrs: nounwind readnone speculatable
declare { i64, i1 } @llvm.sadd.with.overflow.i64(i64, i64) #1

declare i8* @polly_initContextCUDA()

declare i8* @polly_getKernel(i8*, i8*)

declare void @polly_launchKernel(i8*, i32, i32, i32, i32, i32, i8*)

declare void @polly_freeKernel(i8*)

declare void @polly_synchronizeDevice()

declare void @polly_freeContext(i8*)

declare i8* @polly_mallocManaged(i64)

define void @A.constructor() {
entry:
  %mem.raw = call i8* @polly_mallocManaged(i64 16000)
  %mem.typed = bitcast i8* %mem.raw to i32*
  store i32* %mem.typed, i32** @A.toptr
  ret void
}

attributes #0 = { nounwind }
attributes #1 = { nounwind readnone speculatable }
lib/CodeGen/ManagedMemoryRewrite.cpp
137

I believe this bitcast to be correct. Am I doing something "obviously" wrong here?

140

For some reason, this fails. I suspect this indicates some deeper bug in my code.

tools/GPURuntime/GPUJIT.c
36 ↗(On Diff #110543)

This is just a debugging change, I'll remove it at the final commit.

bollu updated this revision to Diff 110693.Aug 11 2017, 5:12 AM
  • [WIP] version now works on test code, going to run it on COSMO to detect bugs in implementation.
Meinersbur edited edge metadata.Aug 12 2017, 3:58 PM

The function editAllUses look more complicated than necessary. Here is what I'd try:

  1. For every function, insert a arrptr.load in the entry (or lazily when needed) (might be beneficial to have just one such load per function)
  1. Iterate of the instructions that contain a use of GlobalArray.
  1. Recurse over the operands of the instruction. Pseudo code:
Value *recursiveReplace(Value *Replacee, GlobalArray*Old, LoadInst *New) {
  if (Value == Old)
    return New;

  Instuction *Replacement = dyn_cast<Instruction>(Replacee); // Replacee is only an instruction for the recursion root.
  for (Use &Op : Value->operands()) {
    auto Replacer= recursiveReplace(Op>get());
    if (Replacer== Op->get())
     continue;
    if (!Replacement) {
     Replacement = Replacee->getAsInstruction();
     // Insert Replacement somewhere, e.g. right after arrptr.load (It was previously a constant, i.e. there are no other dependencies than to arrptr.load)
   }
    Replacement->setOperand(Op.getOperandNo(),  Replacer);
  }
  return Replacement;
}
  1. Call
for (Instruction *UserOfArrayInst : ArrayUserInstructions) {
  auto New = recursiveReplace(UserOfArrayInst , Array, `arrptr.load`);
  if (New != UserOfArrayInst)
    UserOfArrayInst->replaceAllUsesWith(New);
}
lib/CodeGen/ManagedMemoryRewrite.cpp
99

[Suggestion] SmallVector instead of std::vector. Most instructions have only one or two operands, such that we don't need a memory location in the majority of cases.

119–121

[Style] Doxygen comment.

127

[Style] GEP->getPointerOperand() != ArrToRewrite

131

[Suggestion] SmallVector and .reserve()

142–144

[Suggestion] std::set is inefficient. Try SmallPtrSet?

150–151

Why does GEP need to be handled different than other instructions?

There is also GetElementPtrConstantExpr which is not handled here.

158–164

An assert is enough.

221

[Style]

if (auto *I = dyn_cast<Instruction>(Current))
229–234

AFAIK there is nothing else than Instruction and Constant deriving from User. An assert(isa<Constant>(Current)) or auto *C = cast<Constant>(Current) would be enough.

238

Isn't it more replacing than removing?

Also: functions start with lower-case letters

246

What does this return for multidimension arrays?

257

Uninitialized data would be ok as well, no? Could handle them the same as zero-initialized data, which is a value uninitialized data can have.

266

[Style] cast<> instead of dyn_cast<> gives you a type check in asset-builds.

278

Where does the magic number 100 come from?

290–291

Is the priority even important? AFAIK it can be just 0 for everything.

296

unfinished sentence

test/GPGPU/simple-managed-memory-rewrite.ll
7 ↗(On Diff #110693)

REQUIRES: pollyacc missing

bollu updated this revision to Diff 110889.Aug 13 2017, 1:09 PM
  • [Code dump] Dump of all changes that now allow global arrays to be rewritten

correctly.

We still need to rewrite allocas correctly.

bollu updated this revision to Diff 110894.Aug 13 2017, 2:17 PM
  • [UNDEBUG] remove debug printing in GPUJIT
  • [NFC] check polly and fix test case to reduce size by x100
bollu marked 13 inline comments as done.Aug 15 2017, 2:00 AM
bollu added inline comments.
lib/CodeGen/ManagedMemoryRewrite.cpp
127

rewriteGEP has been removed.

131

rewriteGEP has been removed.

150–151

Yep, GEP special casing is removed.

246

It would return the inner array type. However, I don't believe this is a problem due to the way we allocate memory.

We issue a single cudaMallocManaged, and then we convert the [T] to T*. This should still work, memory-layout wise unless I'm missing something.

257

Yes, it would. I set it to zero as a safe default. Do we gain / lose anything by switching to uninitialized?

278

Removed.

bollu updated this revision to Diff 111142.Aug 15 2017, 4:11 AM
bollu marked 4 inline comments as done.
  • [WIP] Update according to Michael's comments.
bollu updated this revision to Diff 111198.Aug 15 2017, 10:15 AM
  • [NFC] undo all changes to GPUJIT.

@Meinersbur - Second round of review, please :)

grosser accepted this revision.Aug 16 2017, 2:40 AM

LGTM.

lib/CodeGen/ManagedMemoryRewrite.cpp
95

is populated WITH all the

120

No braces and no ";"

Maybe also reduce casting.

124

You don't rewrite GEP, so remove the comment.

133

keeps

150

No braces for single statements

163

No braces for single-statements

215

Why is there a "false"?

307

Not needed any more.

lib/Support/RegisterPasses.cpp
48 ↗(On Diff #111198)

Drop.

353 ↗(On Diff #111198)

Unrelated change?

This revision is now accepted and ready to land.Aug 16 2017, 2:40 AM
philip.pfaffe edited edge metadata.Aug 16 2017, 6:14 AM

The idea seems sensible. Some nits inline. Most importantly the intent and algorithm of the pass should be documented much more explicitely.

lib/CodeGen/ManagedMemoryRewrite.cpp
37

Unused include?

49

Function names should be humbleCamelCase

97

This algorithm requires a much more extensive documentation.

103

This is unused?

131

InstsToBeDeleted is unused?

161

Why is this a member?

187

Use SmalVvector instead of std::vector

193

Here and below: Superfluous braces

203

What's the purpose of this?

207

Use a SmallSet instead.

209

ElemTy->getPointerTo(AddrSpace)

234

No need to pass the empty array here.

241

Isn't this off by a factor of 8?

254

Use a SmallVector instead

260

Superfluous braces.

265

InstsToBeDeleted is never populated with anything.

281

static

300

static

313

Just Builder.getInt64(Size)

321

Is this actually ever relevant?

bollu updated this revision to Diff 111351.Aug 16 2017, 8:25 AM
bollu marked 25 inline comments as done.
  • [NFC] Rewrite based on Phillip & Tobias' comments
bollu updated this revision to Diff 111475.Aug 17 2017, 1:28 AM
  • [Bugfix] Send bytes, not bits. Also, move the alloca function extraction to the correct place.
bollu added inline comments.Aug 17 2017, 1:28 AM
lib/CodeGen/ManagedMemoryRewrite.cpp
203

The AddrSpace is a parameter to all pointers. I believe the default address space is zero.

As for why LLVM uses it, for example, the NVPTX backend uses address spaces to refer to where the pointer lives in memory.

215

I need some help with this one, actually. What is the correct linkage type?

321

Probably not, but I'd much rather be defensive about things like this.

bollu added a comment.Aug 17 2017, 1:29 AM

@philip.pfaffe Another round of review, please?

bollu updated this revision to Diff 111476.Aug 17 2017, 1:32 AM
  • [Re-upload] diff against the newest HEAD.
bollu updated this revision to Diff 111478.Aug 17 2017, 1:36 AM
  • [Merge] Merged with master, hoping that GPUJIT does not show up from arc diff this time.
philip.pfaffe added inline comments.Aug 17 2017, 1:58 AM
lib/CodeGen/ManagedMemoryRewrite.cpp
124

Should you not assert before using I above?

151

Does this actually ever happen? You're recursively expanding ConstExpr operands, so after a single pass that shouldn't be necessary anymore. Does this mean that this loop's trip count is actually never greater than two?

203

The question was not what an AddrSpace is, but why you're actively passing the default everywhere, and even store this in an extra variable.

221

Internal or Common linkage should be used.

bollu updated this revision to Diff 111484.Aug 17 2017, 2:32 AM
  • [NFC] Discuss algorithm with phillip offline, he helped to simplify it further.
  • [Linkage] Update linkage code to use the correct linker options as well as the ignore linkage flag. Update test case to match this change
  • [ReplaceUsesOfWith] remove double-loop that was not required.
philip.pfaffe accepted this revision.Aug 17 2017, 3:32 AM

Besides the AddrSpace, LGTM!

bollu updated this revision to Diff 111488.Aug 17 2017, 3:40 AM
  • [NFC] remove 0 address space because that is the default value.
This revision was automatically updated to reflect the committed changes.