This is an archive of the discontinued LLVM Phabricator instance.

[OMPIRBuilder] Fill the Aggregate Struct in the serialized parallel region
AbandonedPublic

Authored by kiranchandramohan on Jul 7 2022, 6:27 AM.

Details

Summary

D110114 enabled usage of a struct for passing parameters to the
kmpc_fork OpenMP runtime call. The filling of the aggregate struct is
currently only happening in the parallel case but not for the serialized
parallel version. I think this happens because the extractor creates
these and it will only do it at the place that it is called.

In this proposed fix, the instructions that fill the aggregate struct are
copied over to the serialized parallel region.

Diff Detail

Event Timeline

Herald added a project: Restricted Project. · View Herald TranscriptJul 7 2022, 6:27 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
kiranchandramohan requested review of this revision.Jul 7 2022, 6:27 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 7 2022, 6:27 AM
jdoerfert added inline comments.Jul 7 2022, 8:39 AM
llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
1054

This is super fragile. What if there is no GEP for 0 offset or a GEP is reused. If we want something like this we should verify the conditions more closely. E.g., GEP->num_uses() == 1, all users of the CapturedStruct should be GEPs or other things we expect, etc.

Honestly, I would suggest to unify the handling rather than adding special cases. So, always outline, always do the same thing. In the serial case just omit the fork_call indirection.

Would that work?

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
1054

I wasn't able to think of a way to unify the handling. The code to outline exists only in one place. So the extractor can be called only once unless we duplicate the parallel region code or modify the extractor to provide two copies or something like that. Were you suggesting something else?

This is super fragile. What if there is no GEP for 0 offset or a GEP is reused. If we want something like this we should verify the conditions more closely. E.g., GEP->num_uses() == 1, all users of the CapturedStruct should be GEPs or other things we expect, etc.

Yeah, I agree that it is fragile and we should add the asserts. And there is always the case that something changes in the extractor's handling of aggregates and then all these will not work.

jdoerfert added inline comments.Jul 7 2022, 9:25 AM
llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
1054

I'm spitballing here: Could we not call the outlined function in the sequential case too, just without the fork_call indirection? If so, would that solve the issue?
TBH, I'm not sure what the issue is w/o code to look at.

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
1054

Both of them call the outlined function. I think that the extractor is only called once for the parallel region to outline and it fills the aggregate struct and creates the call to the outlined function at the place the extractor is called. For the serialized version we could previously just call the outlined function directly, but now because the arguments are passed in a structure, we have to fill that structure separately. Please correct me if this is not what actually happens.

I posted an example in D110114 yesterday to start a discussion but later thought can start a discussion in a patch. Copying the example over here. The call to @sb_..omp_par from kmpc_fork_call in block omp_parallel has the aggregate filled in while the call to @sb_..omp_par in the serialized parallel region in block 6 does not have this.

define void @sb_(ptr %0, ptr %1) !dbg !3 {
  %structArg = alloca { ptr }, align 8
  %tid.addr = alloca i32, align 4, !dbg !7
  %zero.addr = alloca i32, align 4, !dbg !7
  store i32 0, ptr %tid.addr, align 4, !dbg !7
  store i32 0, ptr %zero.addr, align 4, !dbg !7
  %3 = load i32, ptr %0, align 4, !dbg !7
  %4 = icmp ne i32 %3, 0, !dbg !7                 
  br label %entry, !dbg !7

entry:                                            ; preds = %2
  %omp_global_thread_num = call i32 @__kmpc_global_thread_num(ptr @1), !dbg !7
  br i1 %4, label %5, label %6

5:                                                ; preds = %entry
  br label %omp_parallel

omp_parallel:                                     ; preds = %5
  %gep_ = getelementptr { ptr }, ptr %structArg, i32 0, i32 0
  store ptr %1, ptr %gep_, align 8
  call void (ptr, i32, ptr, ...) @__kmpc_fork_call(ptr @1, i32 1, ptr @sb_..omp_par, ptr %structArg), !dbg !9
  br label %omp.par.outlined.exit

omp.par.outlined.exit:                            ; preds = %omp_parallel
  br label %omp.par.exit.split

omp.par.exit.split:                               ; preds = %omp.par.outlined.exit
  br label %7

6:                                                ; preds = %entry
  call void @__kmpc_serialized_parallel(ptr @1, i32 %omp_global_thread_num)
  call void @sb_..omp_par(ptr %tid.addr, ptr %zero.addr, ptr %structArg), !dbg !9
  call void @__kmpc_end_serialized_parallel(ptr @1, i32 %omp_global_thread_num)
  br label %7

7:                                                ; preds = %6, %omp.par.exit.split
  ret void, !dbg !10
}

Thanks for the example. That makes sense now. I tried to list some options from the top of my head below. WDYT?

llvm/lib/Frontend/OpenMP/OMPIRBuilder.cpp
1001

Can we copy the block (CI->getParent()) here and use it in the sequential version?
Still not great but probably better than looking for stores. We could also copy
all users of the alloca (structArg) until we hit the call.
Or, we move/generate the stores and such directly in the block in which we branch (%entry in the example).
Maybe if we outline before we do the conditional it is even easier.

kiranchandramohan abandoned this revision.Feb 8 2023, 7:38 AM

D138495 fixed this in a better way.