This patch adds initial support for codegen of private and firstprivate clauses of nvptx.
I simply use the same support in host codegeneration, even though optimizations are possible.
Optimizations may be done by having codegen first analyze data sharing attributes on target and teams (e.g. target firstprivate and teams private on the same variable can drop firstprivate on target). These optimizations will be subject of a successive patch.
Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
lib/CodeGen/CGOpenMPRuntimeNVPTX.cpp | ||
---|---|---|
53–57 ↗ | (On Diff #51080) | After some investigation I found out that this the same code, that a;ready exists in CodeGenFunction::EmitOMPTeamsDirective(). void CodeGenFunction::EmitOMPTeamsDirective(const OMPTeamsDirective &S) { OMPLexicalScope(*this, S); // Emit parallel region as a standalone region. auto &&CodeGen = [&S](CodeGenFunction &CGF) { OMPPrivateScope PrivateScope(CGF); (void)CGF.EmitOMPFirstprivateClause(S, PrivateScope); CGF.EmitOMPPrivateClause(S, PrivateScope); (void)PrivateScope.Privatize(); CGF.EmitStmt(cast<CapturedStmt>(S.getAssociatedStmt())->getCapturedStmt()); }; if (getLangOpts().OpenMPIsDevice) emitInlinedDirective(CGF, OMPD_teams, CodeGen); else emitCommonOMPTeamsDirective(*this, S, OMPD_teams, CodeGen); } |
Hi Alexey
Thanks for your comment. The suggested change will not work as I intended in my patch when using the host as a device. This happens when you select the following options:
-fomptargets=powerpc64le-ibm-linux-gnu -fopenmp-is-device
In this case we generate device code and the target is ppc64. In ppc64 we need to generate a call to kmpc_fork_teams. In your proposed change, we treat all devices in an undistinguished way and we do not generate a call to fork_teams.
There are various reasons why we should not do that, the most clear ones to me being:
- When using the host as host or as target device we generate different codes. This would mess up with performance profiling.
- On a host it is still important to have teams as that may be the place where coarse grain parallelism comes from.
If you still want no specialization in CGOpenMPRuntimeNVPTX, we will need to check if we are targeting a device and if that device is an nvptx one.
I know that the problem is that we have two CodeGen objects being created in different places if we target nvptx or host. However, by the way the interface is currently structured, I do not see any way out of this duplication.
Thanks!
To make clear my comment I an updating the patch following Alexey's comment and modifying it to make it work.
You can see that now we have the following check:
CGM.getTarget().getTriple().getArch() == llvm::Triple::nvptx ||
CGM.getTarget().getTriple().getArch() == llvm::Triple::nvptx64
lib/CodeGen/CGStmtOpenMP.cpp | ||
---|---|---|
3074–3078 ↗ | (On Diff #51285) | This now duplicates logic from CodeGenModule::createOpenMPRuntime and has to be extended for every new triple in multiple places. I would like to propose a new function like virtual bool CGOpenMPRuntime::shouldEmitInlinedTeamsDirective which defaults to false and can be overriden in CGOpenMPRuntimeNVPTX. This way we can later on easily reuse the logic for new device types and target triples. |
Hi Jonas
Thanks for your comment. That scheme would not work for #parallel on the gpu in case we do not want to inline the parallel region.
There are patches coming in which that is going to happen so I think that we really need to distinguish between different pragmas.
Is a #pragma omp parallel changing the way code is generated for #pragma omp teams? How would that look like in CodeGenFunction::EmitOMPTeamsDirective?
My point is that specialization should happen in CGOpenMPRuntimeNVPTX (by means of overriding) without duplicating code, as Alexey said.
Hi
Thinking more carefully, I believe you are right - no, it is not. I guess we would not call at all shouldEmitInlinedTeamsDirective for the case of paralle.
Thanks!
Thinking more about this patch I think it is a right solution to override void llvm::Value *emitParallelOrTeamsOutlinedFunction() and void emitTeamsCall() for CGOpenMPRuntimeNVPTX class:
llvm::Value *CGOpenMPRuntimeNVPTX::emitParallelOrTeamsOutlinedFunction() { llvm::Value *OutlinedFn = CGOpenMPRuntime::emitParallelOrTeamsOutlinedFunction(); cast<llvm::Function>(OutlinedFn)->addFnAttr(llvm::Attribute::AlwaysInline); return OutlinedFn; }
void CGOpenMPRuntimeNVPTX::emitTeamsCall() { if (!CGF.HaveInsertPoint()) return; Address ZeroAddr = CGF.CreateTempAlloca(CGF.Int32Ty, CharUnits::fromQuantity(4), /*Name*/ ".zero.addr"); CGF.InitTempAlloca(ZeroAddr, CGF.Builder.getInt32(/*C*/ 0)); llvm::SmallVector<llvm::Value *, 16> OutlinedFnArgs; OutlinedFnArgs.push_back(ZeroAddr.getPointer()); OutlinedFnArgs.push_back(ZeroAddr.getPointer()); OutlinedFnArgs.append(CapturedVars.begin(), CapturedVars.end()); CGF.EmitCallOrInvoke(OutlinedFn, OutlinedFnArgs); }
Also, please update patch to the latest revision
[OPENMP] Rebase patch on new support for codegen for teams in nvptx target. private and firstprivate clauses are now dealt with in the teams patch and this one only becomes regression tests.
Hi
Yes, the purpose of these patches was to check correctness of code gen for the related clauses when targeting nvptx exclusively.
Since I wrote these tests, the base support changed to generate same code on host and any device type, and it seemed to me that the need for a differentiated regression test was not there any more (happy to be told otherwise).
Of course, there are things that we would like to do a bit different on nvptx, especially for firstprivate, but I have not yet had time to work on this and to reflect this into this regression test. If I do, I would like to update this patch, but it seemed too early to just abandon this.
Thanks and please do let me know what is your viewpoint on this
- Carlo