One of the LLVM optimizations, critical edge splitting, can also clone tail
instructions. This is a dangerous operation for calls to __syncthreads(),
and the transformation leads to undefined behavior or incorrect results.
The patch fixes this problem by replacing the __syncthreads() call with an
inline assembler instruction whose cost is too high for the optimizer to
copy it.
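For illustration, a minimal sketch of the kind of replacement described; the wrapper name is hypothetical, but "bar.sync 0;" is the PTX barrier that __syncthreads() itself lowers to:

  // Hypothetical wrapper: emit the block-wide barrier as inline asm
  // instead of calling the __syncthreads() intrinsic. The asm call is
  // opaque to LLVM's cost model, so critical edge splitting will not
  // duplicate it.
  __device__ static void sync_threads_asm() {
    asm volatile("bar.sync 0;" : : : "memory");
  }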
Diff Detail
- Repository
- rL LLVM
Event Timeline
I'll accept the patch for the sake of consistency and correctness of execution. Just one question:
> whose cost is too high
So should we expect a performance penalty until the duplication of convergent functions is fixed in LLVM and we can revert back to __syncthreads()?
No, the cost in LLVM terms is high, but in the end, we end up with absolutely the same code as before. That's why it is marked as NFC.
Do I understand correctly that the intrinsic is apparently "misoptimized" by LLVM?
If so, would that also be a problem for other CUDA code, for example, user code that uses this intrinsic?
The intrinsic is marked convergent, but some of the optimizations ignore this attribute. I don't remember which one exactly, something like critical edge splitting.
I think critical edge splitting handles convergent correctly, since it is one of the motivating examples. I just looked at a random example in test/OpenMP, and this doesn't look correct to me:
__kmpc_barrier is declared as convergent, but the callers are not:
  declare void @__kmpc_barrier(%struct.ident_t*, i32) #3

  define internal void @__omp_outlined__78(i32* noalias %.global_tid., i32* noalias %.bound_tid.) #0 {

  attributes #0 = { noinline norecurse nounwind optnone "correctly-rounded-divide-sqrt-fp-math"="false" "disable-tail-calls"="false" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-frame-pointer-elim"="false" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="false" "stack-protector-buffer-size"="8" "target-features"="+ptx32,+sm_20" "unsafe-fp-math"="false" "use-soft-float"="false" }
  attributes #1 = { nounwind readnone }
  attributes #2 = { argmemonly nounwind }
*All* functions need to be assumed convergent, not just the convergent barrier leaves.
The problem is not in the OpenMP code, it is in the CUDA code. It appears only when we inline the runtime written in CUDA, where everything is marked correctly. For OpenMP code it is not necessary to mark all the functions as convergent; all the required functions are marked on the CUDA side.
I don't follow how this is unnecessary. This is producing an IR module with a convergent call from a non-convergent function. That is plainly broken, and the verifier should probably reject it. Any transform on the callers of these could violate the convergent rules. The IR should be semantically correct at all times, regardless of what is inlined or linked.
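For source written in clang's CUDA mode, the attribute can be spelled explicitly; a hedged sketch (the stub below is hypothetical, not the real __kmpc_barrier declaration):

  // Both the barrier and every function that calls it (directly or
  // transitively) must carry the convergent attribute; otherwise the
  // IR ends up with a convergent call inside a non-convergent caller.
  extern "C" __device__ __attribute__((convergent))
  void barrier_stub(void);        // hypothetical barrier declaration

  __device__ __attribute__((convergent))
  void outlined_region() {        // caller must be convergent too
    barrier_stub();
  }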
If the verifier is broken, it must be fixed, of course, and __kmpc_barrier too. But the problem still remains: at least one of the functions that calculates the cost of a function during critical edge splitting does not take the convergent attribute into account, and that leads to dangerous optimizations.
Better to ask Doru; he tried to investigate this problem (after my patch, which just copies the named-barriers approach, the asm volatile construct does not have this problem) and, if I recall correctly, reported it. But I'm not sure to whom he reported it, LLVM or NVIDIA.
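For context, the named-barriers construct referred to here looks roughly like this (a sketch; the wrapper name and parameters are illustrative):

  // PTX named barrier: barrier resource `id` (0-15) synchronizes
  // `num_threads` threads. Like the workaround above, it is inline asm,
  // so LLVM will not duplicate it.
  __device__ static void named_sync(int id, int num_threads) {
    asm volatile("bar.sync %0, %1;" : : "r"(id), "r"(num_threads) : "memory");
  }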
I reported several problems to NVIDIA. Is the problem below the one you're referring to?
For the following CUDA code:
  if (threadIdx.x == 0) {
    // do some initialization (A)
  }
  __syncthreads();
  // some code (B)
when I enable optimizations, the syncthreads gets duplicated and the code hangs at runtime (the barrier is now reached from two different call sites under divergent control flow):
  entry:
    %0 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #6, !range !12
    %cmp.i2 = icmp eq i32 %0, 0
    br i1 %cmp.i2, label %if.then, label %if.end.split

  if.end.split:
    tail call void @llvm.nvvm.barrier0() #6
    ; LLVM IR for B code block
    br label %if.end

  if.then:
    ; LLVM IR for A code block
    tail call void @llvm.nvvm.barrier0() #6
    ; LLVM IR for B code block
    br label %if.end

  if.end:
If I don't enable optimizations, then I get the following code, which works correctly and doesn't hang:
  entry:
    %0 = tail call i32 @llvm.nvvm.read.ptx.sreg.tid.x() #6, !range !12
    %cmp.i2 = icmp eq i32 %0, 0
    br i1 %cmp.i2, label %if.then, label %if.end

  if.then:
    ; LLVM IR for A code block
    br label %if.end

  if.end:
    tail call void @llvm.nvvm.barrier0() #6
    ; LLVM IR for B code block
> I reported several problems to NVIDIA.
Unfortunately, reporting problems to NVIDIA doesn't necessarily cause things to be fixed upstream. If you can (also) report problems occurring upstream to the upstream bug tracker, that would be greatly helpful.
The scheme I'd like to see us follow upstream, when workarounds are required, is to keep the proper code in addition to the workarounds. Guard the workarounds with ifdefs so that we can use them for older compiler versions as necessary, but use the proper version when compiling with newer compilers (once the bugs have been fixed). Over time, we'll be able to eliminate workarounds for compilers we no longer support.
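A minimal sketch of that guarded-workaround pattern, assuming a clang version check; the version threshold is hypothetical, since no fixed compiler version has been identified yet:

  // Proper code for fixed compilers, guarded workaround for older ones.
  __device__ void barrier_with_workaround() {
  #if defined(__clang__) && (__clang_major__ >= 9)   // threshold is a placeholder
    __syncthreads();                                 // proper code
  #else
    asm volatile("bar.sync 0;" : : : "memory");      // workaround
  #endif
  }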
Thanks for this! Should we now revert this patch to use __syncthreads() once again instead of the asm instruction?
The asm also isn't necessarily a real workaround. Inline asm call sites need to be marked as convergent as well.
I agree. Plus, we have the same problem with the named barriers: they are also represented as inline asm and must be marked as convergent as well.
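A sketch of one way to express that at the source level, assuming clang's convergent attribute on the wrapper (whether the attribute reaches the asm call site itself in IR is the open question raised above; the wrapper is hypothetical):

  // Wrapping the asm in a function marked convergent at least carries
  // the attribute into the IR for the wrapper, so transforms must treat
  // the barrier as convergent in addition to leaving it unduplicated.
  __device__ __attribute__((convergent)) static void block_barrier() {
    asm volatile("bar.sync 0;" : : : "memory");
  }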