We want to fold more aggressively on the GPU and we therefore disable
the generation of llvm.trap before an unreachable by default.
Depends on D106299
Differential D106301
[OpenMP] Disable trap before unreachable for OpenMP device jobs jdoerfert on Jul 19 2021, 12:06 PM. Authored by
Details
We want to fold more aggressively on the GPU and we therefore disable Depends on D106299
Diff Detail
Unit Tests Event TimelineComment Actions What's the problem with emitting llvm.trap in various unreachable places? Wondering if it also affects translating assert to an llvm.trap Comment Actions llvm.trap is preserved, thus branches to an llvm.trap are preserved.
no. Comment Actions As an example, often end up with code like this right now: %26 = load i32, i32* addrspacecast (i32 addrspace(3)* @execution_param to i32*), align 4, !dbg !39, !tbaa !27 %and.i13.i.i = and i32 %26, 4, !dbg !39 %cmp.i14.not.i.i = icmp eq i32 %and.i13.i.i, 0, !dbg !39 br i1 %cmp.i14.not.i.i, label %if.end.i129.i.i, label %__kmpc_parallel_51.exit.i, !dbg !39 if.end.i129.i.i: ; preds = %_Z16DecParallelLevelbj.exit.i.i tail call void @llvm.trap() #10, !dbg !39 unreachable, !dbg !39 which could be: br label %__kmpc_parallel_51.exit.i, !dbg !39 Comment Actions That's interesting. Consistent with IR in general, template <bool Trap> int test(int x) { if (x < 42) { return x; } else { if (Trap) __builtin_trap(); __builtin_unreachable(); } } extern "C" { int trap(int x) { return test<true>(x); } int none(int x) { return test<false>(x); } } => define i32 @trap(i32 returned %0) { %2 = icmp slt i32 %0, 42 br i1 %2, label %4, label %3 3: ; preds = %1 tail call void @llvm.trap() #3 unreachable 4: ; preds = %1 ret i32 %0 } define i32 @none(i32 returned %0) { %2 = icmp slt i32 %0, 42 tail call void @llvm.assume(i1 %2) #3 ret i32 %0 } So yes, we'll get faster codegen if we are willing to throw away traps followed by unreachable code. If that's a legitimate transform to do, it seems like something we should do in instcombine, instead of a separate pass. I.e. fold trap, unreachable to unreachable. Can we do that instead? edit: seems reasonably likely that the transform always makes gpu code faster, so perhaps it could be an architecture-specific instcombine Comment Actions We could, but we should not. A trap inserted by assert or the user should stay (IMHO). Comment Actions I'm not sure about that - we could tie instcombine to -O0 or some similar proxy for debugging vs performance - but in practice it's fairly likely that most traps are compiler inserted so it probably works out the same. Conditional instcombine would let us remove much of the current logic for conditionally inserting traps which seems a win for implementation complexity. Doesn't matter much for this patch, if D106299 lands then sure, let's switch it on for openmp GPU. If it goes the instcombine route then we don't need to toggle a switch. |