This patch fixes PR25342.
This updated patch enhances InstCombineCasts to handle the following case, as suggested by Chandler:

  A -> B cast
  PHI
  B -> A cast

so that all of the bitcast operations can be removed.
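As a purely illustrative sketch (hypothetical IR, not the test case added by the patch), the pattern looks like this: a float value is bitcast to i32 on each incoming edge, merged by a PHI, and bitcast back to float before use:

define float @example(i1 %c, float %a, float %b) {
entry:
  br i1 %c, label %left, label %right

left:                                             ; A -> B cast
  %a.i = bitcast float %a to i32
  br label %merge

right:                                            ; A -> B cast
  %b.i = bitcast float %b to i32
  br label %merge

merge:                                            ; PHI, then B -> A cast
  %m = phi i32 [ %a.i, %left ], [ %b.i, %right ]
  %m.f = bitcast i32 %m to float
  ret float %m.f
}

After the combine, the PHI carries float directly and every bitcast goes away:

merge:
  %m = phi float [ %a, %left ], [ %b, %right ]
  ret float %m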
Differential D14596
[SROA] Choose more profitable type in findCommonType
Authored by Carrot on Nov 11 2015, 4:26 PM.
I'm undecided about this. Chandler, any thoughts? If we add a DAGCombine that makes bitcast(load(addr)) into a load of a different type, and similar for stores, would that address all of the current motivating examples for this?

There are other heuristics that can be applied here:
I was really confused about this patch until I read the bug. =] For the record, bugs don't go to a widely followed mailing list, so it is often helpful to provide a summary of the problem being faced when posting a patch for review, in addition to a link to the bug. Otherwise it can be hard to figure out.

However, I don't think this is the correct fix. Regardless of what type SROA chooses to use, the optimizer should *always* fix the type when we discover that, in the end, loads and stores end up feeding floating point instructions. I worked really hard to fix this exact issue so that SROA could stay simple and ignorant of such issues, and so we could continue to lower memcpy as integer loads and stores.

It would really help to provide a reduced IR test case. The bug is full of C++ code and asm, but only has excerpts of the IR. The integer loads and stores should only exist within the function as long as there are no floating point operations on the loaded or stored type. The question is why that isn't proving to be the case, as there is code that tries to ensure that in the tree.

Here is a simple case (not yet minimized):

#include <complex>

DD compute(DD &c1, DD& c2) {
  float r1 = c1.real();
  float i1 = c1.imag();
  float r2 = c2.real();
  float i2 = c2.imag();
  return DD(r1 * r2 - i1 * i2, r2 * i1 + r1 * i2);
}

void foo(int n) {
  int i;
  DD ldd = 0;
  for (i = 0; i < n; i++)
    ldd += compute(dd, dd2);
  dd = ldd;
}

On x86, the kernel loop generated by clang/LLVM is:

.LBB1_2:                                # =>This Inner Loop Header: Depth=1
        movd    %eax, %xmm3
        addss   %xmm1, %xmm3
        movd    %xmm3, %eax
        addss   %xmm0, %xmm2
        decl    %edi
        jne     .LBB1_2

The better one should be:

.L4:
        vaddss  %xmm2, %xmm0, %xmm0
        vaddss  %xmm3, %xmm1, %xmm1
        decl    %edi
        jne     .L4

The memcpy is lowered in the instruction-combine pass, so you mean the root cause should be in that pass?

As I said, what would help me here is a reduced *IR* test case. I think C++ is too high level, and clearly everything is broken by the time you get to x86. Without a good IR test case, it's impossible for me to give a concrete path forward here.
Following is the IR of function foo in David's test case, after the last pass that still has the correct type for %ldd:

define void @_Z3fooi(i32 signext %n) #0 {
  %ldd = alloca %"struct.std::complex", align 4
  %ref.tmp = alloca %"struct.std::complex", align 4
  %0 = bitcast %"struct.std::complex"* %ldd to i8*
  call void @llvm.lifetime.start(i64 8, i8* %0) #3
  call void @_ZNSt7complexIfEC2Eff(%"struct.std::complex"* %ldd, float 0.000000e+00, float 0.000000e+00)
  br label %for.cond

for.cond:                                         ; preds = %for.body, %entry
  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %cmp = icmp slt i32 %i.0, %n
  br i1 %cmp, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %call = call fastcc [2 x float] @_ZL7computeRSt7complexIfES1_()
  %coerce.dive = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %ref.tmp, i32 0, i32 0
  %1 = bitcast { float, float }* %coerce.dive to [2 x float]*
  %call.fca.0.gep = getelementptr inbounds [2 x float], [2 x float]* %1, i32 0, i32 0
  %call.fca.0.extract = extractvalue [2 x float] %call, 0
  store float %call.fca.0.extract, float* %call.fca.0.gep
  %call.fca.1.gep = getelementptr inbounds [2 x float], [2 x float]* %1, i32 0, i32 1
  %call.fca.1.extract = extractvalue [2 x float] %call, 1
  store float %call.fca.1.extract, float* %call.fca.1.gep
  %call1 = call dereferenceable(8) %"struct.std::complex"* @_ZNSt7complexIfEpLIfEERS0_RKS_IT_E(%"struct.std::complex"* %ldd, %"struct.std::complex"* dereferenceable(8) %ref.tmp)
  %inc = add nsw i32 %i.0, 1
  br label %for.cond

for.end:                                          ; preds = %for.cond
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* bitcast (%"struct.std::complex"* @dd to i8*), i8* %0, i64 8, i32 4, i1 false)
  call void @llvm.lifetime.end(i64 8, i8* %0) #3
  ret void
}

In this function there is no explicit floating point operation; all of them are wrapped behind function calls, so the combine pass actually works as intended and the memcpy is lowered to integer loads/stores. But later the inlining pass inlines all of the complex-number function calls, so the SROA pass sees both integer loads/stores and FP operations.

You can probably create a smaller test case by simplifying the post-inline IR dump.

David

Yea, as David was somewhat saying, I think the interesting test case is the one after inlining, where there are both integer and FP operations.

The IR after inlining:
; Function Attrs: nounwind
  %ldd = alloca i64, align 8
  %tmpcast = bitcast i64* %ldd to %"struct.std::complex"*
  %ref.tmp = alloca %"struct.std::complex", align 4
  %0 = bitcast i64* %ldd to i8*
  call void @llvm.lifetime.start(i64 8, i8* %0) #3
  %_M_value.realp.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %tmpcast, i64 0, i32 0, i32 0
  store float 0.000000e+00, float* %_M_value.realp.i, align 4
  %_M_value2.imagp.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %tmpcast, i64 0, i32 0, i32 1
  store float 0.000000e+00, float* %_M_value2.imagp.i, align 4
  br label %for.cond

for.cond:                                         ; preds = %for.body, %entry
  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %cmp = icmp slt i32 %i.0, %n
  br i1 %cmp, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %1 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 0), align 4
  %2 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 1), align 4
  %3 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd2, i64 0, i32 0, i32 0), align 4
  %4 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd2, i64 0, i32 0, i32 1), align 4
  %mul.i = fmul float %1, %3
  %mul4.i = fmul float %2, %4
  %sub.i = fsub float %mul.i, %mul4.i
  %mul5.i = fmul float %2, %3
  %mul6.i = fmul float %1, %4
  %add.i.4 = fadd float %mul5.i, %mul6.i
  %.fca.0.insert.i = insertvalue [2 x float] undef, float %sub.i, 0
  %.fca.1.insert.i = insertvalue [2 x float] %.fca.0.insert.i, float %add.i.4, 1
  %call.fca.0.gep = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %ref.tmp, i64 0, i32 0, i32 0
  %call.fca.0.extract = extractvalue [2 x float] %.fca.1.insert.i, 0
  store float %call.fca.0.extract, float* %call.fca.0.gep, align 4
  %5 = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %ref.tmp, i64 0, i32 0, i32 1
  %call.fca.1.extract = extractvalue [2 x float] %.fca.1.insert.i, 1
  store float %call.fca.1.extract, float* %5, align 4
  %_M_value.realp.i.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %ref.tmp, i64 0, i32 0, i32 0
  %6 = load float, float* %_M_value.realp.i.i, align 4
  %_M_value.realp.i.3 = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %tmpcast, i64 0, i32 0, i32 0
  %7 = load float, float* %_M_value.realp.i.3, align 4
  %add.i = fadd float %6, %7
  store float %add.i, float* %_M_value.realp.i.3, align 4
  %_M_value.imagp.i.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %ref.tmp, i64 0, i32 0, i32 1
  %8 = load float, float* %_M_value.imagp.i.i, align 4
  %_M_value3.imagp.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %tmpcast, i64 0, i32 0, i32 1
  %9 = load float, float* %_M_value3.imagp.i, align 4
  %add4.i = fadd float %8, %9
  store float %add4.i, float* %_M_value3.imagp.i, align 4
  %inc = add nsw i32 %i.0, 1
  br label %for.cond

for.end:                                          ; preds = %for.cond
  %10 = load i64, i64* %ldd, align 8
  store i64 %10, i64* bitcast (%"struct.std::complex"* @dd to i64*), align 4
  call void @llvm.lifetime.end(i64 8, i8* %0) #3
  ret void
}

And the IR after the SROA pass:
; Function Attrs: nounwind
  br label %for.cond

for.cond:                                         ; preds = %for.body, %entry
  %ldd.sroa.0.0 = phi i32 [ 0, %entry ], [ %5, %for.body ]
  %ldd.sroa.6.0 = phi i32 [ 0, %entry ], [ %7, %for.body ]
  %i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
  %cmp = icmp slt i32 %i.0, %n
  br i1 %cmp, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %0 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 0), align 4
  %1 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 1), align 4
  %2 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd2, i64 0, i32 0, i32 0), align 4
  %3 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd2, i64 0, i32 0, i32 1), align 4
  %mul.i = fmul float %0, %2
  %mul4.i = fmul float %1, %3
  %sub.i = fsub float %mul.i, %mul4.i
  %mul5.i = fmul float %1, %2
  %mul6.i = fmul float %0, %3
  %add.i.4 = fadd float %mul5.i, %mul6.i
  %.fca.0.insert.i = insertvalue [2 x float] undef, float %sub.i, 0
  %.fca.1.insert.i = insertvalue [2 x float] %.fca.0.insert.i, float %add.i.4, 1
  %call.fca.0.extract = extractvalue [2 x float] %.fca.1.insert.i, 0
  %call.fca.1.extract = extractvalue [2 x float] %.fca.1.insert.i, 1
  %4 = bitcast i32 %ldd.sroa.0.0 to float
  %add.i = fadd float %call.fca.0.extract, %4
  %5 = bitcast float %add.i to i32
  %6 = bitcast i32 %ldd.sroa.6.0 to float
  %add4.i = fadd float %call.fca.1.extract, %6
  %7 = bitcast float %add4.i to i32
  %inc = add nsw i32 %i.0, 1
  br label %for.cond

for.end:                                          ; preds = %for.cond
  store i32 %ldd.sroa.0.0, i32* bitcast (%"struct.std::complex"* @dd to i32*), align 4
  store i32 %ldd.sroa.6.0, i32* bitcast (float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 1) to i32*), align 4
  ret void
}

So, the problem here is that the instcombine pass tries to remove the bitcast from i32 to float, but fails because of the phi node. Specifically, we have a combine that handles bitcasts of loads and bitcasts prior to stores, but we don't have anything that looks through phi-nodes. If you were to reg-to-mem the phi nodes so that they were replaced with loads and stores to an i32 alloca, I believe you would find that instcombine would remove the bitcasts and make all of these floating point.

The key will be to push this to handle phi nodes as well. This may be a bit trickier because you really want to handle even complex cycles of phi-nodes. Does this make sense? I'm familiar with this part of instcombine if it would be helpful for me to help you craft this change.

This updated patch improves the instcombine pass to handle the following bitcast case:

  TyA -> TyB bitcast
  PHI
  TyB -> TyA bitcast

The optimized code can skip the two bitcast operations and use TyA directly in later instructions. A test case is also added.

Can you also add a test case with multiple levels of phis?
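As background for the load/store combines mentioned above, InstCombine already rewrites a load whose value is only used through a bitcast, and a store whose stored value comes from a bitcast, to use the other type. A rough, hypothetical sketch of that existing behavior (names invented here):

define void @roundtrip(i32* %p, float %g) {
  %v = load i32, i32* %p, align 4
  %v.f = bitcast i32 %v to float
  %sum = fadd float %v.f, %g
  %sum.i = bitcast float %sum to i32
  store i32 %sum.i, i32* %p, align 4
  ret void
}

becomes, schematically:

define void @roundtrip(i32* %p, float %g) {
  %p.f = bitcast i32* %p to float*
  %v.f = load float, float* %p.f, align 4
  %sum = fadd float %v.f, %g
  store float %sum, float* %p.f, align 4
  ret void
}

The point of the patch is to get the same effect when a PHI sits between the loads/stores and the FP arithmetic, as in the SROA output above.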
Having David look at this is definitely a good thing. This will be a really nice change to instcombine, but it is in the tricky space of instcombines because it combines a full phi-node cycle at a time and has tricky oscillation challenges. Some comments below.
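To answer the request for multiple levels of phis, one possible shape for such a test (invented here, not the test added by the patch) chains the PHIs so the bitcasts sit several PHI hops apart, including around a loop:

define float @multi_phi(i1 %c, i32 %n, float %a, float %b, float %x) {
entry:
  %a.i = bitcast float %a to i32
  %b.i = bitcast float %b to i32
  br i1 %c, label %left, label %right

left:
  br label %merge

right:
  br label %merge

merge:
  %m = phi i32 [ %a.i, %left ], [ %b.i, %right ]
  br label %loop

loop:
  %acc = phi i32 [ %m, %merge ], [ %acc.next, %loop ]
  %i = phi i32 [ 0, %merge ], [ %i.next, %loop ]
  %acc.f = bitcast i32 %acc to float
  %sum = fadd float %acc.f, %x
  %acc.next = bitcast float %sum to i32
  %i.next = add nsw i32 %i, 1
  %cmp = icmp slt i32 %i.next, %n
  br i1 %cmp, label %loop, label %exit

exit:
  ret float %sum
}

Here the combine has to rewrite %m and %acc together, and %acc is part of a cycle through the loop backedge, before any of the bitcasts can be removed.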
This updated patch now handles load instructions.
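A hypothetical sketch of what the load handling means (invented IR; the exact mechanics in the patch may differ): an integer load feeding the PHI can be retyped, via the existing load/bitcast combine, so the PHI itself becomes floating point. For example:

define double @load_phi(i1 %c, i64* %p, i64* %q) {
entry:
  br i1 %c, label %left, label %right

left:
  %lv = load i64, i64* %p, align 8
  br label %merge

right:
  %rv = load i64, i64* %q, align 8
  br label %merge

merge:
  %m = phi i64 [ %lv, %left ], [ %rv, %right ]
  %m.d = bitcast i64 %m to double
  ret double %m.d
}

would end up, roughly, as:

left:
  %p.d = bitcast i64* %p to double*
  %lv = load double, double* %p.d, align 8
  br label %merge

right:
  %q.d = bitcast i64* %q to double*
  %rv = load double, double* %q.d, align 8
  br label %merge

merge:
  %m = phi double [ %lv, %left ], [ %rv, %right ]
  ret double %m

This is also why the simple/non-atomic restriction discussed later matters: if the loads are volatile or atomic they cannot be retyped, and any bitcasts inserted after them would just get combined back.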
FWIW, none of this discussion has been on the commits list, so it should probably be added so it can be seen there.

(Or abandon this revision and start fresh with a patch that has llvm-commits as a subscriber.)

I think this can be N**2 if there is a big tree of phi nodes with bitcasts at different levels of the tree. It might make sense to limit the total amount of work we are willing to do in this combine. How many items of work does the worklist typically process before firing?
In the regression tests there are 1.01 PHI nodes added to the worklist on average; the maximum is 3.

Silly improvement: you can probably constify all of the Type * variables. No need to repost, though.

I think the biggest question left is the potential quadratic behavior: I think you should look at a substantially larger test than the regression tests. For example, collecting the stats across a bootstrap of Clang itself at least?

In the bootstrap of Clang, the average number of visited PHI nodes is 1.047; the maximum is 14.

Given the data collected, all concerns have been addressed -- can an explicit LGTM be given so that it can be closed?

LGTM. Please be sure to mention [InstCombine] instead of [SROA] in the commit message.

Please take another look. I updated the patch a little to avoid changing volatile memory access instructions, since a volatile load/store with a bitcast can't be combined. The changes compared to the last version are:

Index: lib/Transforms/InstCombine/InstCombineCasts.cpp
+++ lib/Transforms/InstCombine/InstCombineCasts.cpp    (working copy)
       if (isa<Constant>(IncValue)) continue;

+      auto *LI = dyn_cast<LoadInst>(IncValue);
         continue;
       // If a LoadInst has more than one use, changing the type of loaded
       // value may create another bitcast.
@@ -1875,7 +1876,7 @@
     // If there is a store with type B, change it to type A.
     for (User *U : PN->users()) {
       auto *SI = dyn_cast<StoreInst>(U);
+      if (SI && SI->isSimple() && SI->getOperand(0) == PN) {
         Builder->SetInsertPoint(SI);
         SI->setOperand(0, Builder->CreateBitCast(NewPNodes[PN], SrcTy));
       }

This updated patch fixes https://llvm.org/bugs/show_bug.cgi?id=26918. The problem with the reverted patch is that, given the following code:

; <label>:2:                                      ; preds = %0
  %3 = load atomic i64, i64* %1 monotonic, align 8
  br label %8

; <label>:4:                                      ; preds = %0, %0
  %5 = load atomic i64, i64* %1 acquire, align 8
  br label %8

; <label>:6:                                      ; preds = %0
  %7 = load atomic i64, i64* %1 seq_cst, align 8
  br label %8

; <label>:8:                                      ; preds = %6, %4, %2
  %.in1097491 = phi i64 [ %3, %2 ], [ %7, %6 ], [ %5, %4 ]
  %9 = bitcast i64 %.in1097491 to double
  ret double %9

my patch changed it to:

; <label>:2:                                      ; preds = %0
  %3 = load atomic i64, i64* %1 monotonic, align 8
  %4 = bitcast i64 %3 to double
  br label %11

; <label>:5:                                      ; preds = %0, %0
  %6 = load atomic i64, i64* %1 acquire, align 8
  %7 = bitcast i64 %6 to double
  br label %11

; <label>:8:                                      ; preds = %0
  %9 = load atomic i64, i64* %1 seq_cst, align 8
  %10 = bitcast i64 %9 to double
  br label %11

; <label>:11:                                     ; preds = %8, %5, %2
  %12 = phi double [ %4, %2 ], [ %10, %8 ], [ %7, %5 ]
  %.in1097491 = phi i64 [ %3, %2 ], [ %9, %8 ], [ %6, %5 ]
  ret double %12

and expected the load/store combining to fold the load instructions with the following bitcasts. But load/store combining doesn't change volatile/atomic memory accesses, so the bitcast instructions are left there. Later, the phi combining transforms the code back to the original form, so the two combines transform the code back and forth indefinitely.

There are two changes compared to the original approved patch:

1. Avoid changing volatile/atomic memory accesses.
2. Add the impacted instructions to the combining worklist.

Please take another look.

I've added a patch, D20847, to fix the potential infinite loop for the case here. Can you please take a look?
Counting the number of static occurrences is not the best way, IMO. Why not use profile/frequency information for the uses?