This is an archive of the discontinued LLVM Phabricator instance.

Paths

Table of Contentst

-
lib/Transforms/InstCombine/
-
Transforms/
-
InstCombine/
23/28
InstCombineCasts.cpp
-
InstCombineInternal.h
-
test/Transforms/InstCombine/
-
Transforms/
-
InstCombine/
-
pr25342.ll

Differential D14596

[SROA] Choose more profitable type in findCommonType
ClosedPublic

Authored by Carrot on Nov 11 2015, 4:26 PM.

Download Raw Diff

Details

Reviewers

chandlerc
majnemer
hfinkel

Summary

This patch fix PR25342.

This updated patch enhances InstCombineCasts to handle following case as suggested by Chandler:

A  ->  B    cast
PHI
B  ->  A    cast

so all bitcast operations can be removed.

Diff Detail

Event Timeline

Carrot updated this revision to Diff 39976.Nov 11 2015, 4:26 PM

Carrot retitled this revision from to [SROA] Choose more profitable type in findCommonType.

Carrot updated this object.

Carrot added a subscriber: llvm-commits.

Carrot added a reviewer: hfinkel.Nov 11 2015, 4:33 PM

Carrot added a reviewer: chandlerc.Nov 18 2015, 4:51 PM

I'm undecided about this. Chandler, any thoughts?

If we add a DAGCombine that makes bitcast(load(addr)) into a load of a different type, and similar for stores, would that address all of the current motivating examples for this?

davidxl added a subscriber: davidxl.Dec 9 2015, 9:27 AM

There are other heuristics that can be applied here:

recognize the load/store pattern generated from lowered memcpy and 'ignore' use types from them if there are other use types;
Go deeper in the use type analysis -- instead of stopping at the load instruction -- follow the DU chain to find the real use-type of the expanded values (as in arithmetic operations)

lib/Transforms/Scalar/SROA.cpp
1105 ↗	(On Diff #39976)	Counting the number of static occurrences is not the best way IMO. Why not using profile/frequency information of the uses?

I was really confused about this patch until I read the bug. =]

For the record, bugs don't go to a widely followed mailing list, and so it is often helpful to provide a summary of the problem being faced when posting a patch for review in addition to a link to the bug. Otherwise it can be hard to figure out.

However, I don't think this is the correct fix. Regardless of what type SROA chooses to use, the optimizer should *always* fix the type when we discover that in the end loads and stores end up feeding floating point instructions. I worked really hard to fix this exact issue so that SROA could stay simple and ignorant of such issues, and so we could continue to lower memcpy as integer loads and stores.

It would really help to provide a reduced IR test case. The bug is full of C++ code and asm, but only has excerpts of the IR. The integer loads and stores should only exist within the function as long as there are no floating point operations on the loaded or stored type. The question is why that isn't proving to be the case, as there is code that tries to ensure that in the tree.

Here is a simple case (not yet minimized):

#include <complex>
using namespace std;
typedef complex<float> DD;
extern DD dd, dd2;

DD compute(DD &c1, DD& c2){

float r1 = c1.real();
 float i1 = c1.imag();
 float r2 = c2.real();
 float i2 = c2.imag();
return DD(r1 * r2 - i1 * i2, r2 * i1 + r1 * i2);

}

void foo(int n) {

int i;
DD ldd = 0;

for (i = 0; i < n; i++)
  ldd += compute(dd, dd2);

dd = ldd;

}

On x86, the kernel loop generated by clang/LLVM is:

.LBB1_2: # =>This Inner Loop Header: Depth=1

movd    %eax, %xmm3
addss   %xmm1, %xmm3
movd    %xmm3, %eax
addss   %xmm0, %xmm2
decl    %edi
jne     .LBB1_2

The better one should be:

L4:

vaddss  %xmm2, %xmm0, %xmm0
vaddss  %xmm3, %xmm1, %xmm1
decl  %edi
jne     .L4

In D14596#306559, @chandlerc wrote:

I was really confused about this patch until I read the bug. =]

For the record, bugs don't go to a widely followed mailing list, and so it is often helpful to provide a summary of the problem being faced when posting a patch for review in addition to a link to the bug. Otherwise it can be hard to figure out.

However, I don't think this is the correct fix. Regardless of what type SROA chooses to use, the optimizer should *always* fix the type when we discover that in the end loads and stores end up feeding floating point instructions. I worked really hard to fix this exact issue so that SROA could stay simple and ignorant of such issues, and so we could continue to lower memcpy as integer loads and stores.

It would really help to provide a reduced IR test case. The bug is full of C++ code and asm, but only has excerpts of the IR. The integer loads and stores should only exist within the function as long as there are no floating point operations on the loaded or stored type. The question is why that isn't proving to be the case, as there is code that tries to ensure that in the tree.

The memcpy is lowered in instruction combine pass, so you mean the root cause should be in this pass?

As I said, what would help me here is a reduced *IR* test case. I think C++ is too high level and clearly everything is broken by the time you get to x86.

Without a good IR test case, its impossible for me to give a concrete path forward here.

Following is the IR of function foo in David's test case, after the last pass that still has the correct type for %ldd,

define void @_Z3fooi(i32 signext %n) #0 {
entry:

%ldd = alloca %"struct.std::complex", align 4
%ref.tmp = alloca %"struct.std::complex", align 4
%0 = bitcast %"struct.std::complex"* %ldd to i8*
call void @llvm.lifetime.start(i64 8, i8* %0) #3
call void @_ZNSt7complexIfEC2Eff(%"struct.std::complex"* %ldd, float 0.000000e+00, float 0.000000e+00)
br label %for.cond

for.cond: ; preds = %for.body, %entry

%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
%cmp = icmp slt i32 %i.0, %n
br i1 %cmp, label %for.body, label %for.end

for.body: ; preds = %for.cond

%call = call fastcc [2 x float] @_ZL7computeRSt7complexIfES1_()
%coerce.dive = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %ref.tmp, i32 0, i32 0 
%1 = bitcast { float, float }* %coerce.dive to [2 x float]*
%call.fca.0.gep = getelementptr inbounds [2 x float], [2 x float]* %1, i32 0, i32 0 
%call.fca.0.extract = extractvalue [2 x float] %call, 0
store float %call.fca.0.extract, float* %call.fca.0.gep
%call.fca.1.gep = getelementptr inbounds [2 x float], [2 x float]* %1, i32 0, i32 1 
%call.fca.1.extract = extractvalue [2 x float] %call, 1
store float %call.fca.1.extract, float* %call.fca.1.gep
%call1 = call dereferenceable(8) %"struct.std::complex"* @_ZNSt7complexIfEpLIfEERS0_RKS_IT_E(%"struct.std::complex"* %ldd, %"struct.std::complex"* dereferenceable(8) %ref.tmp)
%inc = add nsw i32 %i.0, 1
br label %for.cond

for.end: ; preds = %for.cond

call void @llvm.memcpy.p0i8.p0i8.i64(i8* bitcast (%"struct.std::complex"* @dd to i8*), i8* %0, i64 8, i32 4, i1 false)
call void @llvm.lifetime.end(i64 8, i8* %0) #3
ret void

}

In this function there is no explicit floating point operation, all of them are wrapped by function call. So combine pass actually works as intended. The memcpy is lowered to integer ld/st. But later in inlining pass all the complex number function calls are inlined, so in SROA pass it can see both integer ld/st and fp operations.

you can probably create a smaller test case by simplifying post-inline IR dump.

David

In D14596#311601, @Carrot wrote:

In this function there is no explicit floating point operation, all of them are wrapped by function call. So combine pass actually works as intended. The memcpy is lowered to integer ld/st. But later in inlining pass all the complex number function calls are inlined, so in SROA pass it can see both integer ld/st and fp operations.

Yea, as David was somewhat saying, I think the interesting test case is the one after inlining where there are both integer and FP operations.

The IR after inlining:

IR Dump After Function Integration/Inlining ***

; Function Attrs: nounwind
define void @_Z3fooi(i32 signext %n) #0 {
entry:

%ldd = alloca i64, align 8
%tmpcast = bitcast i64* %ldd to %"struct.std::complex"*
%ref.tmp = alloca %"struct.std::complex", align 4
%0 = bitcast i64* %ldd to i8*
call void @llvm.lifetime.start(i64 8, i8* %0) #3
%_M_value.realp.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %tmpcast, i64 0, i32 0, i32 0 
store float 0.000000e+00, float* %_M_value.realp.i, align 4
%_M_value2.imagp.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %tmpcast, i64 0, i32 0, i32 1 
store float 0.000000e+00, float* %_M_value2.imagp.i, align 4
br label %for.cond

for.cond: ; preds = %for.body, %entry

%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
%cmp = icmp slt i32 %i.0, %n
br i1 %cmp, label %for.body, label %for.end

for.body: ; preds = %for.cond

%1 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 0), align 4
%2 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 1), align 4
%3 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd2, i64 0, i32 0, i32 0), align 4
%4 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd2, i64 0, i32 0, i32 1), align 4
%mul.i = fmul float %1, %3
%mul4.i = fmul float %2, %4
%sub.i = fsub float %mul.i, %mul4.i
%mul5.i = fmul float %2, %3
%mul6.i = fmul float %1, %4
%add.i.4 = fadd float %mul5.i, %mul6.i
%.fca.0.insert.i = insertvalue [2 x float] undef, float %sub.i, 0
%.fca.1.insert.i = insertvalue [2 x float] %.fca.0.insert.i, float %add.i.4, 1
%call.fca.0.gep = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %ref.tmp, i64 0, i32 0, i32 0 
%call.fca.0.extract = extractvalue [2 x float] %.fca.1.insert.i, 0
store float %call.fca.0.extract, float* %call.fca.0.gep, align 4
%5 = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %ref.tmp, i64 0, i32 0, i32 1 
%call.fca.1.extract = extractvalue [2 x float] %.fca.1.insert.i, 1
store float %call.fca.1.extract, float* %5, align 4
%_M_value.realp.i.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %ref.tmp, i64 0, i32 0, i32 0 
%6 = load float, float* %_M_value.realp.i.i, align 4
%_M_value.realp.i.3 = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %tmpcast, i64 0, i32 0, i32 0 
%7 = load float, float* %_M_value.realp.i.3, align 4
%add.i = fadd float %6, %7
store float %add.i, float* %_M_value.realp.i.3, align 4
%_M_value.imagp.i.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %ref.tmp, i64 0, i32 0, i32 1 
%8 = load float, float* %_M_value.imagp.i.i, align 4
%_M_value3.imagp.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %tmpcast, i64 0, i32 0, i32 1 
%9 = load float, float* %_M_value3.imagp.i, align 4
%add4.i = fadd float %8, %9
store float %add4.i, float* %_M_value3.imagp.i, align 4
%inc = add nsw i32 %i.0, 1
br label %for.cond

for.end: ; preds = %for.cond

%10 = load i64, i64* %ldd, align 8
store i64 %10, i64* bitcast (%"struct.std::complex"* @dd to i64*), align 4
call void @llvm.lifetime.end(i64 8, i8* %0) #3
ret void

}

And the IR after SROA pass

IR Dump After SROA ***

; Function Attrs: nounwind
define void @_Z3fooi(i32 signext %n) #0 {
entry:

br label %for.cond

for.cond: ; preds = %for.body, %entry

%ldd.sroa.0.0 = phi i32 [ 0, %entry ], [ %5, %for.body ]
%ldd.sroa.6.0 = phi i32 [ 0, %entry ], [ %7, %for.body ]
%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
%cmp = icmp slt i32 %i.0, %n
br i1 %cmp, label %for.body, label %for.end

for.body: ; preds = %for.cond

%0 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 0), align 4
%1 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 1), align 4
%2 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd2, i64 0, i32 0, i32 0), align 4
%3 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd2, i64 0, i32 0, i32 1), align 4
%mul.i = fmul float %0, %2
%mul4.i = fmul float %1, %3
%sub.i = fsub float %mul.i, %mul4.i
%mul5.i = fmul float %1, %2
%mul6.i = fmul float %0, %3
%add.i.4 = fadd float %mul5.i, %mul6.i
%.fca.0.insert.i = insertvalue [2 x float] undef, float %sub.i, 0
%.fca.1.insert.i = insertvalue [2 x float] %.fca.0.insert.i, float %add.i.4, 1
%call.fca.0.extract = extractvalue [2 x float] %.fca.1.insert.i, 0
%call.fca.1.extract = extractvalue [2 x float] %.fca.1.insert.i, 1
%4 = bitcast i32 %ldd.sroa.0.0 to float 
%add.i = fadd float %call.fca.0.extract, %4
%5 = bitcast float %add.i to i32
%6 = bitcast i32 %ldd.sroa.6.0 to float 
%add4.i = fadd float %call.fca.1.extract, %6
%7 = bitcast float %add4.i to i32
%inc = add nsw i32 %i.0, 1
br label %for.cond

for.end: ; preds = %for.cond

store i32 %ldd.sroa.0.0, i32* bitcast (%"struct.std::complex"* @dd to i32*), align 4
store i32 %ldd.sroa.6.0, i32* bitcast (float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 1) to i32*), align 4
ret void

}

So, the problem here is that the instcombine pass tries to remove the bitcast from i32 to float, but fails because of the phi node.

Specifically, we have a combine that handles bitcasts of loads and bitcasts prior to stores, but we don't have anything that looks through phi-nodes. If you were to reg-to-mem the phi nodes so that they were replaced with loads and stores to an i32 alloca, I believe you would find that instcombine would remove the bitcasts and make all of these floating point.

The key will be to push this to handle phi nodes as well. This may be a bit trickier because you really want to handle even complex cycles of phi-nodes. Does this make sense? I'm familiar with this part of instcombine if it would be helpful for me to help you craft this change.

This updated patch improves inst combine pass to handle following bitcast case

TyA  ->  TyB    bitcast
PHI
TyB  ->  TyA    bitcast

The optimized code can skip the two bitcast operations, and use TyA directly in later instructions.

A test case is also added.

davidxl added a reviewer: majnemer.Jan 8 2016, 9:46 AM

ping

Can you also add a test case with multiple levels of phis?

lib/Transforms/InstCombine/InstCombineCasts.cpp
1795	Add more documentation of the method: Replace \p CI instruction with a new PHI ...
1797	add comment -- type B
1798	comment type A
1804	Use the more common worklist pattern: while (!Worklist.empty()) { PHINode *Item = Worklist.pop(); if (!ResultSet.insert(Item)) // already processed continue; // process Item ... Worklist.insert(IncValue); }
1842	use range based iteration : for (auto OldPN : PNodes) { }
1845	clang formatting with space: for (unsigned J = 0, E = OldPN-> ...) {

davidxl added inline comments.Feb 2 2016, 10:24 AM

lib/Transforms/InstCombine/InstCombineCasts.cpp
1846	the two loops can be fused together.

Having David look at this is definitely a good thing. This will be a really nice change to instcombine, but it is in the tricky space of instcombines because it combines a fully phi-node cycle at a time, and has tricky oscillation challenges.

Some comments below.

lib/Transforms/InstCombine/InstCombineCasts.cpp
1805	nit: "Find out all the" -> "Find all of the". The comments really make this clear, but I assume part of your intent here is to handle PHI-cycles, where a PHI node is one of its own inputs (or an indirect cycle)? It took me a while to realize that this would necessitate scanning across a worklist in this way to see if you can saturate the PHI graph with all non-PHI inputs and outputs being bitcasts or constants. I think this really would help to be laid out in some detail in a comment so that the reader of the code knows what the intended algorithm here is.
1823	Make this an early return to reduce indentation.
1846	This is hard, because in the current structure you might need a new phi node to use as the input to another phi node. And this can't be solved because phi nodes are inherently cyclic. I think you will have to build all the nodes first, and then fill in their operands. However, the current code has a problem where the iteration order isn't stable. I would suggest using a SetVector to build the list of phi nodes, and I would give it a better name than "visited" to make it clear that we're going to use it here to rewrite the nodes. Maybe just "PhiNodes" or "NodeSet" to make it clear that we're going to keep using this set past the worklist walk. There is other code in instcombine that deals with phi node cycles, and it might be useful to check how they handle these patterns and name things (I don't know personally...)
1863–1864	Hmm, if you handle stores in addition to bitcasts for phi node users, you should handle loads in addition to bitcasts for phi node operands. However, you shouldn't need to rewrite the loads and stores. Instead, you should be able to introduce a bitcast for the loads and stores, and remove the other bitcasts. Then instcombine should trigger the existing code to fold the loads and stores through the casts. The only thing to watch out for here are introducing cycles where we switch bitcasts from one place to another on phi nodes. I'm not sure how best to do that. Maybe David Majnemer has some ideas here?
2002	No need for the {'s

davidxl added inline comments.Feb 2 2016, 2:27 PM

lib/Transforms/InstCombine/InstCombineCasts.cpp
1846	Right -- there is indeed a loop carried dependency I missed at the first look.

This updated patch now handles load instruction.

lib/Transforms/InstCombine/InstCombineCasts.cpp
1846	I found the code in function InstCombiner::SliceUpIllegalIntegerPHI, it has similar code structure. It uses a SmallPtrSet variable named PHIsInspected, but it is not used for other purposes. I changed my var to type SmallSetVector, and named as "OldPhiNodes".
1863–1864	For the load case, I handle the load instruction that has only one use. Otherwise it may create another bitcast for other uses, it may not be good for performance.

FWIW none of this discussion has been on the commits list, so it should probably be added so it can be seen there.

(Or abandon this revision and start fresh with a patch that has llvm-commits as a subscriber)

Nevermind, I'm apparently blind.

I think this can be N**2 if there is a big tree of phi nodes with bitcasts at different levels of the tree. It might make sense to limit the total amount of work we are willing to do this combine.

How many items of work does the worklist typically process before firing?

lib/Transforms/InstCombine/InstCombineCasts.cpp
1803–1804	Pointers lean right.
1821	change -> changing
1826	Please use `auto *`
1833	Please use `auto *`
1847	Pointers lean right.
1848	Here, I'd use `PHINode ` or at least `auto `.
1855	Why not just use the IRBuilder? It should appropriately forward to the constant folder.
1856	Here, I'd use `PHINode ` or at least `auto `.
1861	Please use `auto *`.
1863	Please use `auto *`.
1866	Please use `auto *`.
1868	Ditto.
1878	Ditto.

nemanjai added a subscriber: nemanjai.Feb 10 2016, 12:21 AM

In D14596#347537, @majnemer wrote:

I think this can be N**2 if there is a big tree of phi nodes with bitcasts at different levels of the tree. It might make sense to limit the total amount of work we are willing to do this combine.

How many items of work does the worklist typically process before firing?

In the regression testing there are 1.01 PHI nodes added to the worklist in average, the maximum is 3.

Carrot updated this revision to Diff 47521.Feb 10 2016, 2:17 PM

Carrot edited edge metadata.

Reviewers, any more comments on the patch?

David

Silly improvement: Can probably constify all of the Type * variables. No need to repost though.

I think the biggest question left is the potential quadratic behavior:

In D14596#349033, @Carrot wrote:

In D14596#347537, @majnemer wrote:

I think this can be N**2 if there is a big tree of phi nodes with bitcasts at different levels of the tree. It might make sense to limit the total amount of work we are willing to do this combine.

How many items of work does the worklist typically process before firing?

In the regression testing there are 1.01 PHI nodes added to the worklist in average, the maximum is 3.

I think you should look at a substantially larger test than regression testing. For example, collecting the stats across a bootstrap of Clang itself at least?

In D14596#356293, @chandlerc wrote:

I think the biggest question left is the potential quadratic behavior:

In D14596#349033, @Carrot wrote:

In D14596#347537, @majnemer wrote:

I think this can be N**2 if there is a big tree of phi nodes with bitcasts at different levels of the tree. It might make sense to limit the total amount of work we are willing to do this combine.

How many items of work does the worklist typically process before firing?

In the regression testing there are 1.01 PHI nodes added to the worklist in average, the maximum is 3.

I think you should look at a substantially larger test than regression testing. For example, collecting the stats across a bootstrap of Clang itself at least?

In the bootstrap of clang, the average number of visited PHI nodes is 1.047, the maximum is 14.

Given the data collected, all concerns have been addressed -- can an explicit LGTM be given so that it can be closed?

Ping

LGTM

Please be sure to mention [InstCombine] instead of [SROA] in the commit message.

This revision is now accepted and ready to land.Mar 2 2016, 2:31 PM

kbarton added a subscriber: kbarton.Mar 9 2016, 7:26 AM

Please take another look.

I updated the patch a little to avoid change volatile memory access instructions, since volatile ld/st with bitcast can't be combined.

The changes compared to last version is:

Index: lib/Transforms/InstCombine/InstCombineCasts.cpp

lib/Transforms/InstCombine/InstCombineCasts.cpp (revision 262910)

+++ lib/Transforms/InstCombine/InstCombineCasts.cpp (working copy)
@@ -1815,8 +1815,9 @@

if (isa<Constant>(IncValue))
  continue;

if (isa<LoadInst>(IncValue)) {
if (IncValue->hasOneUse())

+ auto *LI = dyn_cast<LoadInst>(IncValue);
+ if (LI) {
+ if (LI->hasOneUse() && LI->isSimple())

  continue;
// If a LoadInst has more than one use, changing the type of loaded
// value may create another bitcast.

@@ -1875,7 +1876,7 @@

// If there is a store with type B, change it to type A.
for (User *U : PN->users()) {
  auto *SI = dyn_cast<StoreInst>(U);

if (SI && SI->getOperand(0) == PN) {

+ if (SI && SI->isSimple() && SI->getOperand(0) == PN) {

  Builder->SetInsertPoint(SI);
  SI->setOperand(0, Builder->CreateBitCast(NewPNodes[PN], SrcTy));
}

This updated patch fixed https://llvm.org/bugs/show_bug.cgi?id=26918.

The problem of the reverted patch is when we have following code

; <label>:2: ; preds = %0

%3 = load atomic i64, i64* %1 monotonic, align 8
br label %8

; <label>:4: ; preds = %0, %0

%5 = load atomic i64, i64* %1 acquire, align 8
br label %8

; <label>:6: ; preds = %0

%7 = load atomic i64, i64* %1 seq_cst, align 8
br label %8

; <label>:8: ; preds = %6, %4, %2

%.in1097491 = phi i64 [ %3, %2 ], [ %7, %6 ], [ %5, %4 ]
%9 = bitcast i64 %.in1097491 to double
ret double %9

My patch changed it to

; <label>:2: ; preds = %0

%3 = load atomic i64, i64* %1 monotonic, align 8
%4 = bitcast i64 %3 to double
br label %11

; <label>:5: ; preds = %0, %0

%6 = load atomic i64, i64* %1 acquire, align 8
%7 = bitcast i64 %6 to double
br label %11

; <label>:8: ; preds = %0

%9 = load atomic i64, i64* %1 seq_cst, align 8
%10 = bitcast i64 %9 to double
br label %11

; <label>:11: ; preds = %8, %5, %2

%12 = phi double [ %4, %2 ], [ %10, %8 ], [ %7, %5 ]
%.in1097491 = phi i64 [ %3, %2 ], [ %9, %8 ], [ %6, %5 ]
ret double %12

And expects ld/st combining can combine the load instructions with following bitcast. But ld/st combining doesn't change volatile/atomic memory access, so the bitcast instructions are left there. Later phi combining transformed it back to the original form, so the two combining transforms the code back and forth indefinitely.

There are two changes compared to the original approved patch:

1 avoid changing volatile/atomic memory access
2 add the impacted instructions to combining worklist

Please take another look.

LGTM

evstupac mentioned this in D20847: BitCast infinite loop fix.May 31 2016, 6:23 PM

I've added a patch D20847 to fix potential infinite loop for the case here. Can you please take a look?

This has been committed for some time.

Revision Contents

Path

Size

lib/

Transforms/

InstCombine/

InstCombineCasts.cpp

106 lines

InstCombineInternal.h

1 line

test/

Transforms/

InstCombine/

pr25342.ll

93 lines

Diff 50887

lib/Transforms/InstCombine/InstCombineCasts.cpp

Context not available.
	//===----------------------------------------------------------------------===//	//===----------------------------------------------------------------------===//

	#include "InstCombineInternal.h"	#include "InstCombineInternal.h"
		#include "llvm/ADT/SetVector.h"
	#include "llvm/Analysis/ConstantFolding.h"	#include "llvm/Analysis/ConstantFolding.h"
	#include "llvm/IR/DataLayout.h"	#include "llvm/IR/DataLayout.h"
	#include "llvm/IR/PatternMatch.h"	#include "llvm/IR/PatternMatch.h"
Context not available.
	return ExtractElementInst::Create(NewBC, ExtElt->getIndexOperand());	return ExtractElementInst::Create(NewBC, ExtElt->getIndexOperand());
	}	}

		/// This function handles following case
		///
		/// A -> B cast
		/// PHI
		/// B -> A cast
		///
		davidxlUnsubmitted Done Reply Inline Actions Add more documentation of the method: Replace \p CI instruction with a new PHI ... davidxl: Add more documentation of the method: Replace \p CI instruction with a new PHI ...
		/// All the related PHI nodes can be replaced by new PHI nodes with type A.
		/// The uses of \p CI can be changed to the new PHI node corresponding to \p PN.
		davidxlUnsubmitted Done Reply Inline Actions add comment -- type B davidxl: add comment -- type B
		Instruction InstCombiner::optimizeBitCastFromPhi(CastInst &CI, PHINode PN) {
		davidxlUnsubmitted Done Reply Inline Actions comment type A davidxl: comment type A
		Value *Src = CI.getOperand(0);
		Type *SrcTy = Src->getType(); // Type B
		Type *DestTy = CI.getType(); // Type A

		SmallVector<PHINode *, 4> PhiWorklist;
		SmallSetVector<PHINode *, 4> OldPhiNodes;
		davidxlUnsubmitted Done Reply Inline Actions Use the more common worklist pattern: while (!Worklist.empty()) { PHINode Item = Worklist.pop(); if (!ResultSet.insert(Item)) // already processed continue; // process Item ... Worklist.insert(IncValue); } davidxl:* Use the more common worklist pattern: while (!Worklist.empty()) { PHINode *Item = Worklist.
		majnemerUnsubmitted Done Reply Inline Actions Pointers lean right. majnemer: Pointers lean right.

		chandlercUnsubmitted Done Reply Inline Actions nit: "Find out all the" -> "Find all of the". The comments really make this clear, but I assume part of your intent here is to handle PHI-cycles, where a PHI node is one of its own inputs (or an indirect cycle)? It took me a while to realize that this would necessitate scanning across a worklist in this way to see if you can saturate the PHI graph with all non-PHI inputs and outputs being bitcasts or constants. I think this really would help to be laid out in some detail in a comment so that the reader of the code knows what the intended algorithm here is. chandlerc: nit: "Find out all the" -> "Find all of the". The comments really make this clear, but I…
		// Find all of the A->B casts and PHI nodes.
		// We need to inpect all related PHI nodes, but PHIs can be cyclic, so
		// OldPhiNodes is used to track all known PHI nodes, before adding a new
		// PHI to PhiWorklist, it is checked against and added to OldPhiNodes first.
		PhiWorklist.push_back(PN);
		OldPhiNodes.insert(PN);
		while (!PhiWorklist.empty()) {
		auto *OldPN = PhiWorklist.pop_back_val();
		for (Value *IncValue : OldPN->incoming_values()) {
		if (isa<Constant>(IncValue))
		continue;

		auto *LI = dyn_cast<LoadInst>(IncValue);
		if (LI) {
		if (LI->hasOneUse() && LI->isSimple())
		continue;
		majnemerUnsubmitted Done Reply Inline Actions change -> changing majnemer: change -> changing
		// If a LoadInst has more than one use, changing the type of loaded
		// value may create another bitcast.
		chandlercUnsubmitted Done Reply Inline Actions Make this an early return to reduce indentation. chandlerc: Make this an early return to reduce indentation.
		return nullptr;
		}

		majnemerUnsubmitted Done Reply Inline Actions Please use `auto ` majnemer:* Please use `auto *`
		auto *PNode = dyn_cast<PHINode>(IncValue);
		if (PNode) {
		if (OldPhiNodes.insert(PNode))
		PhiWorklist.push_back(PNode);
		continue;
		}

		majnemerUnsubmitted Done Reply Inline Actions Please use `auto ` majnemer:* Please use `auto *`
		auto *BCI = dyn_cast<BitCastInst>(IncValue);
		// We can't handle other instructions.
		if (!BCI)
		return nullptr;

		// Verify it's a A->B cast.
		Type *TyA = BCI->getOperand(0)->getType();
		Type *TyB = BCI->getType();
		if (TyA != DestTy \|\| TyB != SrcTy)
		davidxlUnsubmitted Done Reply Inline Actions use range based iteration : for (auto OldPN : PNodes) { } davidxl: use range based iteration : for (auto OldPN : PNodes) { }
		return nullptr;
		}
		}
		davidxlUnsubmitted Done Reply Inline Actions clang formatting with space: for (unsigned J = 0, E = OldPN-> ...) { davidxl: clang formatting with space: for (unsigned J = 0, E = OldPN-> ...) {

		davidxlUnsubmitted Not Done Reply Inline Actions the two loops can be fused together. davidxl: the two loops can be fused together.
		chandlercUnsubmitted Not Done Reply Inline Actions This is hard, because in the current structure you might need a new phi node to use as the input to another phi node. And this can't be solved because phi nodes are inherently cyclic. I think you will have to build all the nodes first, and then fill in their operands. However, the current code has a problem where the iteration order isn't stable. I would suggest using a SetVector to build the list of phi nodes, and I would give it a better name than "visited" to make it clear that we're going to use it here to rewrite the nodes. Maybe just "PhiNodes" or "NodeSet" to make it clear that we're going to keep using this set past the worklist walk. There is other code in instcombine that deals with phi node cycles, and it might be useful to check how they handle these patterns and name things (I don't know personally...) chandlerc: This is hard, because in the current structure you might need a new phi node to use as the…
		davidxlUnsubmitted Not Done Reply Inline Actions Right -- there is indeed a loop carried dependency I missed at the first look. davidxl: Right -- there is indeed a loop carried dependency I missed at the first look.
		CarrotAuthorUnsubmitted Not Done Reply Inline Actions I found the code in function InstCombiner::SliceUpIllegalIntegerPHI, it has similar code structure. It uses a SmallPtrSet variable named PHIsInspected, but it is not used for other purposes. I changed my var to type SmallSetVector, and named as "OldPhiNodes". Carrot: I found the code in function InstCombiner::SliceUpIllegalIntegerPHI, it has similar code…
		// For each old PHI node, create a corresponding new PHI node with a type A.
		majnemerUnsubmitted Done Reply Inline Actions Pointers lean right. majnemer: Pointers lean right.
		SmallDenseMap<PHINode , PHINode > NewPNodes;
		majnemerUnsubmitted Done Reply Inline Actions Here, I'd use `PHINode ` or at least `auto `. majnemer: Here, I'd use `PHINode ` or at least `auto `.
		for (auto *OldPN : OldPhiNodes) {
		Builder->SetInsertPoint(OldPN);
		PHINode *NewPN = Builder->CreatePHI(DestTy, OldPN->getNumOperands());
		NewPNodes[OldPN] = NewPN;
		}

		// Fill in the operands of new PHI nodes.
		majnemerUnsubmitted Done Reply Inline Actions Why not just use the IRBuilder? It should appropriately forward to the constant folder. majnemer: Why not just use the IRBuilder? It should appropriately forward to the constant folder.
		for (auto *OldPN : OldPhiNodes) {
		majnemerUnsubmitted Done Reply Inline Actions Here, I'd use `PHINode ` or at least `auto `. majnemer: Here, I'd use `PHINode ` or at least `auto `.
		PHINode *NewPN = NewPNodes[OldPN];
		for (unsigned j = 0, e = OldPN->getNumOperands(); j != e; ++j) {
		Value *V = OldPN->getOperand(j);
		Value *NewV = nullptr;
		if (auto *C = dyn_cast<Constant>(V)) {
		majnemerUnsubmitted Done Reply Inline Actions Please use `auto `. majnemer:* Please use `auto *`.
		NewV = Builder->CreateBitCast(C, DestTy);
		} else if (auto *LI = dyn_cast<LoadInst>(V)) {
		majnemerUnsubmitted Done Reply Inline Actions Please use `auto `. majnemer:* Please use `auto *`.
		Builder->SetInsertPoint(OldPN->getIncomingBlock(j)->getTerminator());
		chandlercUnsubmitted Done Reply Inline Actions Hmm, if you handle stores in addition to bitcasts for phi node users, you should handle loads in addition to bitcasts for phi node operands. However, you shouldn't need to rewrite the loads and stores. Instead, you should be able to introduce a bitcast for the loads and stores, and remove the other bitcasts. Then instcombine should trigger the existing code to fold the loads and stores through the casts. The only thing to watch out for here are introducing cycles where we switch bitcasts from one place to another on phi nodes. I'm not sure how best to do that. Maybe David Majnemer has some ideas here? chandlerc: Hmm, if you handle stores in addition to bitcasts for phi node users, you should handle loads…
		CarrotAuthorUnsubmitted Not Done Reply Inline Actions For the load case, I handle the load instruction that has only one use. Otherwise it may create another bitcast for other uses, it may not be good for performance. Carrot: For the load case, I handle the load instruction that has only one use. Otherwise it may create…
		NewV = Builder->CreateBitCast(LI, DestTy);
		Worklist.Add(LI);
		majnemerUnsubmitted Done Reply Inline Actions Please use `auto `. majnemer:* Please use `auto *`.
		} else if (auto *BCI = dyn_cast<BitCastInst>(V)) {
		NewV = BCI->getOperand(0);
		majnemerUnsubmitted Done Reply Inline Actions Ditto. majnemer: Ditto.
		} else if (auto *PrevPN = dyn_cast<PHINode>(V)) {
		NewV = NewPNodes[PrevPN];
		}
		assert(NewV);
		NewPN->addIncoming(NewV, OldPN->getIncomingBlock(j));
		}
		}

		// If there is a store with type B, change it to type A.
		for (User *U : PN->users()) {
		majnemerUnsubmitted Done Reply Inline Actions Ditto. majnemer: Ditto.
		auto *SI = dyn_cast<StoreInst>(U);
		if (SI && SI->isSimple() && SI->getOperand(0) == PN) {
		Builder->SetInsertPoint(SI);
		SI->setOperand(0, Builder->CreateBitCast(NewPNodes[PN], SrcTy));
		Worklist.Add(SI);
		}
		}

		return replaceInstUsesWith(CI, NewPNodes[PN]);
		}

	Instruction *InstCombiner::visitBitCast(BitCastInst &CI) {	Instruction *InstCombiner::visitBitCast(BitCastInst &CI) {
	// If the operands are integer typed then apply the integer transforms,	// If the operands are integer typed then apply the integer transforms,
	// otherwise just apply the common ones.	// otherwise just apply the common ones.
		chandlercUnsubmitted Done Reply Inline Actions No need for the {'s chandlerc: No need for the {'s
Context not available.
	}	}
	}	}

		// Handle the A->B->A cast, and there is an intervening PHI node.
		if (PHINode *PN = dyn_cast<PHINode>(Src))
		if (Instruction *I = optimizeBitCastFromPhi(CI, PN))
		return I;

	if (Instruction I = canonicalizeBitCastExtElt(CI, this, DL))	if (Instruction I = canonicalizeBitCastExtElt(CI, this, DL))
	return I;	return I;

Context not available.

lib/Transforms/InstCombine/InstCombineInternal.h

Context not available.
	Instruction scalarizePHI(ExtractElementInst &EI, PHINode PN);	Instruction scalarizePHI(ExtractElementInst &EI, PHINode PN);
	Value EvaluateInDifferentElementOrder(Value V, ArrayRef<int> Mask);	Value EvaluateInDifferentElementOrder(Value V, ArrayRef<int> Mask);
	Instruction *foldCastedBitwiseLogic(BinaryOperator &I);	Instruction *foldCastedBitwiseLogic(BinaryOperator &I);
		Instruction optimizeBitCastFromPhi(CastInst &CI, PHINode PN);

	public:	public:
	/// \brief Inserts an instruction \p New before instruction \p Old	/// \brief Inserts an instruction \p New before instruction \p Old
Context not available.

test/Transforms/InstCombine/pr25342.ll

				; RUN: opt < %s -instcombine -S \| FileCheck %s

				%"struct.std::complex" = type { { float, float } }
				@dd = external global %"struct.std::complex", align 4
				@dd2 = external global %"struct.std::complex", align 4

				define void @_Z3fooi(i32 signext %n) {
				entry:
				br label %for.cond

				for.cond:
				%ldd.sroa.0.0 = phi i32 [ 0, %entry ], [ %5, %for.body ]
				%ldd.sroa.6.0 = phi i32 [ 0, %entry ], [ %7, %for.body ]
				%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
				%cmp = icmp slt i32 %i.0, %n
				br i1 %cmp, label %for.body, label %for.end

				for.body:
				%0 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 0), align 4
				%1 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 1), align 4
				%2 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd2, i64 0, i32 0, i32 0), align 4
				%3 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd2, i64 0, i32 0, i32 1), align 4
				%mul.i = fmul float %0, %2
				%mul4.i = fmul float %1, %3
				%sub.i = fsub float %mul.i, %mul4.i
				%mul5.i = fmul float %1, %2
				%mul6.i = fmul float %0, %3
				%add.i4 = fadd float %mul5.i, %mul6.i
				%4 = bitcast i32 %ldd.sroa.0.0 to float
				%add.i = fadd float %sub.i, %4
				%5 = bitcast float %add.i to i32
				%6 = bitcast i32 %ldd.sroa.6.0 to float
				%add4.i = fadd float %add.i4, %6
				%7 = bitcast float %add4.i to i32
				%inc = add nsw i32 %i.0, 1
				br label %for.cond

				for.end:
				store i32 %ldd.sroa.0.0, i32* bitcast (%"struct.std::complex"* @dd to i32*), align 4
				store i32 %ldd.sroa.6.0, i32* bitcast (float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 1) to i32*), align 4
				ret void

				; CHECK: phi float
				; CHECK: store float
				; CHECK-NOT: bitcast
				}


				define void @multi_phi(i32 signext %n) {
				entry:
				br label %for.cond

				for.cond:
				%ldd.sroa.0.0 = phi i32 [ 0, %entry ], [ %9, %odd.bb ]
				%i.0 = phi i32 [ 0, %entry ], [ %inc, %odd.bb ]
				%cmp = icmp slt i32 %i.0, %n
				br i1 %cmp, label %for.body, label %for.end

				for.body:
				%0 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 0), align 4
				%1 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 1), align 4
				%2 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd2, i64 0, i32 0, i32 0), align 4
				%3 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd2, i64 0, i32 0, i32 1), align 4
				%mul.i = fmul float %0, %2
				%mul4.i = fmul float %1, %3
				%sub.i = fsub float %mul.i, %mul4.i
				%4 = bitcast i32 %ldd.sroa.0.0 to float
				%add.i = fadd float %sub.i, %4
				%5 = bitcast float %add.i to i32
				%inc = add nsw i32 %i.0, 1
				%bit0 = and i32 %inc, 1
				%even = icmp slt i32 %bit0, 1
				br i1 %even, label %even.bb, label %odd.bb

				even.bb:
				%6 = bitcast i32 %5 to float
				%7 = fadd float %sub.i, %6
				%8 = bitcast float %7 to i32
				br label %odd.bb

				odd.bb:
				%9 = phi i32 [ %5, %for.body ], [ %8, %even.bb ]
				br label %for.cond

				for.end:
				store i32 %ldd.sroa.0.0, i32* bitcast (%"struct.std::complex"* @dd to i32*), align 4
				ret void

				; CHECK-LABEL: @multi_phi(
				; CHECK: phi float
				; CHECK: store float
				; CHECK-NOT: bitcast
				}

This is an archive of the discontinued LLVM Phabricator instance.

[SROA] Choose more profitable type in findCommonTypeClosedPublic

Details

Diff Detail

Event Timeline

Index: lib/Transforms/InstCombine/InstCombineCasts.cpp

Revision Contents

Diff 50887

lib/Transforms/InstCombine/InstCombineCasts.cpp

lib/Transforms/InstCombine/InstCombineInternal.h

test/Transforms/InstCombine/pr25342.ll

[SROA] Choose more profitable type in findCommonType
ClosedPublic