This is an archive of the discontinued LLVM Phabricator instance.

[SROA] Choose more profitable type in findCommonType
ClosedPublic

Authored by Carrot on Nov 11 2015, 4:26 PM.

Details

Summary

This patch fixes PR25342.

This updated patch enhances InstCombineCasts to handle the following case, as suggested by Chandler:

A  ->  B    cast
PHI
B  ->  A    cast

so all bitcast operations can be removed.
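
As a minimal sketch of the pattern (hypothetical IR values, not taken from the bug):

  %a.i = bitcast float %a to i32                ; A -> B
  %phi = phi i32 [ %a.i, %bb1 ], [ 0, %entry ]
  %r   = bitcast i32 %phi to float              ; B -> A

After the combine, the PHI is recreated in type A, so both casts become dead:

  %phi.f = phi float [ %a, %bb1 ], [ 0.000000e+00, %entry ]
  ; uses of %r are replaced with %phi.f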

Diff Detail

Event Timeline

Carrot updated this revision to Diff 39976.Nov 11 2015, 4:26 PM
Carrot retitled this revision from to [SROA] Choose more profitable type in findCommonType.
Carrot updated this object.
Carrot added a subscriber: llvm-commits.
hfinkel edited edge metadata.Nov 24 2015, 3:06 PM

I'm undecided about this. Chandler, any thoughts?

If we add a DAGCombine that makes bitcast(load(addr)) into a load of a different type, and similar for stores, would that address all of the current motivating examples for this?
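
(For reference, the IR-level shape of that fold is roughly the following; the value names are made up:

  %v = load i32, i32* %p
  %f = bitcast i32 %v to float

becoming

  %pf = bitcast i32* %p to float*
  %f  = load float, float* %pf

)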

davidxl added a subscriber: davidxl.Dec 9 2015, 9:27 AM

There are other heuristics that can be applied here:

  1. Recognize the load/store pattern generated from lowered memcpy and 'ignore' use types from it when there are other use types;
  2. Go deeper in the use-type analysis: instead of stopping at the load instruction, follow the DU chain to find the real use type of the expanded values (as in arithmetic operations).
lib/Transforms/Scalar/SROA.cpp
1105

Counting the number of static occurrences is not the best way, IMO. Why not use profile/frequency information for the uses?

chandlerc edited edge metadata.Dec 9 2015, 3:34 PM

I was really confused about this patch until I read the bug. =]

For the record, bugs don't go to a widely followed mailing list, and so it is often helpful to provide a summary of the problem being faced when posting a patch for review in addition to a link to the bug. Otherwise it can be hard to figure out.

However, I don't think this is the correct fix. Regardless of what type SROA chooses to use, the optimizer should *always* fix the type when we discover that in the end loads and stores end up feeding floating point instructions. I worked really hard to fix this exact issue so that SROA could stay simple and ignorant of such issues, and so we could continue to lower memcpy as integer loads and stores.

It would really help to provide a reduced IR test case. The bug is full of C++ code and asm, but only has excerpts of the IR. The integer loads and stores should only exist within the function as long as there are no floating point operations on the loaded or stored type. The question is why that isn't proving to be the case, as there is code that tries to ensure that in the tree.

Here is a simple case (not yet minimized):

#include <complex>
using namespace std;
typedef complex<float> DD;
extern DD dd, dd2;

DD compute(DD &c1, DD &c2) {
  float r1 = c1.real();
  float i1 = c1.imag();
  float r2 = c2.real();
  float i2 = c2.imag();
  return DD(r1 * r2 - i1 * i2, r2 * i1 + r1 * i2);
}

void foo(int n) {
  int i;
  DD ldd = 0;

  for (i = 0; i < n; i++)
    ldd += compute(dd, dd2);

  dd = ldd;
}

On x86, the kernel loop generated by clang/LLVM is:

.LBB1_2: # =>This Inner Loop Header: Depth=1

movd    %eax, %xmm3
addss   %xmm1, %xmm3
movd    %xmm3, %eax
addss   %xmm0, %xmm2
decl    %edi
jne     .LBB1_2

The better one should be:

.L4:

vaddss  %xmm2, %xmm0, %xmm0
vaddss  %xmm3, %xmm1, %xmm1
decl  %edi
jne     .L4
Carrot added a comment.Dec 9 2015, 4:50 PM

I was really confused about this patch until I read the bug. =]

For the record, bugs don't go to a widely followed mailing list, and so it is often helpful to provide a summary of the problem being faced when posting a patch for review in addition to a link to the bug. Otherwise it can be hard to figure out.

However, I don't think this is the correct fix. Regardless of what type SROA chooses to use, the optimizer should *always* fix the type when we discover that in the end loads and stores end up feeding floating point instructions. I worked really hard to fix this exact issue so that SROA could stay simple and ignorant of such issues, and so we could continue to lower memcpy as integer loads and stores.

It would really help to provide a reduced IR test case. The bug is full of C++ code and asm, but only has excerpts of the IR. The integer loads and stores should only exist within the function as long as there are no floating point operations on the loaded or stored type. The question is why that isn't proving to be the case, as there is code that tries to ensure that in the tree.

The memcpy is lowered in the instruction combine pass, so you mean the root cause should be in this pass?

As I said, what would help me here is a reduced *IR* test case. I think C++ is too high level and clearly everything is broken by the time you get to x86.

Without a good IR test case, it's impossible for me to give a concrete path forward here.

The following is the IR of function foo in David's test case, after the last pass that still has the correct type for %ldd:

define void @_Z3fooi(i32 signext %n) #0 {
entry:

%ldd = alloca %"struct.std::complex", align 4
%ref.tmp = alloca %"struct.std::complex", align 4
%0 = bitcast %"struct.std::complex"* %ldd to i8*
call void @llvm.lifetime.start(i64 8, i8* %0) #3
call void @_ZNSt7complexIfEC2Eff(%"struct.std::complex"* %ldd, float 0.000000e+00, float 0.000000e+00)
br label %for.cond

for.cond: ; preds = %for.body, %entry

%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
%cmp = icmp slt i32 %i.0, %n
br i1 %cmp, label %for.body, label %for.end

for.body: ; preds = %for.cond

%call = call fastcc [2 x float] @_ZL7computeRSt7complexIfES1_()
%coerce.dive = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %ref.tmp, i32 0, i32 0 
%1 = bitcast { float, float }* %coerce.dive to [2 x float]*
%call.fca.0.gep = getelementptr inbounds [2 x float], [2 x float]* %1, i32 0, i32 0 
%call.fca.0.extract = extractvalue [2 x float] %call, 0
store float %call.fca.0.extract, float* %call.fca.0.gep
%call.fca.1.gep = getelementptr inbounds [2 x float], [2 x float]* %1, i32 0, i32 1 
%call.fca.1.extract = extractvalue [2 x float] %call, 1
store float %call.fca.1.extract, float* %call.fca.1.gep
%call1 = call dereferenceable(8) %"struct.std::complex"* @_ZNSt7complexIfEpLIfEERS0_RKS_IT_E(%"struct.std::complex"* %ldd, %"struct.std::complex"* dereferenceable(8) %ref.tmp)
%inc = add nsw i32 %i.0, 1
br label %for.cond

for.end: ; preds = %for.cond

call void @llvm.memcpy.p0i8.p0i8.i64(i8* bitcast (%"struct.std::complex"* @dd to i8*), i8* %0, i64 8, i32 4, i1 false)
call void @llvm.lifetime.end(i64 8, i8* %0) #3
ret void

}

In this function there are no explicit floating point operations; all of them are wrapped in function calls, so the combine pass actually works as intended and the memcpy is lowered to integer ld/st. But later the inlining pass inlines all the complex-number function calls, so the SROA pass sees both integer ld/st and fp operations.

You can probably create a smaller test case by simplifying the post-inline IR dump.

David

In this function there are no explicit floating point operations; all of them are wrapped in function calls, so the combine pass actually works as intended and the memcpy is lowered to integer ld/st. But later the inlining pass inlines all the complex-number function calls, so the SROA pass sees both integer ld/st and fp operations.

Yea, as David was somewhat saying, I think the interesting test case is the one after inlining where there are both integer and FP operations.

The IR after inlining:

*** IR Dump After Function Integration/Inlining ***

; Function Attrs: nounwind
define void @_Z3fooi(i32 signext %n) #0 {
entry:

%ldd = alloca i64, align 8
%tmpcast = bitcast i64* %ldd to %"struct.std::complex"*
%ref.tmp = alloca %"struct.std::complex", align 4
%0 = bitcast i64* %ldd to i8*
call void @llvm.lifetime.start(i64 8, i8* %0) #3
%_M_value.realp.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %tmpcast, i64 0, i32 0, i32 0 
store float 0.000000e+00, float* %_M_value.realp.i, align 4
%_M_value2.imagp.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %tmpcast, i64 0, i32 0, i32 1 
store float 0.000000e+00, float* %_M_value2.imagp.i, align 4
br label %for.cond

for.cond: ; preds = %for.body, %entry

%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
%cmp = icmp slt i32 %i.0, %n
br i1 %cmp, label %for.body, label %for.end

for.body: ; preds = %for.cond

%1 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 0), align 4
%2 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 1), align 4
%3 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd2, i64 0, i32 0, i32 0), align 4
%4 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd2, i64 0, i32 0, i32 1), align 4
%mul.i = fmul float %1, %3
%mul4.i = fmul float %2, %4
%sub.i = fsub float %mul.i, %mul4.i
%mul5.i = fmul float %2, %3
%mul6.i = fmul float %1, %4
%add.i.4 = fadd float %mul5.i, %mul6.i
%.fca.0.insert.i = insertvalue [2 x float] undef, float %sub.i, 0
%.fca.1.insert.i = insertvalue [2 x float] %.fca.0.insert.i, float %add.i.4, 1
%call.fca.0.gep = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %ref.tmp, i64 0, i32 0, i32 0 
%call.fca.0.extract = extractvalue [2 x float] %.fca.1.insert.i, 0
store float %call.fca.0.extract, float* %call.fca.0.gep, align 4
%5 = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %ref.tmp, i64 0, i32 0, i32 1 
%call.fca.1.extract = extractvalue [2 x float] %.fca.1.insert.i, 1
store float %call.fca.1.extract, float* %5, align 4
%_M_value.realp.i.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %ref.tmp, i64 0, i32 0, i32 0 
%6 = load float, float* %_M_value.realp.i.i, align 4
%_M_value.realp.i.3 = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %tmpcast, i64 0, i32 0, i32 0 
%7 = load float, float* %_M_value.realp.i.3, align 4
%add.i = fadd float %6, %7
store float %add.i, float* %_M_value.realp.i.3, align 4
%_M_value.imagp.i.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %ref.tmp, i64 0, i32 0, i32 1 
%8 = load float, float* %_M_value.imagp.i.i, align 4
%_M_value3.imagp.i = getelementptr inbounds %"struct.std::complex", %"struct.std::complex"* %tmpcast, i64 0, i32 0, i32 1 
%9 = load float, float* %_M_value3.imagp.i, align 4
%add4.i = fadd float %8, %9
store float %add4.i, float* %_M_value3.imagp.i, align 4
%inc = add nsw i32 %i.0, 1
br label %for.cond

for.end: ; preds = %for.cond

%10 = load i64, i64* %ldd, align 8
store i64 %10, i64* bitcast (%"struct.std::complex"* @dd to i64*), align 4
call void @llvm.lifetime.end(i64 8, i8* %0) #3
ret void

}

And the IR after the SROA pass:

*** IR Dump After SROA ***

; Function Attrs: nounwind
define void @_Z3fooi(i32 signext %n) #0 {
entry:

br label %for.cond

for.cond: ; preds = %for.body, %entry

%ldd.sroa.0.0 = phi i32 [ 0, %entry ], [ %5, %for.body ]
%ldd.sroa.6.0 = phi i32 [ 0, %entry ], [ %7, %for.body ]
%i.0 = phi i32 [ 0, %entry ], [ %inc, %for.body ]
%cmp = icmp slt i32 %i.0, %n
br i1 %cmp, label %for.body, label %for.end

for.body: ; preds = %for.cond

%0 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 0), align 4
%1 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 1), align 4
%2 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd2, i64 0, i32 0, i32 0), align 4
%3 = load float, float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd2, i64 0, i32 0, i32 1), align 4
%mul.i = fmul float %0, %2
%mul4.i = fmul float %1, %3
%sub.i = fsub float %mul.i, %mul4.i
%mul5.i = fmul float %1, %2
%mul6.i = fmul float %0, %3
%add.i.4 = fadd float %mul5.i, %mul6.i
%.fca.0.insert.i = insertvalue [2 x float] undef, float %sub.i, 0
%.fca.1.insert.i = insertvalue [2 x float] %.fca.0.insert.i, float %add.i.4, 1
%call.fca.0.extract = extractvalue [2 x float] %.fca.1.insert.i, 0
%call.fca.1.extract = extractvalue [2 x float] %.fca.1.insert.i, 1
%4 = bitcast i32 %ldd.sroa.0.0 to float 
%add.i = fadd float %call.fca.0.extract, %4
%5 = bitcast float %add.i to i32
%6 = bitcast i32 %ldd.sroa.6.0 to float 
%add4.i = fadd float %call.fca.1.extract, %6
%7 = bitcast float %add4.i to i32
%inc = add nsw i32 %i.0, 1
br label %for.cond

for.end: ; preds = %for.cond

store i32 %ldd.sroa.0.0, i32* bitcast (%"struct.std::complex"* @dd to i32*), align 4
store i32 %ldd.sroa.6.0, i32* bitcast (float* getelementptr inbounds (%"struct.std::complex", %"struct.std::complex"* @dd, i64 0, i32 0, i32 1) to i32*), align 4
ret void

}

So, the problem here is that the instcombine pass tries to remove the bitcast from i32 to float, but fails because of the phi node.

Specifically, we have a combine that handles bitcasts of loads and bitcasts prior to stores, but we don't have anything that looks through phi-nodes. If you were to reg-to-mem the phi nodes so that they were replaced with loads and stores to an i32 alloca, I believe you would find that instcombine would remove the bitcasts and make all of these floating point.

The key will be to push this to handle phi nodes as well. This may be a bit trickier because you really want to handle even complex cycles of phi-nodes. Does this make sense? I'm familiar with this part of instcombine if it would be helpful for me to help you craft this change.
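
To make the target concrete, for the SROA output above the desired end state is roughly the following (hand-written sketch, not compiler output; the ".f" names are invented):

  %ldd.sroa.0.0.f = phi float [ 0.000000e+00, %entry ], [ %add.i, %for.body ]
  %ldd.sroa.6.0.f = phi float [ 0.000000e+00, %entry ], [ %add4.i, %for.body ]
  ...
  %add.i = fadd float %call.fca.0.extract, %ldd.sroa.0.0.f
  %add4.i = fadd float %call.fca.1.extract, %ldd.sroa.6.0.f

with the i32 bitcasts in %for.body gone and the stores in %for.end performed as float stores.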

Carrot updated this revision to Diff 43814.Dec 30 2015, 4:46 PM
Carrot edited edge metadata.

This updated patch improves the instcombine pass to handle the following bitcast case:

TyA  ->  TyB    bitcast
PHI
TyB  ->  TyA    bitcast

The optimized code can skip the two bitcast operations, and use TyA directly in later instructions.

A test case is also added.

Can you also add a test case with multiple levels of phis?

lib/Transforms/InstCombine/InstCombineCasts.cpp
1795 ↗(On Diff #43814)

Add more documentation of the method:
Replace \p CI instruction with a new PHI ...

1797 ↗(On Diff #43814)

add comment -- type B

1798 ↗(On Diff #43814)

comment type A

1804 ↗(On Diff #43814)

Use the more common worklist pattern:

while (!Worklist.empty()) {
  PHINode *Item = Worklist.pop();
  if (!ResultSet.insert(Item))  // already processed
    continue;
  // process Item
  ...
  Worklist.insert(IncValue);
}

1842 ↗(On Diff #43814)

Use range-based iteration: for (auto OldPN : PNodes) { }

1845 ↗(On Diff #43814)

clang formatting with space:
for (unsigned J = 0, E = OldPN-> ...) {

Carrot updated this revision to Diff 46428.Jan 29 2016, 2:03 PM
Carrot updated this object.
Carrot marked 6 inline comments as done.
davidxl added inline comments.Feb 2 2016, 10:24 AM
lib/Transforms/InstCombine/InstCombineCasts.cpp
1846 ↗(On Diff #46428)

The two loops can be fused together.

Having David look at this is definitely a good thing. This will be a really nice change to instcombine, but it is in the tricky space of instcombines because it combines a full phi-node cycle at a time, and has tricky oscillation challenges.

Some comments below.

lib/Transforms/InstCombine/InstCombineCasts.cpp
1805 ↗(On Diff #46428)

nit: "Find out all the" -> "Find all of the".

The comments really make this clear, but I assume part of your intent here is to handle PHI-cycles, where a PHI node is one of its own inputs (or an indirect cycle)? It took me a while to realize that this would necessitate scanning across a worklist in this way to see if you can saturate the PHI graph with all non-PHI inputs and outputs being bitcasts or constants. I think this really would help to be laid out in some detail in a comment so that the reader of the code knows what the intended algorithm here is.

1823 ↗(On Diff #46428)

Make this an early return to reduce indentation.

1846 ↗(On Diff #46428)

This is hard, because in the current structure you might need a new phi node to use as the input to another phi node. And this can't be solved because phi nodes are inherently cyclic.

I think you will have to build all the nodes first, and then fill in their operands.

However, the current code has a problem where the iteration order isn't stable. I would suggest using a SetVector to build the list of phi nodes, and I would give it a better name than "visited" to make it clear that we're going to use it here to rewrite the nodes. Maybe just "PhiNodes" or "NodeSet" to make it clear that we're going to keep using this set past the worklist walk.

There is other code in instcombine that deals with phi node cycles, and it might be useful to check how they handle these patterns and name things (I don't know personally...)

1863–1864 ↗(On Diff #46428)

Hmm, if you handle stores in addition to bitcasts for phi node users, you should handle loads in addition to bitcasts for phi node operands.

However, you shouldn't need to rewrite the loads and stores. Instead, you should be able to introduce a bitcast for the loads and stores, and remove the other bitcasts. Then instcombine should trigger the existing code to fold the loads and stores through the casts.

The only thing to watch out for here are introducing cycles where we switch bitcasts from one place to another on phi nodes. I'm not sure how best to do that. Maybe David Majnemer has some ideas here?
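
A sketch of that shape (with hypothetical names): for a load feeding the phi, insert a cast right after the load; for a store of the phi, insert a cast right before the store, and let the existing load/store combines fold the types through:

  %v   = load i32, i32* %p
  %v.f = bitcast i32 %v to float      ; feeds the new float phi

  %s.i = bitcast float %phi.f to i32
  store i32 %s.i, i32* %q             ; later folded into a float store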

2002 ↗(On Diff #46428)

No need for the {'s

davidxl added inline comments.Feb 2 2016, 2:27 PM
lib/Transforms/InstCombine/InstCombineCasts.cpp
1846 ↗(On Diff #46428)

Right -- there is indeed a loop-carried dependency that I missed at first look.

Carrot updated this revision to Diff 47045.Feb 5 2016, 2:21 PM
Carrot marked 4 inline comments as done.

This updated patch now handles load instructions.

lib/Transforms/InstCombine/InstCombineCasts.cpp
1846 ↗(On Diff #46428)

I found the code in the function InstCombiner::SliceUpIllegalIntegerPHI; it has a similar code structure. It uses a SmallPtrSet variable named PHIsInspected, but that set is not used for other purposes.

I changed my variable to type SmallSetVector and named it "OldPhiNodes".

1863–1864 ↗(On Diff #46428)

For the load case, I only handle load instructions that have a single use. Otherwise the transformation may create another bitcast for the other uses, which may not be good for performance.
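
For example (hypothetical IR), given a load with two uses:

  %v = load i32, i32* %p              ; two uses
  %f = bitcast i32 %v to float        ; the use feeding the phi
  store i32 %v, i32* %q               ; a second, integer-typed use

retyping the load to float would require inserting a new float-to-i32 bitcast for the store, so such loads are left alone.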

FWIW none of this discussion has been on the commits list, so it should probably be added so it can be seen there.

(Or abandon this revision and start fresh with a patch that has llvm-commits as a subscriber)

Nevermind, I'm apparently blind.

majnemer edited edge metadata.Feb 9 2016, 11:21 AM

I think this can be N**2 if there is a big tree of phi nodes with bitcasts at different levels of the tree. It might make sense to limit the total amount of work we are willing to do for this combine.

How many items of work does the worklist typically process before firing?

lib/Transforms/InstCombine/InstCombineCasts.cpp
1803–1804 ↗(On Diff #47045)

Pointers lean right.

1821 ↗(On Diff #47045)

change -> changing

1826 ↗(On Diff #47045)

Please use auto *

1833 ↗(On Diff #47045)

Please use auto *

1847 ↗(On Diff #47045)

Pointers lean right.

1848 ↗(On Diff #47045)

Here, I'd use PHINode * or at least auto *.

1855 ↗(On Diff #47045)

Why not just use the IRBuilder? It should appropriately forward to the constant folder.

1856 ↗(On Diff #47045)

Here, I'd use PHINode * or at least auto *.

1861 ↗(On Diff #47045)

Please use auto *.

1863 ↗(On Diff #47045)

Please use auto *.

1866 ↗(On Diff #47045)

Please use auto *.

1868 ↗(On Diff #47045)

Ditto.

1878 ↗(On Diff #47045)

Ditto.

Carrot marked 13 inline comments as done.Feb 10 2016, 2:16 PM

I think this can be N**2 if there is a big tree of phi nodes with bitcasts at different levels of the tree. It might make sense to limit the total amount of work we are willing to do for this combine.

How many items of work does the worklist typically process before firing?

In the regression tests, 1.01 PHI nodes are added to the worklist on average; the maximum is 3.

Carrot updated this revision to Diff 47521.Feb 10 2016, 2:17 PM
Carrot edited edge metadata.

Reviewers, any more comments on the patch?

David

Silly improvement: Can probably constify all of the Type * variables. No need to repost though.

I think the biggest question left is the potential quadratic behavior:

I think this can be N**2 if there is a big tree of phi nodes with bitcasts at different levels of the tree. It might make sense to limit the total amount of work we are willing to do for this combine.

How many items of work does the worklist typically process before firing?

In the regression tests, 1.01 PHI nodes are added to the worklist on average; the maximum is 3.

I think you should look at a substantially larger test than regression testing. For example, collecting the stats across a bootstrap of Clang itself at least?

I think the biggest question left is the potential quadratic behavior:

I think this can be N**2 if there is a big tree of phi nodes with bitcasts at different levels of the tree. It might make sense to limit the total amount of work we are willing to do for this combine.

How many items of work does the worklist typically process before firing?

In the regression tests, 1.01 PHI nodes are added to the worklist on average; the maximum is 3.

I think you should look at a substantially larger test than regression testing. For example, collecting the stats across a bootstrap of Clang itself at least?

In the bootstrap of clang, the average number of visited PHI nodes is 1.047, the maximum is 14.

Given the data collected, all concerns have been addressed -- can an explicit LGTM be given so that it can be closed?

majnemer accepted this revision.Mar 2 2016, 2:31 PM
majnemer edited edge metadata.

LGTM

Please be sure to mention [InstCombine] instead of [SROA] in the commit message.

This revision is now accepted and ready to land.Mar 2 2016, 2:31 PM
kbarton added a subscriber: kbarton.Mar 9 2016, 7:26 AM
Carrot updated this revision to Diff 50211.Mar 9 2016, 3:53 PM
Carrot edited edge metadata.

Please take another look.

I updated the patch a little to avoid changing volatile memory access instructions, since a volatile ld/st with a bitcast can't be combined.

The changes compared to the last version are:

Index: lib/Transforms/InstCombine/InstCombineCasts.cpp
--- lib/Transforms/InstCombine/InstCombineCasts.cpp (revision 262910)
+++ lib/Transforms/InstCombine/InstCombineCasts.cpp (working copy)
@@ -1815,8 +1815,9 @@
     if (isa<Constant>(IncValue))
       continue;
-    if (isa<LoadInst>(IncValue)) {
-      if (IncValue->hasOneUse())
+    auto *LI = dyn_cast<LoadInst>(IncValue);
+    if (LI) {
+      if (LI->hasOneUse() && LI->isSimple())
         continue;
       // If a LoadInst has more than one use, changing the type of loaded
       // value may create another bitcast.
@@ -1875,7 +1876,7 @@
     // If there is a store with type B, change it to type A.
     for (User *U : PN->users()) {
       auto *SI = dyn_cast<StoreInst>(U);
-      if (SI && SI->getOperand(0) == PN) {
+      if (SI && SI->isSimple() && SI->getOperand(0) == PN) {
         Builder->SetInsertPoint(SI);
         SI->setOperand(0, Builder->CreateBitCast(NewPNodes[PN], SrcTy));
       }
Carrot updated this revision to Diff 50887.Mar 16 2016, 3:56 PM

This updated patch fixes https://llvm.org/bugs/show_bug.cgi?id=26918.

The problem with the reverted patch shows up when we have the following code:

; <label>:2: ; preds = %0

%3 = load atomic i64, i64* %1 monotonic, align 8
br label %8

; <label>:4: ; preds = %0, %0

%5 = load atomic i64, i64* %1 acquire, align 8
br label %8

; <label>:6: ; preds = %0

%7 = load atomic i64, i64* %1 seq_cst, align 8
br label %8

; <label>:8: ; preds = %6, %4, %2

%.in1097491 = phi i64 [ %3, %2 ], [ %7, %6 ], [ %5, %4 ]
%9 = bitcast i64 %.in1097491 to double
ret double %9

My patch changed it to:

; <label>:2: ; preds = %0

%3 = load atomic i64, i64* %1 monotonic, align 8
%4 = bitcast i64 %3 to double
br label %11

; <label>:5: ; preds = %0, %0

%6 = load atomic i64, i64* %1 acquire, align 8
%7 = bitcast i64 %6 to double
br label %11

; <label>:8: ; preds = %0

%9 = load atomic i64, i64* %1 seq_cst, align 8
%10 = bitcast i64 %9 to double
br label %11

; <label>:11: ; preds = %8, %5, %2

%12 = phi double [ %4, %2 ], [ %10, %8 ], [ %7, %5 ]
%.in1097491 = phi i64 [ %3, %2 ], [ %9, %8 ], [ %6, %5 ]
ret double %12

It expects the ld/st combine to fold the load instructions with the following bitcasts. But the ld/st combine doesn't change volatile/atomic memory accesses, so the bitcast instructions are left there. Later the phi combine transforms the code back to the original form, so the two combines transform the code back and forth indefinitely.

There are two changes compared to the originally approved patch:

1. Avoid changing volatile/atomic memory accesses.
2. Add the impacted instructions to the combine worklist.

Please take another look.

I've added patch D20847 to fix the potential infinite loop for the case here. Can you please take a look?

echristo closed this revision.Mar 30 2017, 12:38 AM

This has been committed for some time.