This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] PR35354: Convert load bitcast (select (Cond, &V1, &V2)) --> select(Cond, load bitcast &V1, load bitcast &V2)
ClosedPublic

Authored by ABataev on Nov 21 2017, 7:06 AM.

Details

Summary

If we have the code like this:

float a, b;
a = std::max(a ,b);

it is converted into something like this:

%call = call dereferenceable(4) float* @_ZSt3maxIfERKT_S2_S2_(float* nonnull dereferenceable(4) %a.addr, float* nonnull dereferenceable(4) %b.addr)
%1 = bitcast float* %call to i32*
%2 = load i32, i32* %1, align 4
%3 = bitcast float* %a.addr to i32*
store i32 %2, i32* %3, align 4

After inlinning this code is converted to the next:

%1 = load float, float* %a.addr
%2 = load float, float* %b.addr
%cmp.i = fcmp fast olt float %1, %2
%__b.__a.i = select i1 %cmp.i, float* %a.addr, float* %b.addr
%3 = bitcast float* %__b.__a.i to i32*
%4 = load i32, i32* %3, align 4
%5 = bitcast float* %arrayidx to i32*
store i32 %4, i32* %5, align 4

This pattern is not recognized as minmax pattern.
Patch solves this problem by converting sequence

load bitcast (select (Cond, &V1, &V2))

to a sequence

select(Cond, load bitcast &V1, load bitcast &V2)

After this the code is recognized as minmax pattern.

Event Timeline

ABataev created this revision.Nov 21 2017, 7:06 AM
spatel edited edge metadata.Nov 21 2017, 7:45 AM

I’m not at a dev machine so can’t review this currently, but is this the same as
https://bugs.llvm.org/show_bug.cgi?id=34603 ?

I’m not at a dev machine so can’t review this currently, but is this the same as
https://bugs.llvm.org/show_bug.cgi?id=34603 ?

No, it is a bit different. We don't have the GEP from select here, we have load (bitcast float* to i32* (select)) because of InstCombiner.
I convert this to select (load (bitcast float* to i32 *), (load (bitcast float* to i32*))

ABataev updated this revision to Diff 123809.Nov 21 2017, 8:03 AM

Added check that bitcasted address has only one use.

This patch is building on a transform that is already suspect (see discussion starting at https://bugs.llvm.org/show_bug.cgi?id=34603#c6 ).

In conjunction with the existing transform, it increases the instruction count from 3 to 5 for this example:

define i32 @store_bitcasted_load(i1 %cond, float* dereferenceable(4) %addr1, float* dereferenceable(4) %addr2) {
  %sel = select i1 %cond, float* %addr1, float* %addr2
  %bc1 = bitcast float* %sel to i32*
  %ld = load i32, i32* %bc1
  ret i32 %ld
}

$ ./opt -instcombine loadsel.ll -S

define i32 @store_bitcasted_load(i1 %cond, float* dereferenceable(4) %addr1, float* dereferenceable(4) %addr2) {
  %addr1.cast = bitcast float* %addr1 to i32*
  %addr2.cast = bitcast float* %addr2 to i32*
  %addr1.cast.val = load i32, i32* %addr1.cast, align 4
  %addr2.cast.val = load i32, i32* %addr2.cast, align 4
  %ld = select i1 %cond, i32 %addr1.cast.val, i32 %addr2.cast.val
  ret i32 %ld
}

Can you provide a reduced C++ source example for how we got to this IR? I couldn't repro with a simple case, so I must be missing some part of it.

Also, there might be something in common with:
https://bugs.llvm.org/show_bug.cgi?id=35354
or
https://bugs.llvm.org/show_bug.cgi?id=35284
?

This patch is building on a transform that is already suspect (see discussion starting at https://bugs.llvm.org/show_bug.cgi?id=34603#c6 ).

In conjunction with the existing transform, it increases the instruction count from 3 to 5 for this example:

define i32 @store_bitcasted_load(i1 %cond, float* dereferenceable(4) %addr1, float* dereferenceable(4) %addr2) {
  %sel = select i1 %cond, float* %addr1, float* %addr2
  %bc1 = bitcast float* %sel to i32*
  %ld = load i32, i32* %bc1
  ret i32 %ld
}

$ ./opt -instcombine loadsel.ll -S

define i32 @store_bitcasted_load(i1 %cond, float* dereferenceable(4) %addr1, float* dereferenceable(4) %addr2) {
  %addr1.cast = bitcast float* %addr1 to i32*
  %addr2.cast = bitcast float* %addr2 to i32*
  %addr1.cast.val = load i32, i32* %addr1.cast, align 4
  %addr2.cast.val = load i32, i32* %addr2.cast, align 4
  %ld = select i1 %cond, i32 %addr1.cast.val, i32 %addr2.cast.val
  ret i32 %ld
}

Can you provide a reduced C++ source example for how we got to this IR? I couldn't repro with a simple case, so I must be missing some part of it.

Also, there might be something in common with:
https://bugs.llvm.org/show_bug.cgi?id=35354
or
https://bugs.llvm.org/show_bug.cgi?id=35284
?

This patch fixes PR35354, the first one from your list. I described everything aready. We got extra bitcast from the canonicalization of load/store sequence before function inlining, but this canonialization breaks minmax pattern recognition after the inlining o а min/max functions.

dberlin edited edge metadata.Nov 27 2017, 12:35 PM

So, what Sanjoy has pointed out is that canonicalization like this to select, breaks the ability to remove redundant loads and stores and do PRE, compared to the equivalent control flow version, in general.

While it's probably true that
IE your canonicalization of:
load bitcast (select (Cond, &V1, &V2))
to
select(Cond, load bitcast &V1, load bitcast &V2)

is not likely to break this more, and will make things a little better, all of our redundancy elimination passes and knowledge propagation passes would be happier with the control flow non-select version.

This is not possible to fix in those passes sanely (you'd need a fake cfg with fake instructions and fake operands to simulate the control flow).
Otherwise, the only way around this is to not canonicalize to select this way that early.

Otherwise, the only way around this is to not canonicalize to select this way that early.

And for reference, I have a patch towards that - D38566.

There may be other fixes needed to solve PR35354 (sorry for missing that in the title!)

Otherwise, the only way around this is to not canonicalize to select this way that early.

And for reference, I have a patch towards that - D38566.

There may be other fixes needed to solve PR35354 (sorry for missing that in the title!)

Your patch fixes SimplifyCFG pass, but canonicalization is occurred in InstCombiner (see combineLoadToOperationType function)

This patch is building on a transform that is already suspect (see discussion starting at https://bugs.llvm.org/show_bug.cgi?id=34603#c6 ).

In conjunction with the existing transform, it increases the instruction count from 3 to 5 for this example:

define i32 @store_bitcasted_load(i1 %cond, float* dereferenceable(4) %addr1, float* dereferenceable(4) %addr2) {
  %sel = select i1 %cond, float* %addr1, float* %addr2
  %bc1 = bitcast float* %sel to i32*
  %ld = load i32, i32* %bc1
  ret i32 %ld
}

$ ./opt -instcombine loadsel.ll -S

define i32 @store_bitcasted_load(i1 %cond, float* dereferenceable(4) %addr1, float* dereferenceable(4) %addr2) {
  %addr1.cast = bitcast float* %addr1 to i32*
  %addr2.cast = bitcast float* %addr2 to i32*
  %addr1.cast.val = load i32, i32* %addr1.cast, align 4
  %addr2.cast.val = load i32, i32* %addr2.cast, align 4
  %ld = select i1 %cond, i32 %addr1.cast.val, i32 %addr2.cast.val
  ret i32 %ld
}

Can you provide a reduced C++ source example for how we got to this IR? I couldn't repro with a simple case, so I must be missing some part of it.

Also, there might be something in common with:
https://bugs.llvm.org/show_bug.cgi?id=35354
or
https://bugs.llvm.org/show_bug.cgi?id=35284
?

The original transformation load (select (Cond, &V1, &V2)) --> select(Cond, load &V1, load &V2) also increases number of instructions from 2 to 3

define float @store_bitcasted_load(i1 %cond, float* dereferenceable(4) %addr1, float* dereferenceable(4) %addr2) {
  %sel = select i1 %cond, float* %addr1, float* %addr2
  %ld = load float, float* %sel
  ret float %ld
}

After the existing transformation we have:

define float @store_bitcasted_load(i1 %cond, float* dereferenceable(4) %addr1, float* dereferenceable(4) %addr2) {
  %addr1.val = load float, float* %addr1, align 4
  %addr2.val = load float, float* %addr2, align 4
  %ld = select i1 %cond, float %addr1.val, float %addr2.val
  ret float %ld
}

The original transformation load (select (Cond, &V1, &V2)) --> select(Cond, load &V1, load &V2) also increases number of instructions from 2 to 3

Yes - that's why I said the existing transform is suspect. There are 2 InstCombine problems here as I think you've noted:

  1. The transform that hoists loads ahead of select.
  2. The transform that adds bitcasts around FP loads ("Try to canonicalize loads which are only ever stored to operate over integers instead of any other type.")

I've looked at this a bit closer now, and SimplifyCFG really wants to create a select for this:

define float* @mymax(float* dereferenceable(4) %__a, float* dereferenceable(4) %__b) {
entry:
  %__comp = alloca %"struct.std::__1::__less", align 1
  %call = call zeroext i1 @less(%"struct.std::__1::__less"* nonnull %__comp, float* nonnull dereferenceable(4) %__a, float* nonnull dereferenceable(4) %
__b)
  br i1 %call, label %cond.true, label %cond.false

cond.true:                                        ; preds = %entry
  br label %cond.end

cond.false:                                       ; preds = %entry
  br label %cond.end

cond.end:                                         ; preds = %cond.false, %cond.true
  %cond-lvalue = phi float* [ %__b, %cond.true ], [ %__a, %cond.false ]
  ret float* %cond-lvalue
}

For this example, a select will be created by FoldTwoEntryPHINode(). If I stub that out, then a select will still be created in HoistThenElseCodeToIf(). If I stub that out, then a select will still be created in SpeculativelyExecuteBB(). So we would have to disable much more of SimplifyCFG to avoid creating the select. But even that is not enough - the bitcasts are interfering with further optimization.

If the consensus is that this instcombine bitcast transform is valid:

target datalayout = "e-m:o-i64:64-f80:128-n8:16:32:64-S128"
define void @store_bitcasted_load2(float* %loadaddr, float* %storeaddr) {
  %ld = load float, float* %loadaddr
  store float %ld, float* %storeaddr
  ret void
}

$ ./opt -instcombine -S bitcastload.ll

define void @store_bitcasted_load2(float* %loadaddr, float* %storeaddr) {
  %1 = bitcast float* %loadaddr to i32*
  %ld1 = load i32, i32* %1, align 4
  %2 = bitcast float* %storeaddr to i32*
  store i32 %ld1, i32* %2, align 4
  ret void
}

...then I think we have to account for that here and look through the bitcasts (as this patch is proposing). We could make this patch not create more instructions than it removes by starting the pattern match at the store instruction rather than the load?

We could make this patch not create more instructions than it removes by starting the pattern match at the store instruction rather than the load?

Yes, I already thought about it. Will try rework it by starting pattern matching from the store.

ABataev updated this revision to Diff 124625.Nov 28 2017, 1:07 PM

Make the conversion iff the load is part of load/store canonicalization conversion.

For reference, here are the bitcast-adding commit and discussion (cc @chandlerc ):
https://reviews.llvm.org/rL226781
http://lists.llvm.org/pipermail/llvm-dev/2015-January/080956.html

Nobody was sure what regressions would be caused by that change...looks like we found one. :)

I'm still concerned that we're building on a fold that might get removed. Let me propose an alternate idea:

  1. Fold the bitcasts out of the 5 instruction sequence starting from visitStoreInst(): store (bitcast (load (bitcast (select ...))))
  2. Create an exception to the bitcasting canonicalization for loads fed by a select, so we don't infinite-loop.

So I think we should be able to do this fold without relying on the fold that moves loads above select, so the test case is minimized to:

define void @bitcasted_store(i1 %cond, float* %loadaddr1, float* %loadaddr2, float* %storeaddr) {
  %sel = select i1 %cond, float* %loadaddr1, float* %loadaddr2
  %int_load_addr = bitcast float* %sel to i32*
  %ld = load i32, i32* %int_load_addr
  %int_store_addr = bitcast float* %storeaddr to i32*
  store i32 %ld, i32* %int_store_addr
  ret void
}
ABataev updated this revision to Diff 124948.Nov 30 2017, 8:39 AM

Update after review.

spatel added inline comments.Nov 30 2017, 11:51 AM
lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
567

Why do we need to match min/max? Does something break if we just match select?

If we need to restrict to min/max, this isn't the way to do it. Use pattern matches (m_c_SMax, etc) or value tracking's matchSelectPattern().

1331

It doesn't make sense to call this "decanonicalize". Whatever we choose to do here is redefining canonical. "removeBitcastsFromLoadStoreOnSelect"?

ABataev added inline comments.Dec 1 2017, 7:43 AM
lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
567
  1. Currently, we have troubles only with particular minmax pattern, not all select instructions. I think, at first we should do this for minmax. If it is required for other select-based patterns, we can extend this patch or remove it.
  2. Ok, will do it via pattern matchers. Though I can use m_c_<Matcher> directly, because we have a slightly different minmax pattern (comparing of values, but return addresses of this values, not values themselves).
1331

Ok, will rename it.

ABataev updated this revision to Diff 125146.Dec 1 2017, 8:00 AM

Update after review.

If we require matching the loads + cmp ahead of the select, then this should be the minimal test case:

define void @bitcasted_minmax_with_select_of_pointers(float* %loadaddr1, float* %loadaddr2, float* %storeaddr) {
  %ld1 = load float, float* %loadaddr1, align 4
  %ld2 = load float, float* %loadaddr2, align 4
  %cond = fcmp ogt float %ld1, %ld2
  %sel = select i1 %cond, float* %loadaddr1, float* %loadaddr2
  %int_load_addr = bitcast float* %sel to i32*
  %ld = load i32, i32* %int_load_addr, align 4
  %int_store_addr = bitcast float* %storeaddr to i32*
  store i32 %ld, i32* %int_store_addr, align 4
  ret void
}

...but I'm not getting the transform to fire. Something may be wrong with the pattern matching's use of m_Specific()?

lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
566–567

Could make this generally available by putting it into PatternMatch.h?

590–591

I don't think we need to loop here. Use peekThroughBitcast() instead.

ABataev updated this revision to Diff 125551.Dec 5 2017, 9:21 AM

Update after review.

spatel accepted this revision.Dec 6 2017, 3:07 PM

I don't see a better way to avoid the problem, so LGTM. See inline for some small issues. Wait a day to commit in case anyone else has ideas.

include/llvm/IR/PatternMatch.h
960–979 ↗(On Diff #125551)

Please commit this part as an NFC preliminary step.

lib/Transforms/InstCombine/InstCombineLoadStoreAlloca.cpp
567

I still can't read this without thinking it's a duplicate of something in matchSelectPattern(). Include "load" in the name? "isMinMaxWithLoads()"?

1352

Remove comment - the description above the function body is enough.

This revision is now accepted and ready to land.Dec 6 2017, 3:07 PM
This revision was automatically updated to reflect the committed changes.