Page MenuHomePhabricator

[InstCombine] look through bitcasts to find selects
ClosedPublic

Authored by spatel on May 28 2016, 12:24 PM.

Details

Summary

The motivating example for this patch is this IR produced via SSE intrinsics in C:

define <2 x i64> @gibson(<2 x i64> %a, <2 x i64> %b) {
  %t0 = bitcast <2 x i64> %a to <4 x i32>
  %t1 = bitcast <2 x i64> %b to <4 x i32>
  %cmp = icmp sgt <4 x i32> %t0, %t1
  %sext = sext <4 x i1> %cmp to <4 x i32>
  %t2 = bitcast <4 x i32> %sext to <2 x i64>
  %and = and <2 x i64> %t2, %a
  %neg = xor <4 x i32> %sext, <i32 -1, i32 -1, i32 -1, i32 -1>
  %neg2 = bitcast <4 x i32> %neg to <2 x i64>
  %and2 = and <2 x i64> %neg2, %b
  %or = or <2 x i64> %and, %and2
  ret <2 x i64> %or
}

For an AVX target, this is currently:

vpcmpgtd	%xmm1, %xmm0, %xmm2
vpand	%xmm0, %xmm2, %xmm0
vpandn	%xmm1, %xmm2, %xmm1
vpor	%xmm1, %xmm0, %xmm0
retq

With this patch, it becomes:

vpmaxsd	%xmm1, %xmm0, %xmm0

Diff Detail

Repository
rL LLVM

Event Timeline

spatel updated this revision to Diff 58899.May 28 2016, 12:24 PM
spatel retitled this revision from to [InstCombine] look through bitcasts to find selects.
spatel updated this object.
spatel added reviewers: majnemer, RKSimon, chandlerc.
spatel added a subscriber: llvm-commits.
eli.friedman added inline comments.
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
1335 ↗(On Diff #58899)

This canonicalization seems incomplete... for example, if both operands are sext instructions, the one which belongs on the LHS is the one where the source is a bool, but this just picks randomly.

1359 ↗(On Diff #58899)

Assuming bitcasts are free is kind of a big assumption... you could easily end up in situations where the bitcasts are not free. If X is scalar, the extra bitcasts should be folded away, but you could end up with bad effects on code like "((uint64_t)(vec1 == vec2)) & x".

spatel added inline comments.May 28 2016, 1:35 PM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
1335 ↗(On Diff #58899)

Yes, you're right. I was going to say it is just an existing bug for the sext/not cases, but it seems we don't optimize the bitcast case either:

define <2 x i64> @vecBitcastOp0(<4 x i32> %a, <8 x i16> %b) {
  %bc1 = bitcast <4 x i32> %a to <2 x i64>
  %bc2 = bitcast <8 x i16> %b to <2 x i64>
  %and = and <2 x i64> %bc1, %bc2
  ret <2 x i64> %and
}

I thought we'd eliminate one bitcast (randomly?) for this case and do the logic op in the format of one of the inputs. Is that a valid IR optimization?

1359 ↗(On Diff #58899)

Would it be fair to assume that vector bitcasts to another vector type are free? If not, then I suppose I need to abandon this patch and try again in the DAG?

eli.friedman added inline comments.May 28 2016, 2:27 PM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
1335 ↗(On Diff #58899)

Well, you can't transform the return type of a function unconditionally; it could change the calling convention.

But to actually answer your question, you can perform an "and" in any type of the right size as long as there aren't any padding bits involved; "and" is a bit-wise operation. Whether that's a good idea probably depends on the target; performing an "and" in an illegal type is likely to be expensive.

1359 ↗(On Diff #58899)

Vector to vector bitcasts are free on most platforms, but not all. As a practical example, big-endian AArch64 has non-trivial vector bitcasts: a bitcast translates to a vector shuffle to put the elements in the right places.

In theory, there's infrastructure for querying costs like this from transform passes; see TargetTransformInfo::getCastInstrCost. Not sure how that would work out in practice.

spatel added inline comments.May 28 2016, 3:05 PM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
1335 ↗(On Diff #58899)

Sorry - I didn't finish that example properly here. We'd of course need a bitcast after the "promoted" logic op to get the types to match. Please let me know what you think of this example:
https://llvm.org/bugs/show_bug.cgi?id=27925

1359 ↗(On Diff #58899)

Based on my experience so far, I assume we'd just delay this kind of combining to the DAG rather than attempt it in IR in a target-dependent way. It sounds like that's what needs to happen in this case: the problem can't be generalized to all targets, so I'll make an x86 solution to start and other targets can pick it up if they like.

I'll let this sit a bit in case anyone else wants to comment, but for now I'm assuming I will eventually abandon this patch. Thank you for the feedback, Eli!

spatel added a comment.Jun 1 2016, 5:56 PM

It's going to take multiple patches/steps to get this to match in the DAG. Is there any concern about making the following transform in IR:

Before:

define <2 x i64> @gibson(<2 x i64> %a, <2 x i64> %b) {
  %t0 = bitcast <2 x i64> %a to <4 x i32>
  %t1 = bitcast <2 x i64> %b to <4 x i32>
  %cmp = icmp sgt <4 x i32> %t0, %t1
  %sext = sext <4 x i1> %cmp to <4 x i32>
  %t2 = bitcast <4 x i32> %sext to <2 x i64>
  %and = and <2 x i64> %t2, %a
  %neg = xor <4 x i32> %sext, <i32 -1, i32 -1, i32 -1, i32 -1>
  %neg2 = bitcast <4 x i32> %neg to <2 x i64>
  %and2 = and <2 x i64> %neg2, %b
  %or = or <2 x i64> %and, %and2
  ret <2 x i64> %or
}

After:

define <2 x i64> @max(<2 x i64> %a, <2 x i64> %b) {
  %t0 = bitcast <2 x i64> %a to <4 x i32>
  %t1 = bitcast <2 x i64> %b to <4 x i32>
  %cmp = icmp sgt <4 x i32> %t0, %t1
  %or = select <4 x i1> %cmp, <4 x i32> %t0, <4 x i32> %t1
  %r1 = bitcast <4 x i32> %or to <2 x i64>
  ret <2 x i64> %r1
}

We're trading 4 logic ops, a bitcast, and a sext for a select. Is that better / more canonical IR for all targets?

We're trading 4 logic ops, a bitcast, and a sext for a select. Is that better / more canonical IR for all targets?

I can answer my own question with a 'yes'. We already have this transform in matchSelectFromAndOr(). It looks like it just needs to be enhanced to see through the bitcasts.

spatel updated this revision to Diff 59470.Jun 2 2016, 4:04 PM

Patch updated:
Use the existing matchSelectFromAndOr() infrastructure to transform the larger select pattern that is present in the motivating example.
Hopefully, this alleviates any concern about the cost of bitcasts because we're eliminating all of the logic/sext instructions.

majnemer accepted this revision.Jun 2 2016, 4:15 PM
majnemer edited edge metadata.

LGTM with nits

lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
1647–1689 ↗(On Diff #59470)

It would be nice if we could find a way to fold the code which handles the bitcast case with the code which doesn't.

This revision is now accepted and ready to land.Jun 2 2016, 4:15 PM
spatel added inline comments.Jun 2 2016, 4:25 PM
lib/Transforms/InstCombine/InstCombineAndOrXor.cpp
1647–1689 ↗(On Diff #59470)

Agreed - at the least, we can use m_CombineOr in the same way to make the code look similar.
However, after my bot-killing / bug-raising misadventure in rL269728, I'm determined to make any refactoring a follow-up step. :)
I'll add a FIXME comment in this patch. Thank you for the prompt review!

This revision was automatically updated to reflect the committed changes.