This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Generate better code for std::bit_ceil
ClosedPublic

Authored by kazu on Mar 4 2023, 12:25 AM.

Details

Summary

Without this patch, std::bit_ceil<uint32_t> is compiled as:

%dec = add i32 %x, -1
%lz = tail call i32 @llvm.ctlz.i32(i32 %dec, i1 false)
%sub = sub i32 32, %lz
%res = shl i32 1, %sub
%ugt = icmp ugt i32 %x, 1
%sel = select i1 %ugt, i32 %res, i32 1

With this patch, we generate:

%dec = add i32 %x, -1
%ctlz = tail call i32 @llvm.ctlz.i32(i32 %dec, i1 false)
%sub = sub nsw i32 0, %ctlz
%and = and i32 %sub, 31
%sel = shl nuw i32 1, %and
ret i32 %sel

https://alive2.llvm.org/ce/z/pwezvF
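For reference, a minimal C++ sketch of the two forms above (illustrative names only; the clz-of-zero case is guarded by hand since C++ has no direct equivalent of ctlz(..., i1 false)):

#include <cstdint>

// Before: the shift count is 32 - clz(x - 1); the select returns 1 for x <= 1
// and keeps the shift count in range. Inputs above 2^31 have no representable
// bit_ceil, matching the IR above.
uint32_t bit_ceil_before(uint32_t x) {
  unsigned lz = x > 1 ? __builtin_clz(x - 1) : 32;
  return x > 1 ? uint32_t{1} << (32 - lz) : 1u;
}

// After: (0 - clz(x - 1)) & 31 equals (32 - clz(x - 1)) & 31, and for x <= 1
// the masked count is 0, so 1 << 0 == 1 and the select is no longer needed.
uint32_t bit_ceil_after(uint32_t x) {
  unsigned lz = x > 1 ? __builtin_clz(x - 1) : 32;
  return uint32_t{1} << ((0u - lz) & 31u);
}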

This patch recognizes the specific pattern and removes the final
select. Specifically, it recognizes patterns from std::bit_ceil in
libc++ and libstdc++. In addition to the LLVM IR generated for
std::bit_ceil(X), this patch recognizes variants like:

std::bit_ceil(X - 1)
std::bit_ceil(X + 1)
std::bit_ceil(X + 2)
std::bit_ceil(-X)
std::bit_ceil(~X)

This patch fixes:

https://github.com/llvm/llvm-project/issues/60802

Diff Detail

Event Timeline

kazu created this revision.Mar 4 2023, 12:25 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 4 2023, 12:25 AM
kazu requested review of this revision.Mar 4 2023, 12:25 AM
kazu edited the summary of this revision. (Show Details)Mar 4 2023, 12:28 AM
kazu added reviewers: RKSimon, spatel, goldstein.w.n.
goldstein.w.n added inline comments.Mar 4 2023, 10:23 AM
llvm/test/CodeGen/X86/bit_ceil.ll
2 ↗(On Diff #502358)

Can you add tests without BMI and without LZCNT enabled?

36 ↗(On Diff #502358)

Is the select at the end just unneeded? If so, this would probably be better addressed in the IR or in DAGCombiner, as it's pretty generic. Maybe DAGCombiner::VisitSETCC?

goldstein.w.n added inline comments.Mar 4 2023, 10:23 AM
llvm/lib/Target/X86/X86ISelLowering.cpp
47340 ↗(On Diff #502358)

This all seems a bit brittle; for example, if someone does std::bit_ceil(x + 1), we won't have the add on the ctlz:

%2 = tail call i32 @llvm.ctlz.i32(i32 %0, i1 false), !range !5
%3 = sub nuw nsw i32 32, %2
%4 = shl nuw i32 1, %3
%5 = add i32 %0, -1
%6 = icmp ult i32 %5, -2
%7 = select i1 %6, i32 %4, i32 1
ret i32 %7

Is there a way we could more generically detect that the select is unneeded?

kazu marked an inline comment as done.Mar 4 2023, 12:37 PM
kazu added inline comments.
llvm/lib/Target/X86/X86ISelLowering.cpp
47340 ↗(On Diff #502358)

Yes, brittleness is a real concern.

I could calculate the difference between the ctlz argument and icmp argument.

llvm/test/CodeGen/X86/bit_ceil.ll
2 ↗(On Diff #502358)
36 ↗(On Diff #502358)

Correct, we do not need the select at the end.

We do rely on a subtle property that shlx masks off the shift count with 31 or 63, depending on the operand size. I generate ISD::AND with 31 (or 63), expecting that the hardware shift instruction absorbs it.

At the LLVM IR level, the value of the shift doesn't matter for those inputs, because the final select picks 1 for inputs < 2.

Yes, we could do this in the IR or DAG combiner somewhere. Let me look into it.

goldstein.w.n added inline comments.Mar 4 2023, 1:48 PM
llvm/test/CodeGen/X86/bit_ceil.ll
36 ↗(On Diff #502358)

> Correct, we do not need the select at the end.

> We do rely on a subtle property that shlx masks off the shift count with 31 or 63, depending on the operand size. I generate ISD::AND with 31 (or 63), expecting that the hardware shift instruction absorbs it.

You can still generate the ISD::AND in DAGCombiner; it will be removed in later lowering.

spatel added a comment.Mar 5 2023, 5:06 AM

As noted in previous comments, it's hard to reliably pattern-match a sequence that is this long. We are missing several potential canonicalizations in IR, and I see at least one possible variant that is shorter in IR:
https://alive2.llvm.org/ce/z/CkQ433

We probably want to form the umax variant in IR based on it being one less instruction. If a target has a umax instruction (and ctlz), then that's probably going to be the best in codegen.

So this patch should wait until the IR questions are resolved, but at first glance, it seems like we will need D144451, more codegen conversions, and several IR patches.

What is preventing us from performing this in InstCombine? I don't think this pattern will emerge in SelectionDAG.

spatel added a comment.Mar 5 2023, 6:17 AM

> What is preventing us from performing this in InstCombine? I don't think this pattern will emerge in SelectionDAG.

I haven't found a way to avoid a poison shift in IR without doing a cmp+select or umax yet. I think we're relying on the x86-specific behavior of masking the shift amount to make that part of the logic disappear in this patch.
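
To make the constraint concrete (an illustration, not part of the patch): shifting a 32-bit value by 32 is poison in LLVM IR and undefined behavior in C++, while x86 shl/shlx only look at the low 5 bits of the count (6 bits for 64-bit operands), so a count of 32 behaves like a shift by 0 in hardware. Masking the count makes that behavior explicit:

#include <cstdint>

// Hypothetical helper: spells out the count masking that the x86 shift
// instructions perform implicitly. Without "& 31" this would be undefined for
// count == 32; with it, count == 32 shifts by 0, which is exactly what the
// select-free bit_ceil form relies on.
uint32_t shift_one_by(uint32_t count) {
  return uint32_t{1} << (count & 31u);
}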

goldstein.w.n added a comment.EditedMar 5 2023, 9:52 AM

>> What is preventing us from performing this in InstCombine? I don't think this pattern will emerge in SelectionDAG.

> I haven't found a way to avoid a poison shift in IR without doing a cmp+select or umax yet. I think we're relying on the x86-specific behavior of masking the shift amount to make that part of the logic disappear in this patch.

The IR is:

%2 = add i32 %0, -1
%3 = tail call i32 @llvm.ctlz.i32(i32 %2, i1 false), !range !5
%4 = sub nuw nsw i32 32, %3
%5 = shl nuw i32 1, %4
%6 = icmp ugt i32 %0, 1
%7 = select i1 %6, i32 %5, i32 1
ret i32 %7

The poison shift is when %3 is zero?

Yes - if we shift by the bitwidth, that's poison in IR.

IIUC, in this example we don't have to care about any input "ugt 0x8000_0000" (https://en.cppreference.com/w/cpp/numeric/bit_ceil), so we'd need the front-end to provide that info somehow. But 0x8000_0000 is still a valid input, so does that knowledge actually help?

Yeah, the impl would either need to change width - clz(X) -> -clz(X) % width or add some assume. I guess it makes sense to keep this in X86ISel, but it should happen sooner in the pipeline rather than as a very brittle match on cmov; we might be able to find an easier pattern in CombineSELECT, although I have doubts about whether it would stand up in real code.
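
As a quick sanity check of that rewrite (illustrative only, not part of the patch): for a power-of-two width, width - n and -n agree modulo width, because width itself is congruent to 0.

#include <cassert>

int main() {
  const unsigned width = 32;
  // 0u - n computes -n modulo 2^32; both sides reduce to (width - n) mod width.
  for (unsigned n = 0; n <= width; ++n)
    assert((width - n) % width == (0u - n) % width);
  return 0;
}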

nikic added a subscriber: nikic.Mar 5 2023, 11:33 AM

> I haven't found a way to avoid a poison shift in IR without doing a cmp+select or umax yet. I think we're relying on the x86-specific behavior of masking the shift amount to make that part of the logic disappear in this patch.

Well, we could also do what is being proposed here and mask the shift amount (https://alive2.llvm.org/ce/z/FD_9Sh). On x86, that happens to be free. I'm not sure this makes sense as an IR canonicalization, though.

Oops - yes, I was overlooking that direct translation. It's shorter IR with the mask, and the backend should already account for the mask if it can be removed, so yes, that seems like a good improvement.

I think we still have several potential canonicalizations to deal with if we want a robust solution (see my earlier Alive2 link with the umax and shift-right variations).

RKSimon requested changes to this revision.Mar 8 2023, 3:44 AM

This should be handled in InstCombine

This revision now requires changes to proceed.Mar 8 2023, 3:44 AM
kazu updated this revision to Diff 504466.Mar 12 2023, 1:03 PM
kazu marked an inline comment as done.
kazu edited the summary of this revision. (Show Details)

Ported to InstCombine.

kazu retitled this revision from [X86] Generate better code for std::bit_ceil to [InstCombine] Generate better code for std::bit_ceil.Mar 12 2023, 1:04 PM
kazu edited the summary of this revision. (Show Details)
kazu added a comment.Mar 12 2023, 1:06 PM

This should be handled in InstCombine

I just ported my patch to InstCombine. Please take a look. Thanks!

goldstein.w.n added inline comments.Mar 16 2023, 11:15 AM
llvm/test/Transforms/InstCombine/bit_ceil.ll
15

Can you add some tests where there are prior transformations on %x (some binops like add/sub/shift/and/xor/or) to test that this impl is reasonably robust in finding the pattern?

kazu marked an inline comment as done.Mar 19 2023, 1:47 PM
kazu added inline comments.
llvm/test/Transforms/InstCombine/bit_ceil.ll
15

I added a few more tests:

https://github.com/llvm/llvm-project/commit/ef860cf150f0d7e0f635026c55182bc1f65ac8f9

I'll update this patch shortly.

kazu updated this revision to Diff 506421.Mar 19 2023, 1:56 PM
kazu marked an inline comment as done.
kazu edited the summary of this revision. (Show Details)

Handle more cases.

kazu edited the summary of this revision. (Show Details)Mar 19 2023, 1:58 PM

Please take a look. Thanks!

nikic requested changes to this revision.Mar 19 2023, 2:53 PM

I haven't looked in detail at what you're doing here, but you're clearly looking for the ConstantRange class: makeExactICmpRegion(), add(), sub(), binaryNot(), etc.

This revision now requires changes to proceed.Mar 19 2023, 2:53 PM
kazu updated this revision to Diff 506451.Mar 19 2023, 5:20 PM
kazu edited the summary of this revision. (Show Details)

Started using ConstantRange.

kazu added a comment.Mar 19 2023, 5:21 PM

I haven't looked in detail what you're doing here, but you're clearly looking for the ConstantRange class. makeExactICmpRegion(), add(), sub(), binaryNot(), etc.

Thank you for the pointer. I started using ConstantRange in this patch.
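
As a rough sketch of how the suggested ConstantRange pieces fit together (assumed function name and shape, not the exact code in this revision): take the range of X implied by the select condition being false (the arm that returns 1), then push it through the arithmetic that feeds ctlz.

#include "llvm/ADT/APInt.h"
#include "llvm/IR/ConstantRange.h"
#include "llvm/IR/InstrTypes.h"
using namespace llvm;

// Range of the ctlz operand when the select picks the constant arm and the
// ctlz operand has the shape "X + AddC" (e.g. AddC == -1 for bit_ceil(X)).
// Other shapes would use sub(), binaryNot(), etc. in the same way.
ConstantRange ctlzOperandRange(CmpInst::Predicate Pred, const APInt &CmpC,
                               const APInt &AddC) {
  // Values of X for which "icmp Pred X, CmpC" is false, i.e. the constant arm
  // of the select is taken.
  ConstantRange XR = ConstantRange::makeExactICmpRegion(
      CmpInst::getInversePredicate(Pred), CmpC);
  return XR.add(ConstantRange(AddC));
}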

RKSimon added inline comments.Mar 20 2023, 6:28 AM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
3193

empty clause?

kazu added inline comments.Mar 20 2023, 8:58 AM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
3193

I would like to find X here and leave it nullptr if no match occurs. I don't have any actions when the if condition is true. Should I say something like this?

(void) match(CtlzOp, m_Add(m_Value(X), m_ConstantInt())) ||
    match(CtlzOp, m_Sub(m_ConstantInt(), m_Value(X))) ||
    match(CtlzOp, m_Not(m_Value(X)));
nikic added inline comments.Mar 20 2023, 12:40 PM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
3276

All these m_ConstantInt should be m_APInt.
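
For context, a generic sketch (assumed names) of the preferred form: m_APInt binds the constant as an APInt and also matches splat vector constants, whereas m_ConstantInt only matches scalar ConstantInts.

#include "llvm/ADT/APInt.h"
#include "llvm/IR/PatternMatch.h"
#include "llvm/IR/Value.h"
using namespace llvm;
using namespace llvm::PatternMatch;

// Matches "X + C" where C is a scalar integer constant or a splat vector
// constant, binding X and the constant value.
bool matchAddOfConstant(Value *CtlzOp, Value *&X, const APInt *&C) {
  return match(CtlzOp, m_Add(m_Value(X), m_APInt(C)));
}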

kazu updated this revision to Diff 506687.Mar 20 2023, 12:59 PM

Use m_APInt instead of m_ConstantInt.

kazu marked an inline comment as done.Mar 20 2023, 1:00 PM

please can you add vector test coverage?

nikic added inline comments.Mar 21 2023, 7:53 AM
llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
3193

Something that I don't really get is why we have to do this separately. Wouldn't something like this be sufficient?

Value *RangeOp = Cond0;
ConstantRange CR = ConstantRange::makeExactICmpRegion(
    CmpInst::getInversePredicate(Pred), *Cond1);

if (CtlzOp != RangeOp) {
  if (match(CtlzOp, m_Add(m_Specific(RangeOp), m_APInt(C)))) {
    CR = CR.add(*C);
  }
  // ...
}
3225

These RangeOp assignments are all dead?

3263

IRBuilderBase & is preferred.

3270

Drop nullptr init.

3292

CreateNeg

kazu updated this revision to Diff 507117.Mar 21 2023, 2:10 PM
  • Removed the empty clause.
  • Removed RangeOp.
  • Switched to CreateNeg.
  • Made the matching a little more straightforward.
kazu marked 6 inline comments as done.Mar 21 2023, 2:12 PM

Please take a look. Thanks!

llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
3193

I've revamped the matching logic. No more RangeOp.

nikic requested changes to this revision.Mar 22 2023, 2:32 AM

Implementation looks fine, but this needs more test coverage.

llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
3197

Check CommonAncestor == CtlzOp here and then drop the first if below? So it's one MatchForward plus one MatchForward after peeking through add?

llvm/test/Transforms/InstCombine/bit_ceil.ll
153

Some missing tests:

  • Commuted select operands.
  • Select constant not 1.
  • Select condition does not imply the needed range.
  • Multi-use test.
  • Vector test.
  • Wrong constant in sub.
This revision now requires changes to proceed.Mar 22 2023, 2:32 AM
kazu updated this revision to Diff 507617.Mar 22 2023, 10:56 PM
kazu marked an inline comment as done.

Updated tests, including a vector one.

kazu marked an inline comment as done.Mar 22 2023, 11:06 PM

I've updated the patch. Please take a look. Thanks!

llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
3197

I am not sure if I understand your suggestion here. Could you elaborate? We have three possibilities:

  • Peek through an add
  • MatchForward
  • Peek through an add and MatchForward

I cannot just drop the first if below. I need to check for a specific operand m_Add(m_Specific(CtlzOp), ...).

nikic accepted this revision.Mar 23 2023, 4:19 AM

LG

llvm/lib/Transforms/InstCombine/InstCombineSelect.cpp
3193–3227

To clarify, this is what I meant.

llvm/test/Transforms/InstCombine/bit_ceil.ll
154–155

The order gets canonicalized here. You can use something like icmp slt %x, 0 to avoid that.

RKSimon accepted this revision.Mar 23 2023, 5:22 AM

unblocking this - thanks @kazu

This revision is now accepted and ready to land.Mar 23 2023, 5:22 AM
kazu updated this revision to Diff 507939.Mar 23 2023, 7:06 PM

Adjusted MatchForward.
Adjusted the condition for select.

kazu marked 2 inline comments as done.Mar 23 2023, 7:08 PM
kazu added inline comments.
llvm/test/Transforms/InstCombine/bit_ceil.ll
154–155

I settled on icmp eq %dec, 0. The order of the select operands seems to stick that way.

This revision was landed with ongoing or failed builds.Mar 23 2023, 7:27 PM
This revision was automatically updated to reflect the committed changes.
kazu marked an inline comment as done.