This is an archive of the discontinued LLVM Phabricator instance.

[X86][InstCombine] Simplify some `pternlog` intrinsics
ClosedPublic

Authored by goldstein.w.n on Mar 2 2023, 11:42 PM.

Details

Summary

Currently pternlog intrinsics stay as a function call and are unable
to merge with other bin-ops / constant-fold.

This commit adds support for reducing all pternlog intrinsics to
their base logic-ops so that they can be further reduced in other
passes.

Since the x86 backend doesn't do a great job creating vpternlog
instructions from stray logic ops, the current logic only simplifies
in the cases were we obviously will be able to do as good or
better. As the x86 backend improves, more cases can be simplified.

Diff Detail

Event Timeline

goldstein.w.n created this revision.Mar 2 2023, 11:42 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 2 2023, 11:42 PM
goldstein.w.n requested review of this revision.Mar 2 2023, 11:42 PM
Herald added a project: Restricted Project. · View Herald TranscriptMar 2 2023, 11:42 PM
craig.topper added inline comments.
llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp
588

If everything is const can we someone use the immediate as a lookup table instead of manually describing every case?

588

somehow* not someone

craig.topper added inline comments.Mar 3 2023, 12:12 AM
llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp
585

Do we need to make sure we are looking at vectors of ConstantInts rather than vectors of ConstantExpr?

pengfei added inline comments.Mar 3 2023, 12:46 AM
llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp
591–594

Maybe better to do it in FE and remove all these intrinsics finally?

goldstein.w.n added inline comments.Mar 3 2023, 1:37 PM
llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp
585

Will the ALU not fold with ConstantExpr? How can I test that?

588

If everything is const can we someone use the immediate as a lookup table instead of manually describing every case?

Yes although there are some IMM where we only need AC/AB/BC so think we would still want to check the IMM.

Are you concerned about compile time overhead? If not keeping the same logic for const / non-const seems simpler and less error prone.

591–594

What do you mean? Like in the header for _mm_ternarylogic_.... add a switch statement? Or something else?

But would prefer to keep it somewhere we have const/non-const info on A/B/C because it will currently cause regressions to simplify many of the cases (working on a patch for backend generation, but a ways away).

goldstein.w.n added inline comments.Mar 3 2023, 1:40 PM
llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp
588

If everything is const can we someone use the immediate as a lookup table instead of manually describing every case?

Yes although there are some IMM where we only need AC/AB/BC so think we would still want to check the IMM.

Are you concerned about compile time overhead? If not keeping the same logic for const / non-const seems simpler and less error prone.

If all three are const we can just implement it with a loop above all the bits. Its just:

for(i : bitwidth()) {
res[i] = (imm >> ((A[i] << 2) + (B[i] << 1) + (C[i] << 0))) & 1
}

but still issue of we will miss cases where C is non-constant but also truly unneeded.

craig.topper added inline comments.Mar 3 2023, 2:07 PM
llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp
585

ConstantExpr can involve things that are conceptually constant but aren’t known to the compiler. Like the address of a global variable. SelectionDAG will expand ConstantExprs into instructions.

588

I really want there to be some nice algorithm to do this optimization. It’s not easy to see that all 256 entries in the switch are correct. We have to trust that you transcribed them from Intel’s table correctly or we need to check that ourselves.

goldstein.w.n added inline comments.Mar 3 2023, 2:20 PM
llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp
585

I see, the issue with isa<ConstantInt> is then it only works for splats, is there a check for evaluatable constant?

588

I really want there to be some nice algorithm to do this optimization. It’s not easy to see that all 256 entries in the switch are correct. We have to trust that you transcribed them from Intel’s table correctly or we need to check that ourselves.

We can do the constant case with the algorithm but I couldn't figure out how to generic turn the imm -> logic ops.

not intels table (don't know if there is one).
Use the following not so clean pyscript: https://godbolt.org/z/aahK7qEhM
tested that the results are correct (turned off the constant guards and use the following): https://godbolt.org/z/KMrMdoY3f

craig.topper added inline comments.Mar 3 2023, 2:56 PM
llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp
585

isa<ConstantInt> doesn't work for splats. It only works for scalars. Splats are usually handled in InstCombine by using m_APInt which calls getSplatValue() on the Constant.

I think you would need to loop over all elements, call getAggregateElement, and make sure it returns a ConstantInt for each element. From a quick search, I see InstCombinerImpl::dyn_castNegVal does something like that.

588

Intel has a table at the start of CHAPTER 5 INSTRUCTION SET REFERENCE, V in Volume 2C of the SDM

goldstein.w.n added inline comments.Mar 3 2023, 3:08 PM
llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp
585

Thanks, will update.

588

That would have saved a moment :/

Given that they recommend a table as well not sure if there is a generic algorithm that has reasonable perf.

goldstein.w.n marked an inline comment as not done.Mar 3 2023, 3:11 PM
goldstein.w.n added inline comments.
llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp
591–594

What do you mean? Like in the header for _mm_ternarylogic_.... add a switch statement? Or something else?

But would prefer to keep it somewhere we have const/non-const info on A/B/C because it will currently cause regressions to simplify many of the cases (working on a patch for backend generation, but a ways away).

We can handle the constants algorithimically if you prefer (my feeling however is 1 implementation is preferred and if we need the table would be simpler to just use that).

craig.topper added inline comments.Mar 3 2023, 3:17 PM
llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp
591–594

How many entries are there that do something for non-constant in the current patch?

I'm bit nervous about replacing vpternlog intrinsics with complex sequences and hoping to turn it back in the backend. If there are many logic ops involved we might make a different decision than the programmer. We already get bugs from time to time that we changed the shuffles that a programmer wrote and generated worse code.

goldstein.w.n marked an inline comment as not done.Mar 3 2023, 3:24 PM
goldstein.w.n added inline comments.
llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp
591–594

How many entries are there that do something for non-constant in the current patch?

At the moment only have 11 cases actually on. The rest are placaeholder for when backend gets improved.

I'm bit nervous about replacing vpternlog intrinsics with complex sequences and hoping to turn it back in the backend. If there are many logic ops involved we might make a different decision than the programmer. We already get bugs from time to time that we changed the shuffles that a programmer wrote and generated worse code.

Agreed, thats why most are off at the moment. I am working on a patch to improve vpternlog generation at which point we will hopefully be able to turn more on w.o causing regressions.

Think ternlogd is a bit simpler. Unlike shuffles the pattern for creating them is pretty simple. Take an arbitrary number of logic ops that use the same 3 operands -> ternlog. The issue we have now is we only look at 3-ops instead of looking at 3-operands.

But OFC will test results and won't turn on cases that still cause regressions.

Correct constant detection for vec + verification

goldstein.w.n marked 3 inline comments as done.Mar 3 2023, 5:59 PM
goldstein.w.n added inline comments.
llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp
588

I really want there to be some nice algorithm to do this optimization. It’s not easy to see that all 256 entries in the switch are correct. We have to trust that you transcribed them from Intel’s table correctly or we need to check that ourselves.

I added asserts for all cases. Could still be a typeof in the future if someone changes the return but forgets to do the assert, but does verify that what we have now is correct.

Make verification impossible to mess up

goldstein.w.n added inline comments.Mar 3 2023, 7:00 PM
llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp
588

I really want there to be some nice algorithm to do this optimization. It’s not easy to see that all 256 entries in the switch are correct. We have to trust that you transcribed them from Intel’s table correctly or we need to check that ourselves.

I added asserts for all cases. Could still be a typeof in the future if someone changes the return but forgets to do the assert, but does verify that what we have now is correct.

Changed so it tracks the imm along with creating the return value. Should be impossible to accidentally mess up now.

nikic added a subscriber: nikic.Mar 4 2023, 1:11 AM
nikic added inline comments.
llvm/lib/Target/X86/X86InstCombineIntrinsic.cpp
590

You are looking for match(V, m_ImmConstant()) I think.

goldstein.w.n marked an inline comment as done.

Use match(V, m_ImmConstant())

Fix from imm 3rd arg

pengfei added inline comments.Mar 15 2023, 7:14 PM
llvm/test/Transforms/InstCombine/X86/x86-ternlog.ll
2312

The following constant tests are too tedium and hard to review. Can we use a unittest instead?

Add unit tests

goldstein.w.n marked an inline comment as done.Mar 28 2023, 8:20 PM
goldstein.w.n added inline comments.
llvm/test/Transforms/InstCombine/X86/x86-ternlog.ll
2312

Done, see new file TernlogTest.cpp

pengfei accepted this revision.Mar 29 2023, 2:28 AM

LGTM.

llvm/test/Transforms/InstCombine/X86/x86-ternlog.ll
2312

Thanks! Do we still need these constant tests?

llvm/unittests/Transforms/InstCombine/X86/TernlogTest.cpp
187 ↗(On Diff #509196)

What is loop used for?

This revision is now accepted and ready to land.Mar 29 2023, 2:28 AM
goldstein.w.n marked an inline comment as done.Mar 29 2023, 10:05 AM
goldstein.w.n added inline comments.
llvm/test/Transforms/InstCombine/X86/x86-ternlog.ll
2312

Thanks! Do we still need these constant tests?

I prefer to keep. I was less so trying to test that the values produced a correct than sanity test that we properly get constant folding in codegen.

llvm/unittests/Transforms/InstCombine/X86/TernlogTest.cpp
187 ↗(On Diff #509196)

Because the tests use random values it may make sense to increase number of rand iterations. I can remove if you think its unnecessary.

Remove unnecessary NTests loop

goldstein.w.n marked an inline comment as done.Mar 29 2023, 10:37 AM
goldstein.w.n added inline comments.
llvm/unittests/Transforms/InstCombine/X86/TernlogTest.cpp
187 ↗(On Diff #509196)

Removed it.

This revision was automatically updated to reflect the committed changes.
mehdi_amini added inline comments.
llvm/unittests/Transforms/InstCombine/X86/CMakeLists.txt
14 ↗(On Diff #510529)

This breaks the build when the X86 target is not configured, you likely need to filter out this directory somehow. Can you look into it quickly or revert temporarily?

goldstein.w.n added inline comments.Apr 3 2023, 10:06 AM
llvm/unittests/Transforms/InstCombine/X86/CMakeLists.txt
14 ↗(On Diff #510529)

Reverted, Sorry about that!

mehdi_amini added inline comments.Apr 3 2023, 10:09 AM
llvm/unittests/Transforms/InstCombine/X86/CMakeLists.txt
14 ↗(On Diff #510529)

Thanks!

(FYI: in general revert commit should mention the reason for the revert)

goldstein.w.n reopened this revision.Apr 3 2023, 10:59 AM
This revision is now accepted and ready to land.Apr 3 2023, 10:59 AM

Fix test issues

skan added a subscriber: skan.Apr 5 2023, 10:48 PM