This is an archive of the discontinued LLVM Phabricator instance.

[X86] Machine combine vnni instruction.
ClosedPublic

Authored by LuoYuanke on Apr 21 2023, 9:39 PM.

Details

Summary

"vpmaddwd + vpaddd" can be combined to vpdpwssd and the latency is
reduced after combination. However when vpdpwssd is in a critical path
the combination get less ILP. It happens when vpdpwssd is in a loop, the
vpmaddwd can be executed in parallel in multi-iterations while vpdpwssd
has data dependency for each iterations. If vpaddd is in a critical path
while vpmaddwd is not, it is profitable to split vpdpwssd into "vpmaddwd
+ vpaddd ".
This patch is based on the machine combiner framework to acheive decision
on "vpmaddwd + vpaddd" combination. The typical example code is as
below.

__m256i foo(int cnt, __m256i c, __m256i b, __m256i *p) {

    for (int i = 0; i < cnt; ++i) {
        __m256i a = p[i];
        __m256i m = _mm256_madd_epi16 (b, a);
        c = _mm256_add_epi32(m, c);
    }

    return c;
}

Diff Detail

Event Timeline

LuoYuanke created this revision.Apr 21 2023, 9:39 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2023, 9:39 PM
LuoYuanke requested review of this revision.Apr 21 2023, 9:39 PM
Herald added a project: Restricted Project. · View Herald TranscriptApr 21 2023, 9:39 PM
goldstein.w.n added inline comments.
llvm/lib/CodeGen/MachineCombiner.cpp
173

What is this change for?

427

This seems like an unrelated change. Can you split it?

llvm/lib/Target/X86/X86InstrInfo.cpp
9823

Maybe this belongs in X86FixupInstTuning.cpp?

llvm/test/CodeGen/X86/avxvnni-combine.ll
4

Can you precommit the test so we can see the diff?

Fix the AArch64 test case failure.

LuoYuanke added inline comments.Apr 22 2023, 12:28 AM
llvm/lib/CodeGen/MachineCombiner.cpp
173

For some reason the new created register for new created instructions is not managed by MRI, so it would miss adding the DefInstr's latency. I guess may be due to the new instructions are not inserted in the machine function. I'll debug more on it.

llvm/lib/Target/X86/X86InstrInfo.cpp
9823

The method is to override the virtual function of TargetInstrInfo.
X86FixupInstTuning.cpp may be integrated to MachineCombine.

llvm/test/CodeGen/X86/avxvnni-combine.ll
4

Sure, I'll precommit the test case first.

RKSimon added inline comments.Apr 22 2023, 2:10 AM
llvm/test/CodeGen/X86/avxvnni-combine.ll
2–291

Add AVX512-VNNI support as well?

300

Please can you add test coverage for a case where there isn't the cross-loop dependency?

auto bar(int cnt, __m256i *c, __m256i b, __m256i *p) {
    for (int i = 0; i < cnt; ++i) {
        __m256i a = p[i];
        __m256i m = _mm256_madd_epi16 (b, a);
        c[i] = _mm256_add_epi32(m, c[i]);
    }
}
LuoYuanke updated this revision to Diff 516059.Apr 22 2023, 5:23 AM

Add test case that there isn't the cross-loop dependency.

LuoYuanke added inline comments.Apr 22 2023, 5:31 AM
llvm/test/CodeGen/X86/avxvnni-combine.ll
429

vpmaddwd and vmovdqa (line 159) can be issued in parallel.

goldstein.w.n added inline comments.Apr 22 2023, 11:49 AM
llvm/lib/Target/X86/X86InstrInfo.cpp
9823

It seems.you are just querying for an opcode and replacing 1-1? Not that it can only be done in X86Fixup... But seems to fit there with less code.

craig.topper added inline comments.Apr 22 2023, 1:20 PM
llvm/lib/Target/X86/X86InstrInfo.cpp
9823

It’s replacing 1 instruction with 2 after doing analysis that consults the scheduling model.

LuoYuanke updated this revision to Diff 516145.Apr 23 2023, 3:27 AM

Add (avx512, 128, 256, 512) support.

LuoYuanke updated this revision to Diff 516149.Apr 23 2023, 3:33 AM

Update test case.

LuoYuanke updated this revision to Diff 516220.Apr 23 2023, 6:53 PM

Fix an opcode bug.

craig.topper added inline comments.Apr 23 2023, 8:45 PM
llvm/lib/CodeGen/MachineCombiner.cpp
426–432

Put curly braces around the if body for consistency with the else body.

llvm/lib/Target/X86/X86InstrInfo.cpp
9791

The Vp prefix on these variable names isn't providing much value. You can probably drop it.

skan added a subscriber: skan.Apr 24 2023, 12:05 AM

Address Craig's comments.

LuoYuanke marked 2 inline comments as done.Apr 24 2023, 12:31 AM
XinWang10 edited the summary of this revision. (Show Details)Apr 24 2023, 1:01 AM
LuoYuanke edited the summary of this revision. (Show Details)Apr 24 2023, 1:52 AM
LuoYuanke edited the summary of this revision. (Show Details)
LuoYuanke updated this revision to Diff 516312.Apr 24 2023, 1:55 AM
LuoYuanke edited the summary of this revision. (Show Details)

Update description

LuoYuanke edited the summary of this revision. (Show Details)Apr 24 2023, 1:58 AM
pengfei accepted this revision.Apr 27 2023, 12:54 AM

LGTM with some nits.

llvm/lib/Target/X86/X86InstrInfo.cpp
9797–9800

indent.

I think we don't need to duplicate for ymm and zmm. And maybe better to choose rr version as example.

9851

Avoid auto

llvm/lib/Target/X86/X86InstrInfo.h
605

Nit: Should copy+paste the comment here like others.

610

ditto.

This revision is now accepted and ready to land.Apr 27 2023, 12:54 AM
LuoYuanke updated this revision to Diff 517481.Apr 27 2023, 1:29 AM

Address Phoebe's comments.

This revision was landed with ongoing or failed builds.Apr 27 2023, 1:43 AM
This revision was automatically updated to reflect the committed changes.

@LuoYuanke I'm seeing tests failures on my local machine with expensive checks enabled - one buildbot is reporting something similar: https://lab.llvm.org/buildbot/#/builders/16/builds/47297

@LuoYuanke I'm seeing tests failures on my local machine with expensive checks enabled - one buildbot is reporting something similar: https://lab.llvm.org/buildbot/#/builders/16/builds/47297

Thanks, @RKSimon for notifying. I'll revert the patch first.

There is conflict when I try to revert the patch. I fixed the issue with rG40222ddcf8f5.