[X86][TTI] Costmodel for AVX512DQ's VPMOVM2[DQ] / VPMOV[DQ]2M instructions
ClosedPublic

Authored by lebedev.ri on Sat, Nov 20, 3:03 AM.

Details

Summary

Much like the VPMOVM2[BW] / VPMOV[BW]2M from AVX512BW,
these either sign-extend the mask register into a vector,
or pack the mask from a vector register.
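
For readers unfamiliar with these instructions, here is a minimal scalar sketch of the two directions for the 512-bit, 32-bit-lane case. The function names `movm2d` and `movd2m` are hypothetical and only model the semantics; they are not LLVM or intrinsic APIs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of VPMOVM2D (512-bit): each mask bit is sign-extended
// into a 32-bit lane, giving all-ones or all-zeros per lane.
std::array<int32_t, 16> movm2d(uint16_t mask) {
  std::array<int32_t, 16> v{};
  for (int i = 0; i < 16; ++i)
    v[i] = ((mask >> i) & 1) ? -1 : 0;
  return v;
}

// Scalar model of VPMOVD2M (512-bit): bit i of the result is the
// most significant (sign) bit of lane i.
uint16_t movd2m(const std::array<int32_t, 16> &v) {
  uint16_t mask = 0;
  for (int i = 0; i < 16; ++i)
    mask |= static_cast<uint16_t>((static_cast<uint32_t>(v[i]) >> 31) << i);
  return mask;
}

// Packing a sign-extended mask gives back the original mask.
void checkRoundTrip(uint16_t mask) { assert(movd2m(movm2d(mask)) == mask); }
```

In IR terms, the first direction corresponds to a `sext` from a vector of i1, and the second to extracting the sign bit of each lane.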

Apparently, we didn't even have MCA tests for these
(they were added in rG2f364f6f0d3a2420ca78cbd80abb186657180e05),
so I'm just guessing that their perf characteristics
are optimal.

Event Timeline

lebedev.ri created this revision. Sat, Nov 20, 3:03 AM
lebedev.ri requested review of this revision. Sat, Nov 20, 3:03 AM

I remember that there was a discussion about using tablegen and the scheduling models to create these cost models. The discussion stopped. Having one diff per instruction does not scale.

Yep.

It's even worse than that: the cost tables ignore many hw-specific facts, such as that some ops can run on multiple pipes (so a throughput cost of 1 might actually be 1/4 of that), and that we usually want the cumulative cost of an entire code sequence, which the InstructionCost class can't do, as it would have to internally track some kind of llvm-mca-like state. I've tried to work on stopgap solutions such as D46276 and D103695, but they rely on high-quality scheduler models, which we just don't have yet (although D103695 has proved useful for iteratively fixing the models as well as the cost tables).

So, the first step would be to ensure we have decent scheduler models for all relevant x86 CPUs, but there are many steps after that :(
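
To make the "multiple pipes" point concrete, here is a rough sketch of the arithmetic. Both helpers are hypothetical and heavily simplified: they assume every uop can issue on any of the given pipes and that a table entry is a whole number of at least 1. Neither is how the actual cost tables or scheduler models are implemented.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical: reciprocal throughput (cycles per instruction) when
// each of numUops uops can issue on any of numPipes pipes.
double reciprocalThroughput(unsigned numUops, unsigned numPipes) {
  return static_cast<double>(numUops) / static_cast<double>(numPipes);
}

// Hypothetical: what a whole-number cost table entry ends up as,
// assuming it is rounded up and never drops below 1.
unsigned integerTableCost(double rthroughput) {
  return std::max(1u, static_cast<unsigned>(std::ceil(rthroughput)));
}
```

Under these assumptions, a single-uop op that can go to any of four pipes has a reciprocal throughput of 0.25, yet its table entry is still 1, the same as an op restricted to one pipe.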

As an outsider, maybe have a document and discussion of the long-term goal, the necessary steps to get there, and document the problems of the current approach.

You say instruction sequence. I would see it as a window. What is the cost of this instruction with a window of size k, i.e., k/2 instructions before and k/2 instructions after? If you increase accuracy, then costs would rise.

NFC, precommitted some more tests - apparently I've completely neglected the VL/+prefer-256-bit side of things,
but that is for a followup commit.

Now with actual +VL support :(

Fix really dumb think-o - trunc cost is obviously 2, not 1 - https://godbolt.org/z/1TeYMjWE7 - trunc keeps lowest bit, while VPMOV.2M keeps the highest bit.
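
A scalar sketch of why the two differ, per lane. The helper names are hypothetical, and the "shift, then take the sign bit" sequence is an assumption about the lowering that matches the cost of 2 stated above, not something taken from the patch itself.

```cpp
#include <cstdint>

// trunc to i1 keeps the lowest bit of the lane.
bool truncToI1(uint32_t lane) { return (lane & 1u) != 0; }

// VPMOVD2M keeps the highest (sign) bit of the lane.
bool signBit(uint32_t lane) { return (lane >> 31) != 0; }

// Assumed two-step sequence: move bit 0 into the sign position,
// then take the sign bit. This matches trunc for every input.
bool truncViaShiftThenSignBit(uint32_t lane) { return signBit(lane << 31); }
```

For a lane holding 1, `truncToI1` yields true while `signBit` yields false, so the mask-pack instruction alone is not a trunc; the extra shift is the second instruction.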

RKSimon accepted this revision. Mon, Nov 22, 3:14 AM

LGTM

This revision is now accepted and ready to land. Mon, Nov 22, 3:14 AM

Thank you for the review!

This revision was landed with ongoing or failed builds. Mon, Nov 22, 3:40 AM
This revision was automatically updated to reflect the committed changes.