Page MenuHomePhabricator

[X86][BtVer2] Fix WriteFShuffle256 schedule write info.
ClosedPublic

Authored by andreadb on Aug 30 2018, 8:59 AM.

Details

Summary

This patch fixes the number of micro opcodes, and processor resource cycles for the following AVX instructions:

vinsertf128rr/rm
vperm2f128rr/rm

Tests have been regenerated using the usual scripts in the llvm/utils directory.

Please let me know if okay to commit.

Thanks @RKSimon for spotting the issue with WriteFShuffle256.

Andrea

Diff Detail

Repository
rL LLVM

Event Timeline

andreadb created this revision.Aug 30 2018, 8:59 AM
RKSimon added inline comments.Aug 30 2018, 9:04 AM
test/tools/llvm-mca/X86/BtVer2/resources-avx1.s
1791 ↗(On Diff #163346)

vbroadcastf128 looks dodgy as well

[MCA can get the latency, and micro-ops for the most part, but ]

processor resource cycles

Really ignorant question: how do you decide what is the right values?
Is there some documentation?

Really ignorant question: how do you decide what is the right values?
Is there some documentation?

AMD's Fam16h SOG which includes a latency spreadsheet - https://developer.amd.com/resources/developer-guides-manuals/

The Fam15h SOG have similar charts at the back of the PDF as do many of the Intel CPUs in their AOM docs

Really ignorant question: how do you decide what is the right values?
Is there some documentation?

AMD's Fam16h SOG which includes a latency spreadsheet - https://developer.amd.com/resources/developer-guides-manuals/

The Fam15h SOG have similar charts at the back of the PDF as do many of the Intel CPUs in their AOM docs

Hm, perhaps i wrote too much :)
I was specifically asking *not* about the latency/micro-ops, but about the resource cycles.

avt77 added a comment.Aug 30 2018, 9:54 AM

I don't see any changes for VEXTRACTF128 in tests. Do you really need this JWriteVecExtractF128? If YES you should add the corresponding test.

I don't see any changes for VEXTRACTF128 in tests. Do you really need this JWriteVecExtractF128? If YES you should add the corresponding test.

There is no change in tests for VEXTRACTF128 because I didn't touch its latency/throughput profile info.

I need JWriteVecExtractF128 because otherwise I would affect profile info for VEXTRACTF128, which is already correct.
I don't think that I need more tests for it, as there are already existing tests for that instruction (both in llvm-mca and CodeGen/X86).

@lebedev.ri , the amdfam16h SOG also reports that data paths are 128-bits wide. AVX 256-bit instructions are effectively split into two opcodes, and consume twice as many pipeline resources as their 128-bit counterpart. The spreadsheet in the SOG should also report the reciprocal throughput (that's how I derived those resource cycles).

andreadb updated this revision to Diff 163375.Aug 30 2018, 11:07 AM

Patch updated with a fix for the throughput information for VBROADCASTF128.

This should address Simon's review comment.

RKSimon accepted this revision.Aug 30 2018, 1:12 PM

LGTM - thanks

This revision is now accepted and ready to land.Aug 30 2018, 1:12 PM
This revision was automatically updated to reflect the committed changes.