This is an archive of the discontinued LLVM Phabricator instance.

[X86][Haswell] Updating HSW instruction scheduling information
ClosedPublic

Authored by gadi.haber on Aug 13 2017, 11:55 PM.

Details

Summary

This patch completely replaces the instruction scheduling information for the Haswell architecture target by modifying the file X86SchedHaswell.td located under the X86 Target.
We used the scheduling information retrieved from the Haswell architects in order to replace and modify the existing scheduling.
The patch continues the scheduling replacement effort started with the SNB target in r307529 and r310792.
Information includes latency, number of micro-Ops and used ports by each HSW instruction.

Please expect some performance fluctuations due to code alignment effects.

Diff Detail

Repository
rL LLVM

Event Timeline

gadi.haber created this revision.Aug 13 2017, 11:55 PM
craig.topper edited edge metadata.Aug 18 2017, 2:49 PM

I think the following are missing from the HWWriteResGroups

IMUL32rmi 2 uops
IMUL32rmi8 2 uops
IMUL32rri 1 uop
IMUL32rri8 1 uop
IMUL64rmi32 2 uops
IMUL64rmi8 2 uops
IMUL64rri32 1 uop
IMUL64rri8 1 uop

lib/Target/X86/X86SchedHaswell.td
2748 ↗(On Diff #110916)

I believe believe this

IMUL8m - 2 uops
IMUL16m - 5 uops
IMUL32m - 4 uops
IMUL64m - 3 uops

2762 ↗(On Diff #110916)

I believe this

MUL8m - 2 uops
MUL16m - 5 uops
MUL32m - 4 uops
MUL64m - 3 uops

3156 ↗(On Diff #110916)

Can you recheck this. I believe the following

MUL16r/IMUL16r - 4 uops
MUL32r/IMUL32r - 3 uops
IMUL64r/IMUL64r - 2 uops
MULX64rr - 2 uops

craig.topper added inline comments.Aug 18 2017, 2:57 PM
lib/Target/X86/X86SchedHaswell.td
2722 ↗(On Diff #110916)

Should this account for load latency?

Unfortunately, I could not retrieve the data about the mentioned IMUL instructions from the HSW architects.
I hope to dig into the archives together with the architects and search for all missing instructions in the next commits for both SNB, HSW and the next SKX and SKL patches.

Ignore my last comment.
I was able to retrieve the information for IMUL(16|32|64)rri8 and IMUL(16|32|64)rmi8
I will add them in.

gadi.haber added inline comments.Aug 20 2017, 5:46 AM
lib/Target/X86/X86SchedHaswell.td
2722 ↗(On Diff #110916)

yes, according to the SNB architects.

Updated diff after comments by Craig. Found missing some of the IMUL instrs + added the rr_REV instructions.

gadi.haber added inline comments.Aug 20 2017, 5:51 AM
lib/Target/X86/X86SchedHaswell.td
3156 ↗(On Diff #110916)

Will re-check with the architects to verify.

craig.topper added inline comments.Aug 20 2017, 10:15 AM
lib/Target/X86/X86SchedHaswell.td
2722 ↗(On Diff #110916)

If it shoudl include load latency shouldn't it have a latency of more than 3? ADDPDrr is in a group with latency 3. So shoudln't ADDPrm be more than 3?

gadi.haber added inline comments.Aug 21 2017, 12:01 AM
lib/Target/X86/X86SchedHaswell.td
2722 ↗(On Diff #110916)

The scheduling model is based on the fact that there are no memory latencies effects, i.e., no cache misses and everything is in the 1st level cache.
This is the model successfully used and constantly verified by the architects.
The performance measurements we ran support this model.

craig.topper added inline comments.Aug 21 2017, 12:42 AM
lib/Target/X86/X86SchedHaswell.td
2722 ↗(On Diff #110916)

I understand assuming everyting is in L1.

But in the SandyBridge model you have ADDPDrr as 3 cycles and ADDPDrm as 9 cycles. So it seems you're accounting for the load as being 6 cycles. But in Haswell you have both as 3 cycles. So loads from the L1 are free on Haswell?

gadi.haber added inline comments.Aug 21 2017, 1:04 AM
lib/Target/X86/X86SchedHaswell.td
2722 ↗(On Diff #110916)

What I understood from the architects who explained it to me is that the memory access in SNB required additional cycles even when everything is in L1.
The exact additional cycles depends on the instruction's ucode.
As a result there are memory instructions that require less additional cycles than others. For example: MOV(16|32|64)rr requires 1 cycle whereas MOV(16|32|64)rm requires 5 cycles.
In here he difference is 4 cycles (not additional 6 as in ADDPD).

Updated diff file after comments by Craig Topper about differences in MUL and IMUL instructions for 8 vs. 16 vs 32 vs 64 bits.

craig.topper added inline comments.Aug 24 2017, 3:30 PM
lib/Target/X86/X86SchedHaswell.td
2372 ↗(On Diff #111879)

BSWAP16r seems to no longer be present.

3195 ↗(On Diff #112335)

Is MUL32r/IMUL32r really different than MULX32rr?

gadi.haber added inline comments.Aug 27 2017, 1:02 AM
lib/Target/X86/X86SchedHaswell.td
2372 ↗(On Diff #111879)

Yes. Removed it temporarily for performance evaluation.
Brought it back.

3195 ↗(On Diff #112335)

yes. This one is different.

Updated diff following comments by Craig. Brought back BSWAP(16|32) + fixed ports for the MUL and IMUL 16 and 32.

This revision is now accepted and ready to land.Aug 27 2017, 8:59 AM
This revision was automatically updated to reflect the committed changes.
llvm/trunk/test/CodeGen/X86/avx512vl-vec-cmp.ll