This is an archive of the discontinued LLVM Phabricator instance.

[llvm-mca][BtVer2] teach how to identify false dependencies on partially written registers.
ClosedPublic

Authored by andreadb on Jul 11 2018, 9:25 AM.

Details

Summary

The goal of this patch is to improve the throughput analysis in llvm-mca for the case where instructions in the input assembly sequence perform partial register writes.

On x86, partial register writes are quite difficult to model, mainly because different processors tend to implement different register merging schemes in hardware.

When the code contains partial register writes, the IPC (instructions per cycles) estimated by llvm-mca tends to diverge quite significantly from the observed IPC (using perf).

Modern AMD processors (at least, from Bulldozer onwards) don't rename partial registers. Quoting Agner Fog's microarchitecture.pdf:

The processor always keeps the different parts of an integer register together. For example, AL and AH are not treated as independent by the out-of-order execution mechanism. An instruction that writes to part of a register will therefore have a false dependence on any previous write to the same register or any part of it.

This patch is a first important step towards improving the analysis of partial register updates.

This patch changes the semantic of RegisterFile descriptors in tablegen, and teaches llvm-mca how to identify false dependences in the presence of partial register writes (for more details: see the new code comments in include/Target/TargetSchedule.h - class RegisterFile).

This patch doesn't address the case where a write to a part of a register is followed by a read from the whole register.
On Intel chips, high8 registers (AH/BH/CH/DH)) can be stored in separate physical registers. However, a later (dirty) read of the full register (example: AX/EAX) triggers a merge uOp, which adds extra latency (and potentially affects the pipe usage).
This is a very interesting article about partial register writes on Intel chips here: https://stackoverflow.com/questions/45660139/how-exactly-do-partial-registers-on-haswell-skylake-perform-writing-al-seems-to

In future, the definition of RegisterFile can be extended with extra information that may be used to identify cases where a register read is slowed down by a merge of a partial write.

Please let me know if okay to commit.

-Andrea

Diff Detail

Repository
rL LLVM

Event Timeline

andreadb created this revision.Jul 11 2018, 9:25 AM
andreadb edited the summary of this revision. (Show Details)Jul 11 2018, 9:27 AM
mattd added inline comments.Jul 11 2018, 11:04 AM
include/llvm/Target/TargetSchedule.td
506 ↗(On Diff #155014)

s/wit/with/

tools/llvm-mca/Instruction.cpp
143 ↗(On Diff #155014)

Style: You could just use WriteLatency in place of the call to getCyclesLeft.

tools/llvm-mca/RegisterFile.cpp
29 ↗(On Diff #155014)

I think you can treat the {0,1U} as a call to the IndexPlusCostPairTy constructor. That would be a bit easier to read.

99 ↗(On Diff #155014)

We should note that index 0 is a special case, and probably have a similar comment in RegisterFile.h where IndexPlusCostPairTy is declared.

153 ↗(On Diff #155014)

Style: Remove a line of white space.

243 ↗(On Diff #155014)

This block looks duplicated from line 233 above. It looks like we will invalidate the supers of RegID twice if WS are clearing super regs.

tools/llvm-mca/RegisterFile.h
155 ↗(On Diff #155014)

Add 'to' to the following phrase: ... allows us to classify ...

RKSimon added inline comments.Jul 11 2018, 11:12 AM
tools/llvm-mca/RegisterFile.cpp
155 ↗(On Diff #155014)

Use RRI.RenameAs direct and make RRI const?

331 ↗(On Diff #155014)

Is this worth it? Just use RRI.IndexPlusCost directly below?

andreadb updated this revision to Diff 155147.Jul 12 2018, 3:45 AM
andreadb marked 7 inline comments as done.

Patch updated.

Addressed review comments.

andreadb added inline comments.Jul 12 2018, 3:48 AM
tools/llvm-mca/RegisterFile.cpp
99 ↗(On Diff #155014)

This is already explained in a couple of places in RegisterFile.h.
See the comments on field RegisterFiles, and type RegisterMapping.
If possible, I prefer not to add another comment about it.

243 ↗(On Diff #155014)

The previous block is invalidating sub-registers of RegID.
This block invalidates super-registers of RegID. So, we cannot remove it.

mattd accepted this revision.Jul 13 2018, 9:51 AM

I'm happy with the changes made, and don't see anything obviously wrong, but you might want to wait a day/weekend to see if anyone else has anything to add. LGTM!

This revision is now accepted and ready to land.Jul 13 2018, 9:51 AM
lebedev.ri added inline comments.Jul 14 2018, 6:19 AM
include/llvm/Target/TargetSchedule.td
465–466 ↗(On Diff #155147)

This should probably be a TODO.
"Software Optimization Guide for AMD Family 17h Processors"
http://developer.amd.com/wordpress/media/2013/12/55723_SOG_Fam_17h_Processors_3.00.pdf
"34 Microarchitecture of AMD Family 17h Processor Chapter 2"
"2.11 Floating-Point Unit"

It can handle dispatch and renaming of 4 floating point micro ops per cycle

So maybe having one single DispatchWidth per ProcResGroup and using that for
both the dispatch and renaming is the way to go.

RKSimon accepted this revision.Jul 14 2018, 7:01 AM

LGTM with a couple of final minor docs/comments fixes.

include/llvm/Target/TargetSchedule.td
465–466 ↗(On Diff #155147)

Yes, please put a TODO in front of this paragraph:

TODO: This implementation currently assumes that there is no limit in the number of renames per cycle, which might not be true for all hardware or register classes.
lib/Target/X86/X86ScheduleBtVer2.td
50 ↗(On Diff #155147)

Please can you check the section numbers - it doesn't seem to match the latest version of Agner's doc - as he seems to reorder things with each release, adding the full secton/subsection titles might make sense here.

This revision was automatically updated to reflect the committed changes.

ZnVer1 will need updating, too.
I could take a look, but i have an immediate question about FP registers:

llvm/trunk/lib/Target/X86/X86ScheduleBtVer2.td
56

A bit late, but i'm not sure about the FP registers.

But the high and low 64 bits of a 128 bit register are treated as independent on
Bobcat, and the high and low 128 bits of a 256 bit register are treated as independent on
Jaguar.

Does that mean that it should be def JFpuPRF: RegisterFile<72, [VR128, VR256], [1, 1, 2]>; ?
Or does that mean on Bobcat <and later>?

andreadb added inline comments.Jul 16 2018, 8:09 AM
llvm/trunk/lib/Target/X86/X86ScheduleBtVer2.td
56

Register class VR64 is the class used by x87/mmx registers.
Those eight registers don't alias with XMM/YMM registers, and they are subject to register renaming.

Bobcat provides native support for 64-bit data types, but not 128-bit data types. So, operations on 128-bit data types are split into 2x64-bit pairs. Internally, the PRF consumes two physical registers to map a single XMM write.

Agner describes this in Section 21.1: The Bobcat has 64-bit physical registers and uses two such registers to save a 128-bit vector.

I hope it helps.
-Andrea

lebedev.ri added inline comments.Jul 16 2018, 8:59 AM
llvm/trunk/lib/Target/X86/X86ScheduleBtVer2.td
56

Thank you.
Yes, i think it did.
Basically, i think it confirmed that the FP RegisterFile for both the btver2, and znver1 is correct.
So i just need to adjust the integer RegisterFile on znver1.