This is an archive of the discontinued LLVM Phabricator instance.

[llvm-mca][X86] Teach how to identify register writes that implicitly clear the upper portion of a super-register.
ClosedPublic

Authored by andreadb on Jun 15 2018, 9:27 AM.

Details

Summary

This patch teaches llvm-mca how to identify register writes that implicitly zero the upper portion of a super-register.

On X86-64, a general purpose register is implemented in hardware as a 64-bit register. Quoting the Intel 64 Software Developer's Manual: "an update to the lower 32 bits of a 64 bit integer register is architecturally defined to zero extend the upper 32 bits".
Also, a write to an XMM register performed by an AVX instruction implicitly zeroes the upper 128 bits of the aliasing YMM register.

This patch adds a new method named clearsSuperRegisters to the MCInstrAnalysis interface to help identify instructions that implicitly clear the upper portion of a super-register.
The rest of the patch teaches llvm-mca how to use that new method to obtain the information, and update the register dependencies accordingly.

I compared the kernels from tests clear-super-register-1.s and clear-super-register-2.s against the output from perf on btver2.
Previously there was a large discrepancy between the estimated IPC and the measured IPC. Now the differences are mostly in the noise.

Please let me know if okay to commit.

Thanks,
Andrea

Diff Detail

Event Timeline

andreadb created this revision.Jun 15 2018, 9:27 AM

Should there be some AVX512VL tests? We don't have any scheduler that can test XOP instructions AFAICT (unless we want to cheat and use SandyBridge in its role as the generic model).

include/llvm/MC/MCInstrAnalysis.h
88

When is it better to use BitVector vs APInt? I don't have an answer but we're incredibly inconsistent on this!

lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
19

Include ordering?

316

XOP instructions? I think the TBM instructions that use this encoding will be safe as they are always GR32/GR64.

323

Is this safe for i686 32-bit targets?

I like the fact that we can now keep track of the implicit zero of the upper register bits. Very cool. This change makes sense, and as far as I can tell it looks good, but I'd wait for a few others to weigh in.

lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
307

Since you have a static routine for creating instances of this class: createX86MCInstrAnalysis, you can probably make this ctor private. It can probably be marked explicit, to imply no converting.

328

Can we be sure that NumDefs <= BitVector::size? I know Bitvector's access operators assert to check for out of bounds indexing.

tools/llvm-mca/InstrBuilder.h
55

Should the names of the formal parameters be capitalized? I realize you are trying to avoid clashing with the member names.

Should there be some AVX512VL tests? We don't have any scheduler that can test XOP instructions AFAICT (unless we want to cheat and use SandyBridge in its role as the generic model).

Thanks Simon,

I will add a test for AVX512VL. If you have a good idea about how to test XOP, then I can add more tests for it too.

include/llvm/MC/MCInstrAnalysis.h
88

I think using a BitVector (at least in this context) is probably okay. But - to be honest - I don't know the right answer to that question either.
The idea is to use a simple bitvector to do very simple bit manipulation.

APInt has a much richer interface. It allows to do other things (other than bit manipulation). It allows to do arithmetic and logic computation on integers with arbitrary precision. APInt is probably "over designed" for this particular context.

That being said, both interfaces are okay. If you prefer, I can switch to APInt.

lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
19

Sorry about that. I will move the new include before "X86MCAsmInfo.h".

316

Good point. I forgot about XOP.
However - as you mentioned - I have no idea how to test it.
Do you still want me to add the check for XOP, even if we cannot write a test for it? Alternatively, I can leave a FIXME comment. Not sure what is more appropriate in this context...

323

It should be safe for i686 because the super-register of a 32-bit GPR is not usable/existent in practice.

craig.topper added inline comments.Jun 15 2018, 10:41 AM
include/llvm/MC/MCInstrAnalysis.h
88

BitVector always heap allocates, APInt heap allocates above 64 bits. SmallBitVector heap allocates above 58 bits on 64-bit hosts and above 27 bits on 32-bit hosts.

lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
323

VR128RegClass doesn't include XMM16-XMM31. Those are in VR128XRegClass.

Does this cover writes to YMM clearing the upper half of ZMM registers?

andreadb added inline comments.Jun 15 2018, 10:44 AM
lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
307

Unless I am missing something, the createX86MCInstrAnalysis static routine would not be able to see the constructor if we make it private.

328

We cannot be sure. We assume that it has been set to the right size by the user.
Otherwise, it will assert inside BitVector.
What we could do, is to always "resize" the BitVector to the actual number of register writes (from the information in Desc).

tools/llvm-mca/InstrBuilder.h
55

I have seen it done in a few (not many) places in LLVM specifically to avoid the issue with name clash. I can use different names to avoid the name clashing. Alternatively, I just capitalise the param names; it would still work, but it would not be nice to read.

Does this cover writes to YMM clearing the upper half of ZMM registers?

Good point. I didn't consider that case.
I guess we need to conservatively also check for YMM writes. Obviously, the implicit zeroing of the upper half of ZMM would only make sense for targets where ZMM is effectively usable...

include/llvm/MC/MCInstrAnalysis.h
88

Thanks for the answer. I change it to APInt then.

lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
323

I will change it. Thanks.

mattd added inline comments.Jun 15 2018, 11:02 AM
lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
307

You are correct, createX86MCInstrAnalysis would have to be a static method of X86MCInstrAnalysis to make private the create* routine. I see that none of the other create* routines in here are presented that way, so it probably doesn't make much sense to change this one. Please ignore my original comment.

andreadb updated this revision to Diff 151713.Jun 18 2018, 7:34 AM
andreadb marked 10 inline comments as done.

Patch updated.

Addressed review comments.
We now use an APInt instead of a BitVector to keep track of writes that update super-register(s).

A few XOP and AVX512 tests have been added at revision 334945 (http://llvm.org/viewvc/llvm-project?view=revision&revision=334945).

This new patch shows the updated analysis. Now the tool correctly kills the dependencies with the zeroed portions of YMM/ZMM registers. The new IPC for those tests looks much more realistic now. For example, avx512-super-registers-2.s goes from IPC 0.29 to IPC 1.89, with a teoretical maximum IPC of 2.00 (computed as NumInstructions / Block RThroughput = 6 / 3).

Please let me know if okay to commit.

Thanks,
Andrea

No more comments from me - @mattd @craig.topper are you guys happy with this?

mattd added a comment.Jun 19 2018, 8:23 AM

I'm cool with the patch. I don't see anything wrong.

lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
330

I like this lambda, and it makes sense to me. AFAIK, the [=] capture encourages a copy of GR32RC and VR256XRC. Can we get away with a reference capture [&] instead?

mattd added inline comments.Jun 19 2018, 8:34 AM
lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
330

Nevermind, the GR43RC and VR128XRC are not automatic to this lambda, so they are propagated via reference semantics. This looks good to me.

Thanks Matt and Simon.

I will wait for the okay from Craig on the AVX512 part before committing this patch.

Cheers,
Andrea

craig.topper added inline comments.Jun 19 2018, 10:08 AM
lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
348

What should happen if you enable avx512f and xop instructions at the same time? I know no real CPU supports it, but should a 256-bit xop instruction clear the upper bits of zmm?

andreadb added inline comments.Jun 19 2018, 10:54 AM
lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
348

For XOP instructions, document "AMD64 Architecture Programmer’s Manual Volume 4: 128-Bit and 256-Bit Media Instructions" says that:

Bits [255:128] of the YMM register that corresponds to the destination are cleared

That sentence is related to XOP instructions that set an XMM register.

However, that same document uses the a very similar sentence when describing AVX instructions. For example, for VADDPD we have this:

XMM Encoding:
The first source operand is an XMM register. The second source  operand is either an XMM register or  a 128-bit memory location. The destination is a third XMM regis
ter. Bits [255:128] of the YMM register that corresponds to the destination are cleared.

VLMAX (or, the concept of a "maximum vector register width" for the processor) is not even mentioned in the entire document. So, I honestly don't know what is the right answer to your question.

If we want to be conservative, then we can assume for now that XOP does not update the upper bits of a ZMM register. In future (if AMD decides not to drop XOP), then we revisit this choice and update/simplify this code. What do you think?

lebedev.ri added inline comments.
lib/Target/X86/MCTargetDesc/X86MCTargetDesc.cpp
348

In future (if AMD decides not to drop XOP)

They already dropped it (and TBM), they don't exist in Zen.

So as it is now, writing a VR128X with XOP will zero [511:256] and [255:128], but writing VR256X with xop won't?

So as it is now, writing a VR128X with XOP will zero [511:256] and [255:128], but writing VR256X with xop won't?

I see what you mean. I don't want to complicate the API (especially since there is no cpu with XOP and AVX512f). What if I treat XOP the same as AVX then? In practice, this won't make any difference.

I'm fine with treating it the same as AVX.

andreadb updated this revision to Diff 151957.Jun 19 2018, 11:32 AM

Patch updated.

Treat XOP the same way as AVX. When the destination of a XOP instruction is an XMM register, we assume that the upper portion of all super-registers is cleared.

Ok the register class/feature stuff looks good to me now.

RKSimon accepted this revision.Jun 20 2018, 2:36 AM

We have a quorum! LGTM

This revision is now accepted and ready to land.Jun 20 2018, 2:36 AM

We have a quorum! LGTM

Cheers ;-)

Thanks Craig/Simon/Matt for all the feedback.

This revision was automatically updated to reflect the committed changes.