Page MenuHomePhabricator

[llvm-exegesis] Disable the LBR check on AMD
ClosedPublic

Authored by oontvoo on Feb 25 2021, 2:44 PM.

Details

Summary

https://bugs.llvm.org/show_bug.cgi?id=48918

The bug reported a hang (or very very slow runtime) on a Zen2. Unfortunately, we don't have the hardware right now to debug it and I was not able to reproduce the bug on a HSW.
Theory we've got is that the lbr-checking code could be confused on AMD.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
mgorny added inline comments.Feb 25 2021, 3:05 PM
llvm/tools/llvm-exegesis/lib/X86/Target.cpp
763

I wonder if it wouldn't be better to explicitly check for Intel here, or maybe even for Intel family new enough for LBR. I'm going to guess that Cyrix/VIA processors don't have it either.

ondrasej added inline comments.Feb 26 2021, 3:54 AM
llvm/tools/llvm-exegesis/lib/X86/Target.cpp
763

+1. The perf counters are vendor specific (in particular, we have some Intel-specific magic numbers in the code), so it would make more sense to allow only Intel rather than reject only AMD. We can allow more architectures as soon as we can verify the LBR support on them.

oontvoo added inline comments.Mar 2 2021, 12:12 PM
llvm/tools/llvm-exegesis/lib/X86/Target.cpp
763

Good point.
So ideally we want to say abort if we see anything that's not Intel, but looking at llvm::Triple::VendorType, I don't see Intel listed as a vendor.
A bit surprised... Does anyone know why?

mgorny added inline comments.Mar 2 2021, 1:14 PM
llvm/tools/llvm-exegesis/lib/X86/Target.cpp
763

Probably because nobody lists CPU vendor in the vendor field. It's usually pc or unknown.

IIRC the code is going to run on host, so you probably want to look at CPUID for the CPU vendor. sys::getHostCPUName has some logic to check for Intel/AMD but doesn't seem to return that usefully. Maybe you should copy the part of that logic, i.e. something like:

if (getX86CpuIDAndInfo(0, &MaxLeaf, &Vendor, &ECX, &EDX) || MaxLeaf < 1 || Vendor != 0x756e6547)
  return; // not intel
oontvoo updated this revision to Diff 327590.Mar 2 2021, 2:21 PM
oontvoo marked an inline comment as done.

Updated diff: Add a util in Host.h to check the host CPU's vendor.

mgorny added inline comments.Mar 2 2021, 11:14 PM
llvm/lib/Support/Host.cpp
502

Smells a bit weird that you're declaring EAX and EBX if you never intend to use it.

llvm/tools/llvm-exegesis/lib/X86/Target.cpp
738

Unless I'm missing something, state is no longer necessary here.

ondrasej added inline comments.Mar 3 2021, 3:45 AM
llvm/include/llvm/Support/Host.h
73

Why not "enum class"? With that, we would not need the "SIG_" prefix on all the value names.

llvm/lib/Support/Host.cpp
502

The vendor ID strings span over EBX, EDX, and ECX (for Intel, it spells "GenuineIntel"), and we should check all three of them, not just the first 32 bits, because this might be ambiguous (according to Wikipedia, there are non-Intel chips where the text also starts with "Genu" - even though they are probably old Intel clones).

I think we have two options here:

  1. Remove VendorSignatures and return an Optional<std::string> (because the signature is a human-readable string anyway) and provide constants for the most common vendor strings,
  2. Instead of returning a string, return a fixed size structure (a 12-byte array or a struct with fields for EBX, EDX, ECX),
  3. Keep VendorSignatures, and add one value for each known signature (UNKNOWN, GENUINE_INTEL for "GenuineIntel", AUTHENTIC_AMD for "AuthenticAMD", ...) do the proper string matching inside getVendorSignature() and let people add new signatures as they need them.

I'd strongly prefer 1 (all the known vendor signatures are human readable strings anyway) or 3 (easier use on caller side), option 2 makes the use of the signature on caller side more difficult.

518–519

Are the using statements actually needed? It seems that the function or the enum class is never used in this file.

oontvoo updated this revision to Diff 327964.Mar 3 2021, 4:31 PM
oontvoo marked 5 inline comments as done.

Updated diff:
Check the full sig string rather than just the first 4 bytes.

llvm/include/llvm/Support/Host.h
73

Done. Made this an enum class and renamed the members.

llvm/lib/Support/Host.cpp
502

Smells a bit weird that you're declaring EAX and EBX if you never intend to use it.

Removed.

502

Thanks! I've gone with 3) since it's easier to use (we won't need string comparisons).

518–519

The VendorSignatures enum are used. They used to be declared locally to this cpp file, but I've moved it to the header. I didn't want to update all the references due to the change in its namespace, so I'd added this` using`.

I've removed the using VendorSignatures That is not needed now that we've switched to an enum class.
The using namespace is still needed to avoid having to spell out the full ns in other places in this file.

mgorny accepted this revision.Mar 4 2021, 12:11 AM

Can't say how it works on Intel but I can confirm that the problem is gone on AMD.

This revision is now accepted and ready to land.Mar 4 2021, 12:11 AM
ondrasej accepted this revision.Mar 4 2021, 4:31 AM

One more const-comment, otherwise looks good.

llvm/lib/Support/Host.cpp
1130

Can this be const?

oontvoo updated this revision to Diff 328176.Mar 4 2021, 7:57 AM
oontvoo marked an inline comment as done.

Updated diff:

  • make Vendor const
  • move var decls closer to where they are used and made clang-diagnostic-uninitialized happy

Can't say how it works on Intel but I can confirm that the problem is gone on AMD.

Thanks for confirming it for AMD! I've tested it on Intel.

This revision was landed with ongoing or failed builds.Mar 4 2021, 8:17 AM
This revision was automatically updated to reflect the committed changes.

This doesn't build on at least mac/arm hosts: http://45.33.8.238/macm1/4600/step_3.txt

This doesn't build on at least mac/arm hosts: http://45.33.8.238/macm1/4600/step_3.txt

From looking at the code, I think this is broken on all non-intel hosts. Reverted for now in 76148caa505c85e5ea5e57d67f1f618f59d7e86f

This doesn't build on at least mac/arm hosts: http://45.33.8.238/macm1/4600/step_3.txt

From looking at the code, I think this is broken on all non-intel hosts. Reverted for now in 76148caa505c85e5ea5e57d67f1f618f59d7e86f

hmm from the build log, this is failing to compile an x86-specific target (`llvm-exegesis/lib/X86/target.cpp) on an aarch64. That seems weird.

oontvoo reopened this revision.Mar 4 2021, 9:16 AM
This revision is now accepted and ready to land.Mar 4 2021, 9:16 AM
oontvoo updated this revision to Diff 328208.Mar 4 2021, 9:16 AM

Updated diff with fix for non-x86 platforms

lebedev.ri requested changes to this revision.Mar 4 2021, 9:48 AM
lebedev.ri added a subscriber: lebedev.ri.

This is still a preprocessor mess that won't compile, i think.
getVendorSignature() should always exist, but if not being compiled for/on X86, it should return unknown.

This revision now requires changes to proceed.Mar 4 2021, 9:48 AM
ondrasej requested changes to this revision.Mar 4 2021, 10:38 AM
oontvoo updated this revision to Diff 328252.Mar 4 2021, 11:24 AM

Updated diff: defined getVendorSignature() for the non-x86 case too.

This is still a preprocessor mess that won't compile, i think.
getVendorSignature() should always exist, but if not being compiled for/on X86, it should return unknown.

Done

lebedev.ri accepted this revision.Mar 4 2021, 11:35 AM

This seems about right now, thanks.

llvm/tools/llvm-exegesis/lib/X86/Target.cpp
751–752

I don't think you need this any more

ondrasej added inline comments.Mar 5 2021, 9:48 AM
llvm/lib/Support/Host.cpp
507

As noted in the other patch - when MaxLeaf is nullptr we should set it to zero if this return statement is used.

Otherwise, it will keep its default value and for the caller it will be impossible to distinguish this branch from the return statement on line 520.

oontvoo updated this revision to Diff 328567.Mar 5 2021, 9:56 AM

init MaxLeaf to 0

llvm/lib/Support/Host.cpp
507

Done. (Setting MaxLeaf to 0 when it's not null in the check on line 503)

ondrasej accepted this revision.Mar 5 2021, 10:05 AM

Looks good except for the typo.

llvm/lib/Support/Host.cpp
505

This looks like a typo. Also, please move the assignment on the following line in the interest of readability.

This revision is now accepted and ready to land.Mar 5 2021, 10:05 AM
oontvoo updated this revision to Diff 328577.Mar 5 2021, 10:24 AM

Fixed typo

oontvoo marked 2 inline comments as done.Mar 5 2021, 10:24 AM

Thanks all for the review!

llvm/lib/Support/Host.cpp
505

Good catch!

This revision was landed with ongoing or failed builds.Mar 5 2021, 10:25 AM
This revision was automatically updated to reflect the committed changes.
oontvoo marked an inline comment as done.
ppenzin reopened this revision.EditedMar 9 2021, 10:30 PM
ppenzin added a subscriber: ppenzin.

Apologies if this is not the right channel to raise issues with the patch.

Order of registers in vendor string looks off, and I can see this patch breaking a project I am building.

Debugger walkthrough on AMD A10 running FreeBSD:

(gdb) 
515	  if (EBX == 0x756e6547 && ECX == 0x6c65746e && EDX == 0x49656e69)
(gdb) 
519	  if (EBX == 0x68747541 && ECX == 0x69746e65 && EDX == 0x444d4163)
(gdb) 
522	  return VendorSignatures::UNKNOWN;
(gdb) p/x EBX
$2 = 0x68747541
(gdb) p/x ECX
$3 = 0x444d4163
(gdb) p/x EDX
$4 = 0x69746e65

I think the latter two registers are in wrong order (see inline comment). A curious bit is that the old code did not use those.

llvm/lib/Support/Host.cpp
514–519

I think you have ECX and EDX flipped around, vendor string supposed to be read from EBX, EDX, ECX (in order): https://en.wikipedia.org/wiki/CPUID#EAX=0:_Highest_Function_Parameter_and_Manufacturer_ID

I can observe this patch returning UNKNOWN instead of AUTHENTIC_AMD on AMD A10 running FreeBSD.

This revision is now accepted and ready to land.Mar 9 2021, 10:30 PM
oontvoo added inline comments.Mar 9 2021, 11:46 PM
llvm/lib/Support/Host.cpp
514–519

Thanks! The Intel case was correct. ( ECX went first, but its expected value is 0x6c65746e, which is ntel)
I've sent to correct the AMD case.https://reviews.llvm.org/D98322

oontvoo closed this revision.Mar 9 2021, 11:47 PM