Page MenuHomePhabricator

Add method to TargetInfo to get CPU cache line size
ClosedPublic

Authored by zoecarver on Feb 20 2020, 11:23 AM.

Details

Summary

This patch adds a virtual method getCPUCacheLineSize() to TargetInfo. Currently, I've only (partially) implemented the method in X86TargetInfo. It's extremely important that each CPU's cache line size correct (e.g., we can't just define it as 64 across the board) so, it has been a little slow getting to this point. There are still quite a few CPUs I haven't been able to find the cache line size of yet; for those, I'm returning zero so that the caller of this method can propagate an error. See the commented table above X86TargetInfo::getCPUCacheLineSize to check my sources for each CPU.

I'll work on the ARM CPUs next, but that will probably come later in a different patch.

Also, I updated the current uses of cache line sizes in the compiler to use this API when possible. The only one (that I could find) that I didn't update is in TargetTransformInfo. Updating that would require a more significant API change, which would be out of scope for this patch. It would be nice if that also used this API (to keep everything in one place), so I'll try to update that too at some point.

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

No proofs, but for completeness, some amd numbers

clang/lib/Basic/Targets/X86.cpp
1811–1814

K6 should have 32B cache line size

1815–1832

K7..ZnVer2 should have 64B

1833–1834

32B for this one

@lebedev.ri thanks! I'll add those.

  • Add AMD cache line sizes based on @lebedev.ri's comments

Where are you planning to use this?

clang/lib/Sema/SemaStmt.cpp
2817 ↗(On Diff #245708)

"64" here is arbitrary; it's not actually supposed to be the cache line size for any particular CPU, just something generally expensive. I don't think this warning should depend on -mcpu.

As I've mentioned before, depending on what this will be used for, "64" is not a _useful_ answer if you want to know how your memory accesses will behave on modern intel x86 CPUs, despite being the "correct" answer for cache-line size. But, modern intel CPUs fetch aligned-pairs of cache-lines together. Therefore, you'll start to see cache interference effects as-if the cache-lines were 128 bytes, rather than 64 bytes. (This may or may not also apply to AMD cpus, I haven't looked into it.)

jfb added inline comments.Feb 20 2020, 1:56 PM
clang/include/clang/Basic/TargetInfo.h
1211

Return an optional instead of using zero to mean "unknown"?

clang/lib/Basic/Targets/X86.cpp
1840

All of the above (except generic) should be 64.

craig.topper added inline comments.
clang/lib/Basic/Targets/X86.cpp
1738

I'd be surprised if 386 is larger than 486 or 586.

1739

doubleword is 32-bits on X86. So 4 double words is 16 bytes I think?

1833

Nehalem, coooperlake, cannonlake and tigerlake are all 64.

craig.topper added inline comments.Feb 20 2020, 3:52 PM
clang/lib/Basic/Targets/X86.cpp
1834

I think Lakemont is 16 bytes. Assuming I'm interpretting the CLFLUSH line size from this CPUID dump correctly https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel/GenuineIntel0000590_Lakemont_CPUID2.txt

zoecarver marked 7 inline comments as done.Feb 20 2020, 10:07 PM

To address the question, what will this be used for: this could be used for several things in the library and compiler, but the primary use will be to create a builtin that can be used to implement P0154.

@jyknight It would probably be best if we could account for CPUs who like to use aligned pairs when getting a cache line. It's probably also important that we don't change the value getCPUCacheLineSize returns, so, if we are going to account for that, we should probably update getCPUCacheLineSize before this patch lands.

clang/include/clang/Basic/TargetInfo.h
1211

Good idea, will do.

clang/lib/Basic/Targets/X86.cpp
1738

Yes, it does seem surprising but this other source corroborates it https://osxbook.com/blog/2009/03/02/retrieving-x86-processor-information/

1739

Yes, you're right! My bad. This PDF (page 29) also agrees with 16 bytes. I'll update the table.

1833

Great, I'll update it. Thanks.

1834

I may very well be interpreting this incorrectly but I think that the cache line sizes are at eax = 0x80000005 and eax = 0x80000006. However, all the registers at those codes are zero. If I instead look at eax = 0x00000001 (which I think is incorrect), ecx = 00010200 which would be 128 bytes (maybe this is where the 16 came from, if they were bits instead?). How were you interpreting it?

1840

Good to know, I'll update it :)

clang/lib/Sema/SemaStmt.cpp
2817 ↗(On Diff #245708)

This will fall back to 64 as a default if the cache line size isn't known. I can revert this change if you want, though. It's not central to the patch.

craig.topper added inline comments.Feb 20 2020, 10:28 PM
clang/lib/Basic/Targets/X86.cpp
1738

-march=i386 there seem just seems to be a compiler switch. The 386 CPU didn't have CPUID, that was added on the 486. The dump there seems to be a dump from Nehalem. They just compiled the code targeting a really old CPU for some reason.

Thinking about it more, the 386 is a bit weird because the cache wasn't part of the CPU, it was on the motherboard and not always present. Not sure what that means for cache line size. I have an ancient 386 hardware manual in my cube at work, maybe I can find something in there.

1834

Interpreted bits 15:8 of the ecx = 00010200 to be the clflush information. The spec for cpuid says to multiply by 8 to get cache line size. So 15:8 is 2, multiply by 8 is 16 bytes.

I think Lakemont is based on the 486 architecture so that seems possible.

zoecarver marked an inline comment as done.Feb 21 2020, 9:29 AM
zoecarver added inline comments.
clang/lib/Basic/Targets/X86.cpp
1834

I see. That looks correct. I'll update it.

Still, I do find it strange that all registers at eax = 0x80000006 are zeroed out. Using the cpuid instruction on my cpu I get the following:

CPUID 0: 13 71 110 105
CPUID 1: 97 0 191 255
CPUID 2: 1 255 0 0
CPUID 80000000: 8 0 0 0
CPUID 80000001: 0 0 33 0
CPUID 80000002: 73 108 32 101
CPUID 80000003: 41 45 48 67
CPUID 80000004: 64 50 122 0
CPUID 80000005: 0 0 0 0
CPUID 80000006: 0 0 64 0
CPUID 80000007: 0 0 0 0
CPUID 80000008: 39 0 0 0

I know my L2 cache line size is 64 and the L2 cache line size should be ecx at eax = 0x80000006 which would be 64. But, when I read the dump you linked to, it doesn't look like there's any data at 0x80000006.

Anyway, as I said above, I'll look past this and update it based on the clflush information. That seems valid.

zoecarver updated this revision to Diff 245896.Feb 21 2020, 9:40 AM
zoecarver marked 6 inline comments as done.
  • Update values returned by getCPUCacheLineSize based on review comments
zoecarver marked an inline comment as done.Feb 21 2020, 9:43 AM
zoecarver updated this revision to Diff 245897.Feb 21 2020, 9:46 AM
  • Update getCPUCacheLineSize to return optional
Harbormaster completed remote builds in B47018: Diff 245897.
craig.topper added inline comments.Feb 21 2020, 10:26 AM
clang/lib/Basic/Targets/X86.cpp
1834

Lakemont is a modernized version of a 486 which probably had none of the 80000000 leafs originally. My speculation is that they really wanted to support leaf 80000008 so they just put zeros in the rest rather than add a lot of microcode.

zoecarver marked an inline comment as done.Feb 21 2020, 10:34 AM
zoecarver added inline comments.
clang/lib/Basic/Targets/X86.cpp
1834

Alrighty, that makes sense. Thanks :)

Friendly ping. Any other comments on this?

jfb accepted this revision.Feb 27 2020, 1:09 PM
This revision is now accepted and ready to land.Feb 27 2020, 1:09 PM
craig.topper added inline comments.Feb 27 2020, 1:39 PM
clang/lib/Basic/Targets/X86.cpp
1786

I found the documentation for the 82385 cache controller chip for the 386. It's a bit weird. The tags for the cache are based on 16 byte lines, but there are valid bits for every 2 bytes within that line. So its possible for only a portion of a line to be valid.

1840

If Yonah and Penryn are 64, then Core2 should be as well. It's the generation between them.

jyknight requested changes to this revision.Feb 27 2020, 1:59 PM

@jyknight It would probably be best if we could account for CPUs who like to use aligned pairs when getting a cache line. It's probably also important that we don't change the value getCPUCacheLineSize returns, so, if we are going to account for that, we should probably update getCPUCacheLineSize before this patch lands.

It would be extremely unfortunate to go to all the trouble of attempting to return accurate results from the P0154 interfaces, and then to return an answer insufficient to actually achieve the performance benefit it's intended for, and then be unable to fix it due to ABI concerns. So, yes, I do believe that this issue must be decided.

Additionally, my opinion here has really not changed from a couple of years ago. I continue to believe it was a mistake to standardize these constexpr values, and that absolutely the best course of action would be to continue to decline to implement this part of the standard, forever. (And that GCC should similarly also continue to decline to implement it).

That said, the list of cacheline sizes collected in this review is still useful in any case, since it needs to be copied into LLVM's X86Subtarget to implement the backend getCacheLineSize function.

This revision now requires changes to proceed.Feb 27 2020, 1:59 PM

(I assume I'm not seeing a code review being used to veto a C++ Standard feature, but actually the other points are the reason for the red flag.)

I can see a desire for hyper-precise definitions to achieve the best possible performance, but we really need a healthy does of conservatism here.

The C++ values can't change as a result of selecting between microarchitecture variations that are supposed to link, it takes an ABI break to change these.

We specified two numbers here so we could conservatively set them (e.g. constructive: smallest; destructive: largest) if we want, but then they are fixed.

I think there's just two plausible answers for x86_64:

  1. constructive=64, destructive=64 (the only plausible answer for X86 classic edition)
  2. constructive=64, destructive=128 (reserve some room for line-pair designs)

Recapping: precision is nice but we need to fix this in the ABI so ultimate precision isn't required nor desirable; we should choose one of these pairs (or debate if another pair is viable that I'm not thinking of); then we should ship C++17.

clang/lib/Basic/Targets/X86.cpp
1748

Broadwell.

zoecarver added a comment.EditedFeb 27 2020, 7:46 PM

Here's what I think we should do: continue to have this method just return the cache line size. Then have another method that returns true or false for whether the given architecture supports aligned pairs of cache lines then, users of this (either in clang or libc++) can decide what they want to do.

Edit: I'm not tied to this approach, though. I'm happy to update this patch instead if that's what you (all) think is best.

(I assume I'm not seeing a code review being used to veto a C++ Standard feature, but actually the other points are the reason for the red flag.)

I can see a desire for hyper-precise definitions to achieve the best possible performance, but we really need a healthy does of conservatism here.

The C++ values can't change as a result of selecting between microarchitecture variations that are supposed to link, it takes an ABI break to change these.

If these values are part of the C++ target platform ABI, it seems to me the values for std::hardware_{constructive,destructive}_interference_size should be set by whoever has the authority to decide C++ platform ABI for specific platforms.
Assuming my thought in the previous sentence is correct; discussions on which values to chose for std::hardware_{constructive,destructive}_interference_size should happen in whichever forums decide C++ platform ABI for the various platforms? (Maybe for some platforms that forum might be clang-related fora like reviews.llvm.org, but probably not for all platforms).
With my (probably limited) understanding of the requirements, it seems like deriving std::hardware_{constructive,destructive}_interference_size from actual cache line size on a specific micro-architecture doesn't seem to be the right approach?

We specified two numbers here so we could conservatively set them (e.g. constructive: smallest; destructive: largest) if we want, but then they are fixed.

I think there's just two plausible answers for x86_64:

  1. constructive=64, destructive=64 (the only plausible answer for X86 classic edition)
  2. constructive=64, destructive=128 (reserve some room for line-pair designs)

    Recapping: precision is nice but we need to fix this in the ABI so ultimate precision isn't required nor desirable; we should choose one of these pairs (or debate if another pair is viable that I'm not thinking of); then we should ship C++17.

If these values are part of the C++ target platform ABI, it seems to me the values for std::hardware_{constructive,destructive}_interference_size should be set by whoever has the authority to decide C++ platform ABI for specific platforms.
Assuming my thought in the previous sentence is correct; discussions on which values to chose for std::hardware_{constructive,destructive}_interference_size should happen in whichever forums decide C++ platform ABI for the various platforms? (Maybe for some platforms that forum might be clang-related fora like reviews.llvm.org, but probably not for all platforms).
With my (probably limited) understanding of the requirements, it seems like deriving std::hardware_{constructive,destructive}_interference_size from actual cache line size on a specific micro-architecture doesn't seem to be the right approach?

They will be in the library ABI, meaning the libc++ ABI.

It's valid for libstdc++ and libc++ to have different values there. If we wish, we could try for an alignment (no pun intended) on these values, but even then that's just between these two libraries.

Which is good and encouraging, because I don't know what forum we would have to go to.

If these values are part of the C++ target platform ABI, it seems to me the values for std::hardware_{constructive,destructive}_interference_size should be set by whoever has the authority to decide C++ platform ABI for specific platforms.
Assuming my thought in the previous sentence is correct; discussions on which values to chose for std::hardware_{constructive,destructive}_interference_size should happen in whichever forums decide C++ platform ABI for the various platforms? (Maybe for some platforms that forum might be clang-related fora like reviews.llvm.org, but probably not for all platforms).
With my (probably limited) understanding of the requirements, it seems like deriving std::hardware_{constructive,destructive}_interference_size from actual cache line size on a specific micro-architecture doesn't seem to be the right approach?

They will be in the library ABI, meaning the libc++ ABI.

It's valid for libstdc++ and libc++ to have different values there. If we wish, we could try for an alignment (no pun intended) on these values, but even then that's just between these two libraries.

Which is good and encouraging, because I don't know what forum we would have to go to.

I see.
So IIUC, this is library C++ ABI, to be defined by the C++ library. Since no 2 C++ libraries can co-exist in a single application, there is no need for different C++ libraries to agree?

I think there is still an issue then with getting the values of std::hardware_{constructive,destructive}_interference_size in the library implementation derived from compiler builtins.
There are quite a few systems where clang supports targeting a different C++ library than libc++, e.g. libstdc++ on linux or (IIUC) the MSVC C++ library on Windows.
If these C++ libraries implement std::hardware_{constructive,destructive}_interference_size based on a value returned by a compiler builtin, and the different compilers that are used with these libraries return different values for such a builtin, then the library C++ ABI here will be dependent on which compiler used?
Doesn't this indicate that either:

  • std::hardware_{constructive,destructive}_interference_size should not be implemented using a compiler builtin, or
  • all compilers must return the same value for the builtin; for all targets they support.

Overall, that makes me doubt that using a compiler builtin to implement std::hardware_{constructive,destructive}_interference_size is the right direction.
If the functionality in this patch does not need to support implementing std::hardware_{constructive,destructive}_interference_size, then the design constraints for this patch change?

There are a lot of different ways we could implement the feature. We may want to only enable it when -march=native, or maybe only in the unstable ABI, and maybe we want to support aligned pairs on some architectures. I think that's an important discussion to have but I'm not sure _this_ patch is the best place to have that discussion.

Even if we don't use this patch in the implementation I think it would still be a good utility to have. Here's what I suggest: I commit this, create another patch to add a builtin that exposes this API, and then open a libc++ patch with a _possible_ implementation. In that patch, we can discuss how we should actually implement the feature and after we have a consensus I can do the work to implement it. Any objections to that plan?

There are a lot of different ways we could implement the feature. We may want to only enable it when -march=native, or maybe only in the unstable ABI, and maybe we want to support aligned pairs on some architectures. I think that's an important discussion to have but I'm not sure _this_ patch is the best place to have that discussion.

Even if we don't use this patch in the implementation I think it would still be a good utility to have. Here's what I suggest: I commit this, create another patch to add a builtin that exposes this API, and then open a libc++ patch with a _possible_ implementation. In that patch, we can discuss how we should actually implement the feature and after we have a consensus I can do the work to implement it. Any objections to that plan?

Discussing the implementation strategy for std::hardware_{constructive,destructive}_interference_size on a libc++ thread rather than here makes sense.
I'm afraid I don't have a good view on all the ways the API and associated intrinsic proposed here will or could be used in practice.
My only thought on it is that we cannot guarantee that different versions of LLVM will keep on reporting the same number, even for identical targets.

@kristof.beyls I'll try to bring that up in my libc++ patch, thanks.

I'm planning on committing this today unless anyone still has concerns. @jyknight is this path OK with you?

This seems misplaced - why is this in clang and not LLVM?

clang/include/clang/Basic/TargetInfo.h
1210

Comment is no longer valid - returns None instead.
Also, might it be worth explicitly calling out that there is zero guarantees of stability of the returned values?

This patch as it stands is harmless, since as it only defines an internal interface, which is unused. So in that sense, it's perfectly fine to commit even with the remaining unresolved questions about the correct values to return. However, unless we're going to actually use it, adding this code to clang is not useful, for the same reason of it only defining an internal interface which is unused.

Ultimately, I don't see a reason to commit this, until/unless we are going to commit code in Clang using it (which I continue to believe we should not do). So I'd say leaving this in a pending state, until a use is going to be committed immediately afterwards, seems best.

zoecarver marked 3 inline comments as done.Thu, Mar 19, 8:51 AM

@lebedev.ri LLVM may be better, I'm not sure. If you feel strongly I can move it.

@jyknight I'm planning on adding a builtin that uses this method. I can put the patch up for that first if you would like.

clang/include/clang/Basic/TargetInfo.h
1210

Good catch. I think we should try to have as much stability as possible but I can add a note that these values may change.

clang/lib/Basic/Targets/X86.cpp
1786

Interesting. Does this mean that we should be returning a different value?

1840

I'll update it. Thanks.

@lebedev.ri LLVM may be better, I'm not sure. If you feel strongly I can move it.

I do think that it makes much more sense somewhere closer to the backend

@jyknight I'm planning on adding a builtin that uses this method. I can put the patch up for that first if you would like.

but that being said, currently nothing needs that and with all those concerns about exposing this info,
i'd hold off moving it until it is settled whether it is needed (and not a dead code) in the first place :/

In D76525 I've added a builtin that uses this API. We can use that builtin in libc++.

zoecarver marked 2 inline comments as done.Fri, Mar 20, 2:11 PM

I've made the suggested changes. Is there a consensus that this should be moved to LLVM? I think it's fairly clang-specific (given the use case). We can also always move it to LLVM if the need arises in the future.

zoecarver updated this revision to Diff 251758.Fri, Mar 20, 2:12 PM

Fix based on review:

  • update comments
  • return 64 for Core2
craig.topper added inline comments.Fri, Mar 20, 11:58 PM
clang/lib/Basic/Targets/X86.cpp
1786

I think we should make 386 return 16.

zoecarver marked an inline comment as done.Sat, Mar 21, 9:08 AM
zoecarver added inline comments.
clang/lib/Basic/Targets/X86.cpp
1786

Alright. I'll update it.

@lebedev.ri Where should I put this? Could you maybe point me to another LLVM target API that clang uses that I could base this off? Maybe it should go inside the triple or just be a static function?

zoecarver updated this revision to Diff 252079.EditedMon, Mar 23, 9:39 AM
  • Rebase off master
  • For i386, return 16

@lebedev.ri Where should I put this? Could you maybe point me to another LLVM target API that clang uses that I could base this off? Maybe it should go inside the triple or just be a static function?

I would have guessed TargetLoweringInfo/TargetTransformInfo, but i'm not sure if we can access it here without violating layering.

@lebedev.ri Where should I put this? Could you maybe point me to another LLVM target API that clang uses that I could base this off? Maybe it should go inside the triple or just be a static function?

I would have guessed TargetLoweringInfo/TargetTransformInfo, but i'm not sure if we can access it here without violating layering.

I don't think can be done unless the proposed builtin for this lowered to an llvm intrinsic and was handled completely in llvm CodeGen.

@craig.topper that would mean that it couldn't be constexpr though, right?

I don't know enough about clang and llvm to know what is both possible and preferable. @craig.topper @lebedev.ri what is the best course of action here?

lebedev.ri added a comment.EditedWed, Mar 25, 9:13 AM

Like i said in previous comments, this can be moved later, that is not a blocker.
Ship it first :)

@lebedev.ri my misunderstanding. I'll commit :)

This revision was not accepted when it landed; it landed in state Needs Review.Wed, Mar 25, 10:16 AM
This revision was automatically updated to reflect the committed changes.