This is an archive of the discontinued LLVM Phabricator instance.

Hardware cache line size builtins
Needs RevisionPublic

Authored by zoecarver on Aug 27 2019, 12:10 PM.

Details

Summary

Creates the __builtin_hardware_destructive_interference_size and __builtin_hardware_constructive_interference_size builtins proposed by @jfb here. These builtins can be used to implement P0154 in libc++ and other standard libraries. My implementation switches on the target triple to get the max cache line size for that target. I am not sure if this is the best way to implement these builtins, but it will ensure that there is not an ABI break.

Event Timeline

zoecarver created this revision.Aug 27 2019, 12:10 PM
Herald added a project: Restricted Project. · View Herald Transcript
zoecarver edited the summary of this revision. (Show Details)Aug 27 2019, 12:17 PM

My implementation switches on the target triple to get the max cache line size for that target. I am not sure if this is the best way to implement these builtins, but it will ensure that there is not an ABI break.

Passing-by remark: i'm not sure it is possible to guarantee that this will be always correct and that no ABI break will happen.
What if some next model of e.g. x86 processor has 128-byte-wide cache line?
You nominally can't bump the value because it will be an ABI break,
but the default is no longer conservatively correct there - it is smaller than needed,
which will effectively cripple all usages where this size is used as an alignment to avoid cache line sharing.

Passing-by remark: i'm not sure it is possible to guarantee that this will be always correct and that no ABI break will happen.

You're right. I should have said, "least-likely to cause an ABI break." And I completely agree that there is no way to gaurentee this is correct at compile time. hardware_*_interference_size certainly has the potential to do more harm than good but, I think that is another discussion.

It may be a good idea only to enable this when -march=native. And _maybe_ remove the constexpr requirement (though it would make this feature significantly less useful it would also make it significantly more accurate).

numbers for cacheline size.

Passing-by remark: i'm not sure it is possible to guarantee that this will be always correct and that no ABI break will happen.

You're right. I should have said, "least-likely to cause an ABI break." And I completely agree that there is no way to gaurentee this is correct at compile time. hardware_*_interference_size certainly has the potential to do more harm than good but, I think that is another discussion.

I don't see why we'd bother to implement this as a builtin, if we're going to implement it like this. A much simpler implementation would be to have libc++ return 64 for constructive and 128 for destructive, across the board. That'd certainly be abi stable, and also correct, at the moment, for architectures people generally care about. (And we should tell people to never use these if they actually care about it.)

BTW, I note that facebook uses 128 bytes for x86, noting in the source:

Microbenchmarks indicate that pairs of cache lines also see destructive
interference under heavy use of atomic operations, as observed for atomic
increment on Sandy Bridge.

We assume a cache line size of 64, so we use a cache line pair size of 128
to avoid destructive interference.

We should probably tell people never to use this, period. That being said, I like your idea of having it be a constant. The only issue would be when, in the next few years, people start shipping CPUs with 256-byte-wide cache lines.

zoecarver added a comment.EditedAug 27 2019, 1:18 PM

BTW, I note that facebook uses 128 bytes for x86

They also use 64 for arm which is interesting (the opposite of this patch).

Also, see this comment in the same snippet:

We assume a cache line size of 64, so we use a cache line pair size of 128

Which would indicate that 64 is the correct number of bytes for x86. But maybe we should double that for hardware_destructive_interference_size.

Edit: Thinking about it more. A cache pair would usually be in the L2 cache (I think). This feature is looking at the L1 cache line size, so why did Facebook implement it in that way?

jfb requested changes to this revision.Sep 3 2019, 3:14 PM

Sorry for the delayed response, I was on vacation. Thanks for tackling it!

I don't think this is the approach I would take. From my dev meeting lightning talk I would instead:

  1. Add to target infrastructure
  2. Overriding in sub-targets using -march or -mcpu
  3. Overriding on the command line
  4. If set in target, expose the builtin
  5. Generic le32 / be32 ARM targets expose constructive / destructive as 64B
  6. Generic le64 / be64 ARM targets expose constructive as 64B and destructive as 128B
  7. Generic x86 expose constructive / destructive as 64B
  8. Honor existing sub-target preferences
  9. Let maintainers of other targets choose appropriate size

I think this needs to be split up into a few patches, at least one per target. libc++ would then only expose the feature if __has_builtin is true. I'm happy to go into more details if what I say above is to vague :)

This revision now requires changes to proceed.Sep 3 2019, 3:14 PM
jfb added a reviewer: __simt__.Dec 5 2019, 9:25 AM
In D66822#1656476, @jfb wrote:
  1. Add to target infrastructure

Refs D74918.