Page MenuHomePhabricator

Model cache size and associativity in TargetTransformInfo
ClosedPublic

Authored by grosser on Aug 23 2017, 12:44 AM.

Details

Summary

We add the precise cache sizes and associativity for the following Intel
architectures:

  • Penry
  • Nehalem
  • Westmere
  • Sandy Bridge
  • Ivy Bridge
  • Haswell
  • Broadwell
  • Skylake
  • Kabylake

Polly uses since several months a performance model for BLAS computations that
derives optimal cache and register tile sizes from cache and latency
information (based on ideas from "Analytical Modeling Is Enough for High-Performance BLIS", by Tze Meng Low published at TOMS 2016).
While bootstrapping this model, these target values have been kept in Polly.
However, as our implementation is now rather mature, it seems time to teach
LLVM itself about cache sizes.

Interestingly, L1 and L2 cache sizes are pretty constant across
micro-architectures, hence a set of architecture specific default values
seems like a good start. They can be expanded to more target specific values,
in case certain newer architectures require different values. For now a set
of Intel architectures are provided.

Just as a little teaser, for a simple gemm kernel this model allows us to
improve performance from 1.2s to 0.27s. For gemm kernels with less optimal
memory layouts even larger speedups can be reported.

Diff Detail

Repository
rL LLVM

Event Timeline

grosser created this revision.Aug 23 2017, 12:44 AM

Hi Florian, hi Sebastian, Eli,

I added you in case you also would like to provide feedback.

Best,
Tobias

gareevroman edited edge metadata.EditedAug 23 2017, 12:58 AM

information (based on ideas from "Analytical Models for the BLIS Framework").

Could you add the reference to the paper to the summary?

I think throughput and latency of vector fma instructions are pretty constant across micro-architectures too. Can we also add them?

Sorry, probably, it’d require to specify it for each architecture.

asb added a subscriber: asb.Aug 23 2017, 1:07 AM
asb added inline comments.
include/llvm/Analysis/TargetTransformInfo.h
607 ↗(On Diff #112298)

It's conceivable that targets might want to differentiate between the I-cache and D-cache. e.g an embedded target may use info on the I-cache to guide custom code layout optimisations.

Is it worth starting with CL_L1D and CL_L2D instead?

include/llvm/Analysis/TargetTransformInfoImpl.h
348 ↗(On Diff #112298)

Either the comment or the calculation is incorrect!

lib/Target/X86/X86TargetTransformInfo.cpp
92 ↗(On Diff #112298)

Either the comment or the calculation is incorrect!

lib/Target/X86/X86TargetTransformInfo.h
49 ↗(On Diff #112298)

Why not implement getCacheAssociativity too?

grosser updated this revision to Diff 112304.Aug 23 2017, 1:28 AM

Incorporate asb's comments.

  • Use 'D' postfix to clarify we talk about the data caches
  • Fix typo in cache size
  • Implement associativity for X86
grosser edited the summary of this revision. (Show Details)Aug 23 2017, 1:31 AM
grosser marked 2 inline comments as done.

Hi Alex, hi Roman,

thanks for your feedback. I addressed your comments.

Best,
Tobias

include/llvm/Analysis/TargetTransformInfo.h
607 ↗(On Diff #112298)

Very good point. I changed this.

include/llvm/Analysis/TargetTransformInfoImpl.h
348 ↗(On Diff #112298)

The calculation was wrong. I fixed this.

lib/Target/X86/X86TargetTransformInfo.cpp
92 ↗(On Diff #112298)

The calculation. Fixed.

lib/Target/X86/X86TargetTransformInfo.h
49 ↗(On Diff #112298)

Added!

grosser updated this revision to Diff 112305.Aug 23 2017, 1:33 AM
grosser marked 2 inline comments as done.

Add a newline between two functions

lsaba added a subscriber: lsaba.Aug 23 2017, 1:35 AM
fhahn added inline comments.Aug 23 2017, 3:38 AM
include/llvm/Analysis/TargetTransformInfoImpl.h
343 ↗(On Diff #112305)

Maybe it would be safer to be conservative and return 0 here, similar to what getCacheLineSize does currently? That allows passes to check if the target provides accurate information ( != 0).

For example, for ARM Cortex cores the level 1 cache size can vary between 0KByte and 64KByte. [1], [2]

Also, we already have getCacheLineSize. Would it make sense to express the cache size in terms of cache lines, (eg X * Cache line size)? That would make it slightly easier to keep them in sync and avoid situations where getCacheSize() is not a multiple of getCacheLineSize()

[1] https://en.wikipedia.org/wiki/Comparison_of_ARMv8-A_cores
[2] https://en.wikipedia.org/wiki/ARM_Cortex-M

asb added inline comments.Aug 23 2017, 3:55 AM
include/llvm/Analysis/TargetTransformInfoImpl.h
343 ↗(On Diff #112305)

I wondered about returning 0 too, but also think we need a new sentinel value. This function should be able to indicate 1) cache is known size X, 2) cache is unknown size, 3) cache is known size 0 (e.g. there is no L2).

mcrosier edited reviewers, added: efriedma; removed: eli.friedman.Aug 23 2017, 7:31 AM
grosser updated this revision to Diff 112393.Aug 23 2017, 9:45 AM

Make return values optional to be able to distinguish between unknown cache
size and no cache.

We use Optional in the very same way to indicate clearly that no associativity
has been specified.

I think throughput and latency of vector fma instructions are pretty constant across micro-architectures too. Can we also add them?

Sorry, probably, it’d require to specify it for each architecture.

We could do this. Some of the backend experts might be able to help you how to do this best. I think we should do this in a separate patch. Care to propose one?

include/llvm/Analysis/TargetTransformInfoImpl.h
343 ↗(On Diff #112305)

I changed this to llvm::Optional.

Thanks for the useful comments. I now use optional. I am not convinced regarding the cache size in multiples of cache lines. This seems to always require conversions in case we want the actual size in bytes. At least for us, this is the more common value, I believe.

What do other's think. Size in bytes or in cache lines? In the end it's the same information. (We could also add size in bytes for now and add an assert that checks that the returned result (if any) is always a multiple of the cache line size. If somebody needs this information in cache lines, we can always add a helper method.

asb added a comment.Aug 23 2017, 1:56 PM

Size in bytes makes most sense to me. The first thing any caller is going to do is convert away from cache lines to bytes or kilobytes anyway.

fhahn edited edge metadata.Aug 23 2017, 2:43 PM
In D37051#850696, @asb wrote:

Size in bytes makes most sense to me. The first thing any caller is going to do is convert away from cache lines to bytes or kilobytes anyway.

Agreed, Returning the number of cache lines as result of getCacheSize wouldn't be a good idea.

I meant using the cache line size in X86TTIImpl::getCacheSize. This was just a minor potential nit, I think either way is fine.

asb added a comment.Aug 23 2017, 2:48 PM

Agreed, Returning the number of cache lines as result of getCacheSize wouldn't be a good idea.

I meant using the cache line size in X86TTIImpl::getCacheSize. This was just a minor potential nit, I think either way is fine.

Ah, sorry for the misunderstanding. I suspect that the cost of potential confusion outweighs the potential benefit of mistakenly setting a cache size that isn't a multiple of the cache line size, but as you say it's a minor thing and either way would work.

OK. Let me know when this is worth an official LGTM.

asb added inline comments.Aug 23 2017, 11:40 PM
include/llvm/Analysis/TargetTransformInfoImpl.h
348 ↗(On Diff #112393)

Shouldn't this have something like:

`
default:
  llvm_unreachable("Unknown TargetTransformInfo::CacheLevel")

I think throughput and latency of vector fma instructions are pretty constant across micro-architectures too. Can we also add them?

Sorry, probably, it’d require to specify it for each architecture.

We could do this. Some of the backend experts might be able to help you how to do this best. I think we should do this in a separate patch. Care to propose one?

Yes, I will propose it. Sometimes it can be difficult to find this information, since it's not always publicly available.

grosser updated this revision to Diff 112510.Aug 24 2017, 12:38 AM

Use C++ class enum

Add llvm_unreachables

grosser marked 2 inline comments as done.Aug 24 2017, 12:40 AM

Addressed Alex comments.

include/llvm/Analysis/TargetTransformInfoImpl.h
348 ↗(On Diff #112393)

The switch above is fully covered, but I can add this as a form of defensive programming. Thinking about this, it might also be useful to used typed enums for the cache level. I will update the patch accordingly.

fhahn added inline comments.Aug 24 2017, 12:47 AM
include/llvm/Analysis/TargetTransformInfoImpl.h
348 ↗(On Diff #112393)

I think using an enum class would be enough. AFAIK, compilers are pretty good at warning about missing cases in switch statements over enum classes. I think there is no need for default, as the compiler would catch missing patterns.

grosser marked an inline comment as done.Aug 24 2017, 12:50 AM
grosser added inline comments.
include/llvm/Analysis/TargetTransformInfoImpl.h
348 ↗(On Diff #112393)

Right. Especially clang got very good at this and I think gcc also supports this meanwhile.

I did not add a default case on purpose, as this would prevent compiler warnings about uncovered cases. I added a unreachable _after_ the swich, just do document this code is never reached. Most compilers, likely will know this already, but older compilers might warn if they cannot prove the return is unreachable. We help them out here as well.

fhahn accepted this revision.Aug 24 2017, 12:57 AM

LGTM, just a minor nit about the C-style fallthrough comment

include/llvm/Analysis/TargetTransformInfoImpl.h
358 ↗(On Diff #112510)

Are you using the C-style comment to tell GCC that this fallthrough case is intentional? LLVM_FALLTHROUGH provides a slightly more portable and direct way of doing that I think

lib/Target/X86/X86TargetTransformInfo.cpp
112 ↗(On Diff #112510)

Same as above

This revision is now accepted and ready to land.Aug 24 2017, 12:57 AM
grosser updated this revision to Diff 112516.Aug 24 2017, 1:43 AM

Use LLVM_FALLTHROUGH

grosser marked 5 inline comments as done.Aug 24 2017, 1:46 AM

@fhahn, I just addressed your comments.

@asb, I think this is now ready. Could you have a final look?

asb accepted this revision.Aug 24 2017, 1:51 AM

Looks good to me.

This revision was automatically updated to reflect the committed changes.