This is an archive of the discontinued LLVM Phabricator instance.

Adding all X86 Processor families which can help initializing several uArch properties
ClosedPublic

Authored by magabari on Jul 13 2017, 4:08 AM.

Details

Summary

Adding all x86 Processor families to initialize several uArch properties (based on the family)
this shows how gather cost can be initialized based on the proc. family

Diff Detail

Event Timeline

magabari created this revision.Jul 13 2017, 4:08 AM
magabari updated this revision to Diff 106400.Jul 13 2017, 4:21 AM
RKSimon edited edge metadata.EditedJul 13 2017, 5:26 AM

I really don't like this approach - we've been getting closer and closer to killing off X86ProcFamilyEnum for years now and move to a mixture of feature bits and scheduler cost driven decisions that works in a more general manner.

Assuming you intend to use getGatherOverhead() inside X86TTIImpl::getGatherScatterOpCost, might not it be better to just provide a Fast/Slow Gather feature bit, or (even better) finally start trying to use the scheduler model data within X86TTIImpl ?

delena edited edge metadata.Jul 13 2017, 7:03 AM

I agree with you. We also considered using scheduling model (or any other target micro-architecture data base) from TTI. It requires detailed design and enters to our long term plans.
We already started feeding micro-architecture details to LLVM, step-by-step, and it will take a time to complete.
The cost, that is kept in X86TargetTransformInfo.cpp should also be moved to .td file and it will require a special interface for communication between IR and MachineInstructions.
Using processorFamily as a property allows tuning compiler for a processor meanwhile, otherwise we just tie all patches like "gather-enabling" in one chain, waiting until all .td files are entered.

filcab added a subscriber: filcab.Jul 13 2017, 8:05 AM

At the very least: This is adding a ton of unused and unneeded details. If we end up needing to dispatch on processor family, then maybe we can add (some of) these.
Otherwise, I don't think it's reasonable to add all these enum values.

lib/Target/X86/X86Subtarget.cpp
35

GatherOverhead is unsigned. Why set it to INT_MAX?

36

This could just be encoded in a bool ShouldUseGather/ShouldAvoidGather or similar.
What does having a non-boolean get us now*? Why 2?

I'm ok with getting a bool now and changing it to int/unsigned/whatever in the future, but only if we need to.

lib/Target/X86/X86Subtarget.h
19

Stray change.

21

What's a "suitable property", btw?

delena added inline comments.Jul 13 2017, 9:27 AM
lib/Target/X86/X86Subtarget.cpp
36

Gather is available since Haswell (AVX2 set). So technically, we can generate Gathers on all AVX2 processors.
But the overhead on HSW is high, Mohamed put int_max, it will be replaced later with another number. Skylake Client processor has faster Gathers than HSW and performance is similar to Skylake Server (AVX-512).
The specified overhead is relative to the Load operation. "2" is the number provided by Intel architects, we are already using it to calculate GS cost.

I want to say that "GSOverhead = 2" we have today inside X86TTI. The problem, that TTI can't distinguish between HSW and SKL. They both have Gather, but the cost is different.

I assume that TTI should provide a cost for Gather, for *all* AVX2 and AVX-512 processors, but again, it is not one value for all. Vectorizer compares this cost with scalar and interleave variants and chooses the best solution.

craig.topper edited edge metadata.Jul 13 2017, 9:49 AM

If we were to go the route of adding a feature bit for every CPU. I wonder if we should just convert the CPU name into an enum in the target independent subtarget classes. Going through a feature bit to create an enum is just making the number of feature bits larger. Feature bits are stored in a std::bitset that last I checked was 160 bits and is stored in some of the tablegen generated data structures. Due to limitations in various place this max size of 160 is determined by the target with the most feature bits. I believe it got as high as it is because ARM or AArch64 has added a feature bit per CPU similar to this patch.

Another option may be to allow an enum value to be passed as a separate input to ProcModel that would get stored in a separate field in the target independent subtarget. This would also avoid the feature bit pressure.

I still don't get why this can't be achieved with a single 'FeatureFastGather' feature bit, just used by Skylake CPU target - then X86TTIImpl::getGatherScatterOpCost can just call ST->hasFastGather() to decide whether to use gathers or not.

Or if you want to keep the overhead as a cost multiplier you can do something like:

unsigned getGatherOverhead() const { return HasFastGather ? 2 : UINT_MAX; }

These would be much easier to remove when you've had time to sort out the scheduler models.

I still don't get why this can't be achieved with a single 'FeatureFastGather' feature bit, just used by Skylake CPU target - then X86TTIImpl::getGatherScatterOpCost can just call ST->hasFastGather() to decide whether to use gathers or not.

Or if you want to keep the overhead as a cost multiplier you can do something like:

unsigned getGatherOverhead() const { return HasFastGather ? 2 : UINT_MAX; }

These would be much easier to remove when you've had time to sort out the scheduler models.

In order to set different numbers to different mach. Each processor that supports Gather should be able to provide a number, not only /NoGather/SlowGather/FastGather.

In order to set different numbers to different mach. Each processor that supports Gather should be able to provide a number, not only /NoGather/SlowGather/FastGather.

So how many numbers are we talking about? Is it just Skylake, SKX and Cannonlake? (KNL?) I guess we don't want to handle weaker AVX2 CPUs so we don't need a feature bit for those (default getGatherOverhead() to UINT_MAX)

Whatever happens, almost all those new processor enums are not required.

igorb added a subscriber: igorb.Jul 13 2017, 12:51 PM
magabari updated this revision to Diff 107280.Jul 19 2017, 4:19 AM
magabari marked 2 inline comments as done.

removing unnecessary processor
fixing uint issue

lib/Target/X86/X86Subtarget.h
21

it depends in usage of the isAtom\isSLM. for example in X86TargetTransformInfo.cpp::getMaxInterleaveFactor the property will be getMaxInterleaveFactor

Before you continue with this, is there any chance that you can create a phab for the follow on patches that are dependent on this please? To get a better idea of what you are needing this for.

lib/Target/X86/X86Subtarget.cpp
31

You got to all that trouble of adding the other intel cpus and then just need IntelSkylake and IntelSKX?

delena added inline comments.Jul 19 2017, 6:34 AM
lib/Target/X86/X86Subtarget.cpp
32

Mohamed, you have to add ScatterOverhead for SKX.

magabari added inline comments.Jul 22 2017, 11:40 PM
lib/Target/X86/X86Subtarget.cpp
31

This is only one property. even HSW has Gather and later will add it's overhead (currently we dont recommend generating gathers for HSW).

There is other properties that will need to return different values based on the family. so my patch just getting the infrastructure ready for this.

I will upload my patch soon, it uses the "getGatherOverhead" property from the subtarget.

magabari updated this revision to Diff 107821.Jul 23 2017, 4:39 AM

adding ScatterOverhead and updating the max overhead.
max overhead can't be MAX_INT or MAX_UINT because it will overflow on the CM calculation and causing a wrong decision.

delena added inline comments.Jul 23 2017, 4:44 AM
lib/Target/X86/X86Subtarget.h
19

You can use Others instead of Generic.

magabari updated this revision to Diff 107853.Jul 24 2017, 12:38 AM
magabari marked an inline comment as done.

replacing Generic with Others

craig.topper added inline comments.Jul 24 2017, 10:34 PM
lib/Target/X86/X86Subtarget.cpp
29

Why can't this just be done by comparing the CPU string? What did we get by creating an enum that is named the same as the CPU string?

delena added inline comments.Jul 25 2017, 10:42 AM
lib/Target/X86/X86.td
22–1

You don't need Others here, I think.

23

Why do you need this change?

magabari marked 2 inline comments as done.Jul 27 2017, 12:54 AM
magabari added inline comments.
lib/Target/X86/X86Subtarget.cpp
29

X86ProcFamily enum was already exist and we have just added more values. In general I think it's better to parse the CPU strings in one place and after that to start using the enum values instead of comparing the strings all the time.

magabari updated this revision to Diff 108429.Jul 27 2017, 12:55 AM
RKSimon added inline comments.Aug 14 2017, 3:56 AM
lib/Target/X86/X86Subtarget.h
18–1

Do this in X86Subtarget::initializeEnvironment() ?

21

Do this in X86Subtarget::initializeEnvironment() ?

magabari updated this revision to Diff 111943.Aug 21 2017, 3:49 AM
magabari marked 2 inline comments as done.

Simon, could you please take a look on the changes and see if its okay now?

delena added inline comments.Aug 27 2017, 5:11 AM
lib/Target/X86/X86Subtarget.cpp
287

Gather is available since Haswell (AVX2 set). So technically, we can generate Gathers on all AVX2 processors.
But the overhead on HSW is high. Skylake Client processor has faster Gathers than HSW and performance is similar to Skylake Server (AVX-512).
The specified overhead is relative to the Load operation. "2" is the number provided by Intel architects, we are already using it to calculate GS cost.
if (X86ProcFamily == IntelSkylake || hasAVX512)

GatherOverhead = 2;

if (hasAVX512) // SKX and KNL fail here

ScatterOverhead = 2;
290

Please remove.

306

Please remove the "else", the values are already initialized.

magabari updated this revision to Diff 113362.Aug 31 2017, 1:03 AM
magabari marked 3 inline comments as done.

fixing Elena comments

This comment was removed by magabari.
delena accepted this revision.Aug 31 2017, 1:31 AM

LGTM + a minor comment fix

lib/Target/X86/X86Subtarget.cpp
291

*we are already using it to calculate GS cost ->
This parameter is used for cost estimation of Gather Op and comparison with other alternatives.

This revision is now accepted and ready to land.Aug 31 2017, 1:31 AM
magabari updated this revision to Diff 113370.Aug 31 2017, 1:45 AM

fixed comment

RKSimon accepted this revision.Sep 5 2017, 2:44 AM

Simon, could you please take a look on the changes and see if its okay now?

LGTM - I still think this patch only needs to add IntelSkylake but it's a lot better than it was.

This revision was automatically updated to reflect the committed changes.