Page MenuHomePhabricator

[CostModel][X86] Derive TTI costs from complete scheduling models (PR36550) (RFC)
Needs ReviewPublic

Authored by RKSimon on Apr 30 2018, 10:54 AM.

Details

Summary

This is an initial RFC prototype patch for PR36550 - how scheduling model data might be used to determine TargetTransformInfo costs and move us away from hard coded tables for vectorization etc. This is implemented purely in the X86 target for now.

I've only included a few examples, targetting AVX1 for the Jaguar/BtVer2 model, which I've flagged as complete for this patch (the only complete model in x86). I'm reasonably confident in the values for this cpu, other X86 models have known accuracy concerns that llvm-mca/llvm-exegesis will hopefully improve and enable those models to be completed in the future. Perf testing for the Jaguar model with this patch is currently ongoing.

There are several things that I'm hoping this patch will help us to decide:

1 - Only complete scheduling models are supported, everything else (incomplete models or CPUs without a model) still uses the default cost (which should match the current cost tables). I'm concerned about introducing performance regressions into existing code before we have a high confidence in individual models - we have already hit something similar with Silvermont PMULLD regressions (PR37059).

2 - The patch embeds likely code sequences into the cost tables; this is going to get very bulkly very quickly, as we need to provide sequences for a number of ISAs, as well as keeping default costs until we have full scheduler model coverage. This probably will require us to split off the costs tables to its own file. We may want to investigate methods to auto-generate this data as well, but that is beyond the scope of this initial patch. Apologies for the formatting of some of the code sequences, I'll get it cleaned up once we decide how to store these.

3 - A new variant of MCSchedModel::getReciprocalThroughput is introduced that takes an array of MIOpcodes; this can't determine dependencies between the instructions so the result is mainly useful for 'resource pressure' cases, but the MIOpcodes data could be used for codesize costs as well. Is this satisfactory or do we need to account for dependencies as well? That would allow us to estimate latency costs, but is likely to complicate both the data and computation considerably - at that point we'd need to call a version of llvm-mca on every sequence.

4 - In the future I'd like us to move the cost from being based on cycles to a CPU's issue width - for example currently a cost of 1 might be returned for i32 and v4i32 ADDs, but we may actually be able to dispatch 2 or more i32 ADD/cycle but only 1 v4i32. This was proving tricky to get right in my initial experiments, as so many of the default costs would need adjusting to account for this. Future work.

5 - X86's fix to avoid SDIV/UDIV vectorization is clumsy (multiplying the vector costs x20) and appears to have been implemented before the cost of scalarization was being realistically accounted for, it can probably be discarded and we just give a higher default cost for scalar SDIV/UDIV/SREM/UREM.

Diff Detail

Repository
rL LLVM

Event Timeline

RKSimon created this revision.Apr 30 2018, 10:54 AM
RKSimon added reviewers: ABataev, spatel.
craig.topper added inline comments.Apr 30 2018, 11:06 AM
lib/Target/X86/X86TargetTransformInfo.cpp
697

Is this going to cause a bunch of SmallVectors to be constructed at startup? I think that goes against our coding standards.

craig.topper added inline comments.Apr 30 2018, 11:09 AM
lib/Target/X86/X86TargetTransformInfo.cpp
697

Or will it only happen the first time this function runs?

RKSimon added inline comments.Apr 30 2018, 11:17 AM
lib/Target/X86/X86TargetTransformInfo.cpp
697

Thanks for reminding me - first time only but using SmallVector was a quick fix to get around ArrayRef not working properly after the first time for some reason that I didn't investigate thoroughly. IIRC moving the tables out of the functions and into global namespace fixed the issue.

fhahn added a comment.May 4 2018, 11:18 AM

Great, thanks for sharing this Simon. I would be very interested in some performance numbers. I might try to get this working for AArch64 in the next week or two, unless someone else wants to give it a try.

2 - The patch embeds likely code sequences into the cost tables; this is going to get very bulkly very quickly, as we need to provide sequences for a number of ISAs, as well as keeping default costs until we have full scheduler model coverage.

Isn't that essentially duplicating information from instruction selection ? How hard would it be to generate the code snippets automatically ?
It could be done at runtime (something like: generate a "template" vectorized Instruction and lower it in a dummy context to get the MIOp sequence), or maybe even at compile time (not saying it's easy but theoretically we have all the information to do that).

2 - The patch embeds likely code sequences into the cost tables; this is going to get very bulkly very quickly, as we need to provide sequences for a number of ISAs, as well as keeping default costs until we have full scheduler model coverage.

Isn't that essentially duplicating information from instruction selection ? How hard would it be to generate the code snippets automatically ?
It could be done at runtime (something like: generate a "template" vectorized Instruction and lower it in a dummy context to get the MIOp sequence), or maybe even at compile time (not saying it's easy but theoretically we have all the information to do that).

It won't be easy to accomplish. 1:1 mappings are straightforward enough in principle, but are likely to require a lot of refactoring. The complex code sequences (shifts, 256-bit vector integers on AVX1 etc.) are not something that we are even close to having in tablegen form for reuse. GlobalIsel might be able to help some day but I can't see it being useful anytime soon.

RKSimon added a subscriber: filcab.May 9 2018, 6:16 AM

@craig.topper IIRC you were looking at ways to pull the X86InstrInfo.cpp folding tables into their own file and then later attempting to autogen the tables again - is that still the plan? These cost tables are likely to take a similar path.

lib/Target/X86/X86TargetTransformInfo.cpp
697

Update: I haven't found a way to avoid having to use SmallVector to create the code sequences, including if I move the tables into the global namespace - all approaches I've tried seem to have constexpr issues with initialization at compile time - mainly as the sequences appear as std::initializer_list - I'm open to suggestions........

I couldn't figure out how to get the folding tables to work without doing what AMDGPU did when they created AMDGPURegAsmNames.inc.cpp. They used cpp file that appears empty when built by itself due to the use of a preprocessor define that isn't defined. Then they include it in another file after defining that define.

efriedma added inline comments.
lib/Target/X86/X86TargetTransformInfo.cpp
697
struct C { const int (&x)[6]; };
extern const C z = { { 1, 2, 3 } };

The hardcoded number of array elements is a bit ugly, but I'm not sure there's any standard alternative.

efriedma added inline comments.May 16 2018, 2:45 PM
lib/Target/X86/X86TargetTransformInfo.cpp
697

Actually, the following should do the right thing:

#include <initializer_list>
struct C { const std::initializer_list<unsigned> &x; };
int f(int i) {
  static const C z = { { 1, 2, 3 } };
  return z.x.begin()[i];
}

gcc versions older than 5.0 apparently have a bug where the initalizer isn't considered a constant, but it still emits the right code.

RKSimon updated this revision to Diff 147322.May 17 2018, 8:07 AM

Thanks @efriedma using "const std::initializer_list<unsigned>&" seems to work

RKSimon updated this revision to Diff 160565.Aug 14 2018, 6:09 AM

Rebased - this patch is still waiting on a fix to MCSubtargetInfo::resolveVariantSchedClass that will return the default class when an MCInst is not available (== NULL) - @andreadb is investigating.

RKSimon updated this revision to Diff 162155.Aug 23 2018, 6:01 AM

Added the ability to resolve variant instructions when we don't have a MI - uses the default value.

Hi Simon,

utils/TableGen/SubtargetEmitter.cpp
1634–1639

You don't need to worry about computing 'ShouldEmitNullCheck'.
The logic in emitPredicates already knows how to identify so-called "default" predicates (i.e. predicates that always evaluate to 'true').

Note that you can only get to line 1501 in emitPredicates if there is at least one non-default predicate in the sequence of checks. Default predicate are then skipped by the loop coming after that line.

So, I think that you can safely remove all this code, as well as extra argument emitNullCheck from emitPredicates(), and simply change the following two lines:

Line 1501:

SS << "if " << "(MI && (";

Line 1530:

SS << "))\n"; // end of if-stmt

If my understanding is correct, this should be enough to achieve your goal.

RKSimon updated this revision to Diff 162177.Aug 23 2018, 7:08 AM

Simplified SubtargetEmitter.cpp based on @andreadb's feedback

andreadb added inline comments.Aug 23 2018, 7:34 AM
lib/Target/X86/X86TargetTransformInfo.cpp
698–716

I wonder if these cost tables could be auto-generates. For example via tablegen.

class CostTableTransition<MVT, list<Instruction>, int DefaultCost = 1> {...};
class CostTableEntry<ISDOpcode, list<CostTableTransition>> {...};

def FMUL : ConstTableEntry<
  ISD::FMUL,
  [
    MVT::v4f64, [ X86::VMULPDYrr ],
    MVT::v8f32, [ X86::VMULPSYrr ],
    ...
  ]>
>;

That being said, I am not convinced that it is a good idea to do it.
It is probably a lot of work for very little gain in return. Also, it is questionable whether tablegen descriptors would be more readable and less verbose than what you have here...

ABataev added inline comments.Aug 23 2018, 7:37 AM
lib/Target/X86/X86TargetTransformInfo.cpp
698–716

+1 here

RKSimon added inline comments.Aug 25 2018, 10:19 AM
lib/Target/X86/X86TargetTransformInfo.cpp
698–716

I agree that its getting very cluttered, so I'm open to ideas on how we improve the clarity of the cost tables in the future. But I'm not certain yet if a tblgen refactor will give us much extra benefit.

I'd prefer to get this patch in first to demonstrate how we want to drive the costs from MI and scheduler models.

RKSimon added inline comments.Aug 26 2018, 5:53 AM
lib/Target/X86/X86TargetTransformInfo.cpp
698–716

We will also need a plan for PR31274 before we consider a move to tblgen or other big changes.