This is an initial RFC prototype patch for PR36550, showing how scheduling model data might be used to determine TargetTransformInfo costs and move us away from the hard-coded tables used for vectorization etc. It is implemented purely in the X86 target for now.
I've only included a few examples, targeting AVX1 for the Jaguar/BtVer2 model, which I've flagged as complete for this patch (making it the only complete model in X86). I'm reasonably confident in the values for this CPU; other X86 models have known accuracy concerns that llvm-mca/llvm-exegesis will hopefully improve, enabling those models to be completed in the future. Performance testing of the Jaguar model with this patch is ongoing.
There are several things that I'm hoping this patch will help us to decide:
1 - Only complete scheduling models are supported; everything else (incomplete models or CPUs without a model) still uses the default cost (which should match the current cost tables). I'm concerned about introducing performance regressions into existing code before we have high confidence in individual models - we have already hit something similar with the Silvermont PMULLD regressions (PR37059).
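A minimal sketch of the gating, assuming the check keys off MCSchedModel::CompleteModel (the patch's exact hook may differ):

```cpp
// Only trust scheduler-derived costs when the target's model is flagged
// complete; everything else takes the existing hard-coded path.
const MCSchedModel &SM = ST->getSchedModel();
if (SM.CompleteModel) {
  // Derive the cost from the scheduler model (see point 3 below).
} else {
  // Default path: the current cost-table values.
}
```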
2 - The patch embeds likely code sequences into the cost tables; this is going to get very bulky very quickly, as we need to provide sequences for a number of ISAs, as well as keeping default costs until we have full scheduler model coverage. This will probably require us to split the cost tables off into their own file. We may also want to investigate ways to auto-generate this data, but that is beyond the scope of this initial patch. Apologies for the formatting of some of the code sequences, I'll get it cleaned up once we decide how to store them.
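As a purely illustrative sketch (the field names are mine, not necessarily the patch's), a table entry carrying the likely lowering might look something like:

```cpp
// Hypothetical shape of a cost-table entry that embeds the MC opcodes of
// the expected lowering, so throughput can be queried from the sched model.
struct SchedCostTblEntry {
  int ISD;                            // e.g. ISD::FDIV
  MVT::SimpleValueType Type;          // e.g. MVT::v8f32
  unsigned DefaultCost;               // fallback for incomplete models
  SmallVector<unsigned, 4> MCOpcodes; // e.g. {X86::VDIVPSYrr}
};
```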
3 - A new variant of MCSchedModel::getReciprocalThroughput is introduced that takes an array of MIOpcodes; this can't determine dependencies between the instructions, so the result is mainly useful for 'resource pressure' cases, though the MIOpcodes data could be used for code-size costs as well. Is this satisfactory, or do we need to account for dependencies too? That would allow us to estimate latency costs, but is likely to complicate both the data and the computation considerably - at that point we'd effectively need to run a version of llvm-mca on every sequence.
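A hedged sketch of what such an overload could reduce to (not necessarily the exact signature in the patch):

```cpp
// Sum the per-opcode reciprocal throughputs of the sequence. Summing is a
// simple upper bound that assumes the instructions all compete for the same
// resources; inter-instruction dependencies are ignored, so this models
// resource pressure rather than latency.
double MCSchedModel::getReciprocalThroughput(const MCSubtargetInfo &STI,
                                             const MCInstrInfo &MCII,
                                             ArrayRef<unsigned> Opcodes) const {
  double RThroughput = 0.0;
  for (unsigned Opcode : Opcodes) {
    unsigned SchedClass = MCII.get(Opcode).getSchedClass();
    RThroughput += getReciprocalThroughput(STI, *getSchedClassDesc(SchedClass));
  }
  return RThroughput;
}
```

(Variant scheduling classes would need resolving first; the sketch ignores that.)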
4 - In the future I'd like us to move the cost from being based on cycles to a CPU's issue width - for example, currently a cost of 1 might be returned for both i32 and v4i32 ADDs, but we may actually be able to dispatch 2 or more i32 ADDs per cycle and only 1 v4i32. This proved tricky to get right in my initial experiments, as many of the default costs would need adjusting to account for it. Future work.
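To make the idea concrete with made-up numbers (not real btver2 data):

```cpp
// Illustrative only: scaling reciprocal throughput by issue width separates
// ops that can multi-issue from ops that can't. Throughputs are hypothetical.
const MCSchedModel &SM = ST->getSchedModel(); // e.g. IssueWidth == 2
double ScalarAddRTp = 0.5; // two i32 ADDs can issue per cycle (hypothetical)
double VectorAddRTp = 1.0; // one v4i32 ADD per cycle (hypothetical)
int ScalarCost = (int)std::ceil(ScalarAddRTp * SM.IssueWidth); // -> 1
int VectorCost = (int)std::ceil(VectorAddRTp * SM.IssueWidth); // -> 2
```

Under the current cycles-based scheme both would round to a cost of 1.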
5 - X86's fix to avoid SDIV/UDIV vectorization is clumsy (multiplying the vector costs by 20) and appears to have been implemented before the cost of scalarization was being realistically accounted for; it can probably be discarded, and we can instead give a higher default cost for scalar SDIV/UDIV/SREM/UREM.
Is this going to cause a bunch of SmallVectors to be constructed at startup? I think that goes against our coding standards (static constructors are disallowed).
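For what it's worth, one static-constructor-free alternative (purely a sketch, the names are hypothetical) would be plain static opcode arrays referenced by pointer/length, rather than global SmallVectors:

```cpp
// Hypothetical alternative: plain arrays have no runtime constructors, so
// tables built from them can be constant-initialized.
static const unsigned FDivV8F32Seq[] = {X86::VDIVPSYrr};

struct SchedCostTblEntry {
  int ISD;
  MVT::SimpleValueType Type;
  unsigned DefaultCost;
  const unsigned *MCOpcodes; // points at a static array, e.g. FDivV8F32Seq
  unsigned NumOpcodes;
};
```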