This is an archive of the discontinued LLVM Phabricator instance.

[X86][MC][Target] Initial backend support a tune CPU to support -mtune
ClosedPublic

Authored by craig.topper on Aug 3 2020, 2:56 PM.

Details

Summary

This patch implements initial backend support for a -mtune CPU controlled by a "tune-cpu" function attribute. If the attribute is not present, X86 will use the resolved CPU from the target-cpu attribute or the command line.

This patch adds MC layer support for a tune CPU. Each CPU now has two sets of features stored in the GenSubtargetInfo.inc tables. These feature lists are passed separately to the Processor and ProcessorModel classes in tablegen. The tune list defaults to an empty list to avoid changes to non-X86 targets. This annoyingly increases the size of the static tables on all targets, as we now store 24 more bytes per CPU. I haven't quantified the overall impact, but I can if we're concerned.

In order to minimize code changes to non-X86 targets and out-of-tree targets in this initial patch, I've added a bit to the TableGen Target class to tell tablegen whether the target supports a tune CPU. This is used to add TuneCPU as an argument to some constructors in generated files. If the target does not support a tune CPU, the regular CPU is passed in place of TuneCPU in some of the lower layers.

One new test is added to X86 to show a few tuning features with mismatched tune-cpu and target-cpu/target-feature attributes to demonstrate independent control.
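As a sketch of what such a test function might look like (the function body and attribute values here are illustrative, not copied from the actual test): codegen legality follows target-cpu/target-features, while tuning heuristics follow tune-cpu.

```llvm
; Hypothetical example: instruction selection uses the target-cpu feature set,
; while tuning feature flags and the scheduler model follow tune-cpu.
define <4 x i32> @mul_v4i32(<4 x i32> %a, <4 x i32> %b) #0 {
  %c = mul <4 x i32> %a, %b
  ret <4 x i32> %c
}

attributes #0 = { "target-cpu"="x86-64" "tune-cpu"="btver2" }
```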

I have not added a -mtune option to llc/opt or the MC layer command line yet. With no attributes we'll just use the -mcpu value for both. MC layer tools will always follow the normal CPU for tuning.

I'm happy to break this into smaller pieces if that's helpful. Maybe separating MC plumbing from X86?

Diff Detail

Event Timeline

craig.topper created this revision. · Aug 3 2020, 2:56 PM
Herald added a project: Restricted Project. · View Herald Transcript · Aug 3 2020, 2:56 PM
Herald added a subscriber: hiraditya. · View Herald Transcript
craig.topper requested review of this revision. · Aug 3 2020, 2:56 PM

Could you add some test coverage for different scheduling models?

Do the x86 scheduling models do something sane with "impossible" instructions, like if you use AVX512 instructions with Atom tuning?

In order to minimize code changes to non-X86 targets and out-of-tree targets in this initial patch, I've added a bit to the TableGen Target class to tell tablegen whether the target supports a tune CPU.

How much work is it to "support" tune-cpu for a target without actually adding any tuning features? If it's just a few function signatures in the subtarget, I'd prefer to just add it to all targets at once, rather than add the HasTuneCPU flag.

Remove hasTuneCPU from tablegen. Update interfaces in all targets instead.

@andreadb @RKSimon or @efriedma do any of you have suggestions for simple scheduler tests for this? I was hoping I could use -print-schedule like we used to but that no longer exists.

@andreadb @RKSimon or @efriedma do any of you have suggestions for simple scheduler tests for this? I was hoping I could use -print-schedule like we used to but that no longer exists.

I guess you could use -misched-print-dags? Not particularly nice to FileCheck, but should have the data you're looking for.

arsenm added inline comments. · Aug 4 2020, 1:37 PM
llvm/include/llvm/Target/Target.td
1515

This should elaborate more on what a tune CPU is

1579

Ditto

craig.topper updated this revision to Diff 283036. (Edited) · Aug 4 2020, 3:15 PM

Remove the hasTuneCPU bit from Target.td. That was supposed to be removed in the previous patch.

Add a description of TuneFeatures in Target.td.

Add a cpu-tune-schedule.ll test using -misched-print-dags.

Do the x86 scheduling models do something sane with "impossible" instructions, like if you use AVX512 instructions with Atom tuning?

AFAICT it should still use the model from the base cpu, and just apply the fast/slow feature flags from the tune cpu - so it shouldn't permit any mismatches - is that right?

@andreadb @RKSimon or @efriedma do any of you have suggestions for simple scheduler tests for this? I was hoping I could use -print-schedule like we used to but that no longer exists.

Since I've been working on HADD/SUB recently :-) Using -mtune=btver2 on a 'slow' base cpu should fold single shuffle HADD patterns. Tests based on other simple feature bits like SlowSHLD should be simple as well. More complex cases like LEA/NOPs etc. might work but could be trickier.
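A minimal test along the lines Simon describes might look like this (hypothetical IR, using the canonical single-vector horizontal-add pattern; whether this folds to haddps depends on the tuning flags, which is exactly what such a test would check):

```llvm
; With -mtune of a fast-hops CPU this shuffle+shuffle+fadd pattern is
; expected to fold to haddps; with slow-hops tuning it stays as shuffles.
define <4 x float> @hadd_v4f32(<4 x float> %a) #0 {
  %l = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> <i32 0, i32 2, i32 undef, i32 undef>
  %r = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> <i32 1, i32 3, i32 undef, i32 undef>
  %s = fadd <4 x float> %l, %r
  ret <4 x float> %s
}

attributes #0 = { "target-cpu"="x86-64" "tune-cpu"="btver2" }
```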

Do the x86 scheduling models do something sane with "impossible" instructions, like if you use AVX512 instructions with Atom tuning?

AFAICT it should still use the model from the base cpu, and just apply the fast/slow feature flags from the tune cpu - so it shouldn't permit any mismatches - is that right?

This patch changes the scheduler model selection to the tune CPU. See the change to MCSubtargetInfo::InitMCProcessorInfo

@andreadb @RKSimon or @efriedma do any of you have suggestions for simple scheduler tests for this? I was hoping I could use -print-schedule like we used to but that no longer exists.

Since I've been working on HADD/SUB recently :-) Using -mtune=btver2 on a 'slow' base cpu should fold single shuffle HADD patterns. Tests based on other simple feature bits like SlowSHLD should be simple as well. More complex cases like LEA/NOPs etc. might work but could be trickier.

Those are all based on feature flags, right? I was looking for some easy test to prove the scheduler is following the tune CPU.

@andreadb How well could we test this using the llvm-mca resource schedule tests?

# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -instruction-tables < %s | FileCheck %s
# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mtune=btver2 -instruction-tables < %s | FileCheck %s

@andreadb How well could we test this using the llvm-mca resource schedule tests?

# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=btver2 -instruction-tables < %s | FileCheck %s
# RUN: llvm-mca -mtriple=x86_64-unknown-unknown -mcpu=x86-64 -mtune=btver2 -instruction-tables < %s | FileCheck %s

I omitted -mtune command line support from this patch. The MC layer tools will always use -mcpu for the tune cpu.

andreadb added a comment. (Edited) · Aug 12 2020, 11:12 AM

@andreadb @RKSimon or @efriedma do any of you have suggestions for simple scheduler tests for this? I was hoping I could use -print-schedule like we used to but that no longer exists.

I remember the design of the -print-schedule functionality was a bit problematic because it had a layering violation (see PR37160).
The issue was introduced when support for printing scheduling info for inline assembly was added. The first version of print-schedule didn't have that problem though.

Not sure if it can help, but if your goal is to obtain latency and throughput information for every instruction, then you can pipe the output of llc into llvm-mca.

You can use MCA markers around the regions of code that you want to have analyzed by mca.

Example:

define void @vzeroupper(<4 x i64>* %x, <4 x i64>* %y) #0 {
  call void asm sideeffect "# LLVM-MCA-BEGIN vzeroupper","~{dirflag},~{fpsr},~{flags}"()
  %a = load <4 x i64>, <4 x i64>* %x
  %b = load <4 x i64>, <4 x i64>* %y
  %c = mul <4 x i64> %a, %b
  store <4 x i64> %c, <4 x i64>* %x
  call void asm sideeffect "# LLVM-MCA-END", "~{dirflag},~{fpsr},~{flags}"()
  ret void
}

If you now run the following command:

llc < my-vzeroupper-test.ll | llvm-mca -mcpu=skx -all-views=false -instruction-info

Then you should see something like this:

[0] Code Region - vzeroupper



Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      7     0.50    *                   vmovdqa       (%rdi), %ymm0
 4      22    1.50    *                   vpmullq       (%rsi), %ymm0, %ymm0
 2      1     1.00           *            vmovdqa       %ymm0, (%rdi)

Note however that mca doesn't allow you to specify different cpus for different code blocks. If you want to do something like that, then unfortunately you need to split your test into multiple files. You can still keep the tests in a single file at the cost of having multiple run lines, though.

That being said, you should be able to then use FileCheck and check latency/throughput values.
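For instance, a single-file split along those lines might look like this (the RUN lines and check prefixes are illustrative, not taken from an actual test; the llvm-mca flags match the invocation above):

```llvm
; RUN: llc -mtriple=x86_64-unknown-unknown < %s | \
; RUN:   llvm-mca -mcpu=btver2 -all-views=false -instruction-info | \
; RUN:   FileCheck %s --check-prefix=BTVER2
; RUN: llc -mtriple=x86_64-unknown-unknown < %s | \
; RUN:   llvm-mca -mcpu=skx -all-views=false -instruction-info | \
; RUN:   FileCheck %s --check-prefix=SKX
;
; The BTVER2/SKX check lines would then match the differing latency and
; throughput values that llvm-mca reports for each CPU.
```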

llvm/lib/MC/MCSubtargetInfo.cpp
183–188

Maybe it has already been asked before (apologies in case), but what if these features are not really compatible with the CPU?
What if, let's say, we have a crazy combination such as -mcpu=btver2 -mtune=skx?
Not that I expect people to write that sequence of options :-).

llvm/lib/Target/X86/X86Subtarget.cpp
235–236

Out of curiosity. Is there a reason why TuneCPU defaults to "generic" and not to the CPU strings (from line 233)?

@andreadb @RKSimon or @efriedma do any of you have suggestions for simple scheduler tests for this? I was hoping I could use -print-schedule like we used to but that no longer exists.

I remember the design of the -print-schedule functionality was a bit problematic because it had a layering violation (see PR37160).
The issue was introduced when support for printing scheduling info for inline assembly was added. The first version of print-schedule didn't have that problem though.

Not sure if it can help, but if your goal is to obtain latency and throughput information for every instruction, then you can pipe the output of llc into llvm-mca.

You can use MCA markers around the regions of code that you want to have analyzed by mca.

Example:

define void @vzeroupper(<4 x i64>* %x, <4 x i64>* %y) #0 {
  call void asm sideeffect "# LLVM-MCA-BEGIN vzeroupper","~{dirflag},~{fpsr},~{flags}"()
  %a = load <4 x i64>, <4 x i64>* %x
  %b = load <4 x i64>, <4 x i64>* %y
  %c = mul <4 x i64> %a, %b
  store <4 x i64> %c, <4 x i64>* %x
  call void asm sideeffect "# LLVM-MCA-END", "~{dirflag},~{fpsr},~{flags}"()
  ret void
}

If you now run the following command:

llc < my-vzeroupper-test.ll | llvm-mca -mcpu=skx -all-views=false -instruction-info

Then you should see something like this:

[0] Code Region - vzeroupper



Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      7     0.50    *                   vmovdqa       (%rdi), %ymm0
 4      22    1.50    *                   vpmullq       (%rsi), %ymm0, %ymm0
 2      1     1.00           *            vmovdqa       %ymm0, (%rdi)

Note however that mca doesn't allow you to specify different cpus for different code blocks. If you want to do something like that, then unfortunately you need to split your test into multiple files...

That being said, you should be able to then use FileCheck and check latency/throughput values.

What I'm trying to prove is that the scheduler used for pre/post-ra scheduling is using the scheduler model for the CPU name specified in the tune-cpu attribute. So invoking a separate tool with a command line doesn't help that goal.

llvm/lib/MC/MCSubtargetInfo.cpp
183–188

We're only taking feature bits like "slowUAMem16" from the tune cpu. Hopefully those bits aren't implemented in ways that are incompatible with flags for instruction legality.

llvm/lib/Target/X86/X86Subtarget.cpp
235–236

Probably not a good reason at the moment. But in the future tune=generic is going to be the tuning feature flags from the x86-64 or pentium4 CPU. So for llc purposes I'm likely going to have to pick something like "i386" as the tune CPU when the string is empty to avoid changing tests.

The -misched-print-dags approach seems to be doable, and not too unpretty.

If you want to try something else, I believe the FixupLEAPass is scheduler driven so you might be able to get some codegen diffs via that?

Cheers. Now I have a better understanding of why you wanted to check latencies.

If only MCSchedModel had an extra (debug only) field for the name, then it would have been easy to print it out from the machine-scheduler (either as part of -misched-print-dags, or from a new debug-only option).

I'm not sure if it was clear or not. I did implement the -misched-print-dags approach in test/CodeGen/X86/tune-cpu-schedule.ll using the latency of PMULLD, which I knew varies quite a bit on different CPUs. Happy to add more test cases if we want.

@efriedma did you have any more comments?

This revision is now accepted and ready to land. · Aug 14 2020, 2:45 PM