Page MenuHomePhabricator

AMD K10 (Barcelona) Initial Scheduler model
Needs ReviewPublic

Authored by lebedev.ri on Jun 20 2019, 3:22 PM.

Details

Summary

Following in footsteps of not-total-failure of the first schedule profile
produced with the new next-gen tools like llvm-exegesis (BdVer2 (Piledriver), D52779),
let's continue the trend.

Here i present the scheduler model for AMD K10 CPU.
Much like with D52779, this is pretty much exclusively based on llvm-exegesis,
with some cross-referencing with some other sources (Agner, AMD SOG).

The time doesn't stand still, llvm-exegesis saw many improvements since
i've originally started looking into BdVer2 model, so this time around it took
not 6 months but just ~3 weeks :S

This is by no means perfect. But i'm actually going to claim that it is in
a comparable or better shape than the BdVer2 model was at the time of commit.

I didn't do any really heavy benchmarking, but i did use "internal" public real-world benchmark
i understand - RawSpeed raw image decoding library.

AggregateWall time changeTotal CPU time change
min-18.4600%-15.7800%
max6.9000%6.8900%
median-0.9050%-0.8950%
average-0.8623%-0.7703%
sqrt(sum(x^2))0.26240.2442
cbrt(sum(x^3))-0.1775-0.1479



So looks like an improvement overall, with some regressions, not unexpectedly.

WARNING: i did not mess with sched model to make numbers good.

Some further cleanup will be needed for sure, i've put my eye on some inconsistencies in sched model,
but i don't feel like cleaning them up right away via InstRW because that will relieve immediate pressure
and may delay proper cleanup via splitting SchedWrite classes.

I shouldn't promise anything, but i'm hoping to look into AMD K8 afterwards.

Diff Detail

Event Timeline

lebedev.ri created this revision.Jun 20 2019, 3:22 PM

Amazing work Roman ! A few nits on the form, I'll let others more knowledgeable about AMD architectures comment on the model itself.

include/llvm/TableGen/Record.h
1271 ↗(On Diff #205908)

llvm_unreachable ?

lib/Target/X86/X86ScheduleBarcelona.td
17 ↗(On Diff #205908)

* Measurements from llvm-exegesis ?

760 ↗(On Diff #205908)

Does K10 have any zeroing/dependency breaking idioms ? Maybe add a FIXME to implement them.

lebedev.ri marked 2 inline comments as done.Jun 21 2019, 1:28 AM
lebedev.ri added inline comments.
include/llvm/TableGen/Record.h
1271 ↗(On Diff #205908)

I'll split this int a separate patch

lib/Target/X86/X86ScheduleBarcelona.td
17 ↗(On Diff #205908)

Yeah, i should add that :)

760 ↗(On Diff #205908)

Well, it's a good question.
Agner says it does, but unless i'm doing something horribly wrong even the trivial xor %eax, %eax doesn't work.
So i've left it for later.

RKSimon added inline comments.Jun 21 2019, 1:35 AM
lib/Target/X86/X86ScheduleBarcelona.td
17 ↗(On Diff #205908)

In which case please can you add the pfm mappings to X86PfmCounters.td

javed.absar added inline comments.Jun 21 2019, 1:38 AM
lib/Target/X86/X86ScheduleBarcelona.td
185 ↗(On Diff #205908)

You can use 'listA # listB' to concatenate to improve readability

lebedev.ri marked 8 inline comments as done.

Addressed review notes.

(forgot to submit inline comments)

lib/Target/X86/X86ScheduleBarcelona.td
185 ↗(On Diff #205908)

There isn't that much TableGen stuff here, i think keeping it as explicit as possible is best..

760 ↗(On Diff #205908)

I've added a FIXME note for now.

Rather pedantic, but shouldn't the model be called K10 or Fam10 instead of Barcelona, which was just one of the chips in the series?

This comment was removed by lebedev.ri.

Rather pedantic, but shouldn't the model be called K10 or Fam10 instead of Barcelona, which was just one of the chips in the series?

I'm not sure, actually.

This seemed more consistent, because e.g. there are -mcpu=bdver[1-4],
with bd being bulldozer which was the first chip in the 15h series.
Those aren't -mcpu=k15ver[1-4], and not -mcpu={bulldozer,piledriver,steamroller,excavator},
and following the consistency logic that is already inconsistent, and latter would be
even more confusing since bulldozer would then refer both to the 15h series
and just a single model in the series, with all other models being different.

This isn't the case here, there is no -mcpu=k10ver2;
-mcpu=fam10h and -mcpu=barcelona are synonyms in X86.td
(i don't know if latter chips actually *should* be separate)

So i'm not sure, i can totally rename, but TO ME the current scheme follows preexisting pattern.

Hi Roman,

thanks for the patch. I left some comments below.

Overall, the patch looks good.
It is okay to contribute improvements to zero-idioms/dep-breaking instructions in a later patch.

lib/Target/X86/X86ScheduleBarcelona.td
24 ↗(On Diff #206044)

You can safely remove this line.
The default value of LoopMicroOpBufferSize is already -1.

45–48 ↗(On Diff #206044)

I think that your definition of RetireControlUnit is fine.
The RCU is definitely 72 entries. It internally tracks macro-ops contributed to each pipeline. The fact that instructions are distributed among three different pipes after Fetch shouldn't really affect the description of the RCU in LLVM (or require future any changes in future). In practice, you can assume a uniform distribution of decoded ops on the three pipelines. Basically, I don't think that you need a FIXME here.

286 ↗(On Diff #206044)

FIXME?

305 ↗(On Diff #206044)

Maybe clarify what there is to FIX.
Did you verify that NOPs consume an ALU slot?

Out of curiosity, did you investigate on why three benchmarks show a 6% slowdown?

According to Agner, the selection of floating point pipes done by the FPU is sub-optimal and can lead to bottlenecks that are difficult to predict.
There is another important source of bottlenecks to keep into account. It looks like that there is only one result bus per execution unit.
That forces schedulers to artificially delay opcodes to avoid bus conflicts. This is especially true for code that mixes instructions with different latencies.
Agner describes this in section "Mixing instructions with different latency".

We don't model those hazards in the scheduling model.
I wonder if the sequence of instructions computed by the post-RA machine scheduler for those benchmarks incurred in one of those bottlenecks.

Rather pedantic, but shouldn't the model be called K10 or Fam10 instead of Barcelona, which was just one of the chips in the series?

I'm not sure, actually.

This seemed more consistent, because e.g. there are -mcpu=bdver[1-4],
with bd being bulldozer which was the first chip in the 15h series.
Those aren't -mcpu=k15ver[1-4], and not -mcpu={bulldozer,piledriver,steamroller,excavator},
and following the consistency logic that is already inconsistent, and latter would be
even more confusing since bulldozer would then refer both to the 15h series
and just a single model in the series, with all other models being different.

This isn't the case here, there is no -mcpu=k10ver2;
-mcpu=fam10h and -mcpu=barcelona are synonyms in X86.td
(i don't know if latter chips actually *should* be separate)

So i'm not sure, i can totally rename, but TO ME the current scheme follows preexisting pattern.

Out of curiosity:
GCC uses amdfam10 as CPU name for flags -march and -mtune.

I am not suggesting that we should pedantically follow gcc's naming convention.
So far, our cpu names for AMD processors have always matched gcc names. Not sure if that was done intentionally though.
I tend to agree with Simon that k10, or amdfam10(h) is probably a better name. However I don't have a strong opinion about it.

Rather pedantic, but shouldn't the model be called K10 or Fam10 instead of Barcelona, which was just one of the chips in the series?

I'm not sure, actually.

This seemed more consistent, because e.g. there are -mcpu=bdver[1-4],
with bd being bulldozer which was the first chip in the 15h series.
Those aren't -mcpu=k15ver[1-4], and not -mcpu={bulldozer,piledriver,steamroller,excavator},
and following the consistency logic that is already inconsistent, and latter would be
even more confusing since bulldozer would then refer both to the 15h series
and just a single model in the series, with all other models being different.

This isn't the case here, there is no -mcpu=k10ver2;
-mcpu=fam10h and -mcpu=barcelona are synonyms in X86.td
(i don't know if latter chips actually *should* be separate)

So i'm not sure, i can totally rename, but TO ME the current scheme follows preexisting pattern.

Out of curiosity:
GCC uses amdfam10 as CPU name for flags -march and -mtune.

I am not suggesting that we should pedantically follow gcc's naming convention.
So far, our cpu names for AMD processors have always matched gcc names. Not sure if that was done intentionally though.
I tend to agree with Simon that k10, or amdfam10(h) is probably a better name. However I don't have a strong opinion about it.

Please note that this naming dilemma changes exactly nothing from user side.
It does not affect -mcpu=??? choices, it does not affect -march=??? choices,
it is only about internal naming for the directory where the mca tests will be located,
and for the X86ScheduleBarcelona.td filename.

Wanted to add, it is actually *MORE* correct to stick with Barcelona here,
because unlike the bdver*/znver*/..., where it is known that the sched models
are ideally different, but maybe the same is always used due to the lack of standalone model,
i(?) don't know that for this case, i didn't verify that every single chip in K10 microarch
is identical to the first one, for which this model is.

So i would personally say that, while it is obvious that this sched model is
being used for all the K10 chips, it is for barcelona specifically, and that should be noted.

This isn't new - bdver2 model is being used for bdver1 too, even though they are different
and that model is named bdver2, not fam15h, the generic model is named "<some intel cpu model>"
(sorry don't recall), not <generic cpu model>.

Again, can totally rename, but i'm not sure that will be better/more consistent.

I think I've commented on this before - but why did you pre-commit the test/tools/llvm-mca/X86/Barcelona/* files since adding this model completely overwrites them?

I think I've commented on this before

Oh? I don't remember any such feedback.. Good to know i guess.

  • but why did you pre-commit the test/tools/llvm-mca/X86/Barcelona/* files since adding this model completely overwrites them?

Two-fold:

  1. previously clang's -march=barcelonawould result in what's on the LHS of the diff, and it would now result in the RHS, so the diff does show the *change*; and
  2. just easing trying to avoid all the usual pitfalls of git rebase and having to re-come up with proper test file set should things go badly (which IIRC i had to do with piledriver)
lebedev.ri marked an inline comment as done.Jun 24 2019, 3:48 PM

Thanks for taking a look!

Out of curiosity, did you investigate on why three benchmarks show a 6% slowdown?

Actually no, i didn't look at that as of this moment.
It doesn't appear to be noise, but those 4 tests are essentially covering
the same codepath, so the reason will be the same for all of them.

According to Agner, the selection of floating point pipes done by the FPU is sub-optimal and can lead to bottlenecks that are difficult to predict.
...

We don't model those hazards in the scheduling model.
I wonder if the sequence of instructions computed by the post-RA machine scheduler for those benchmarks incurred in one of those bottlenecks.

It's certainly not a FPU cluster problem since all those tests are integer (like i said, i didn't do through benchmark...)
What's the suggestion there, rebenchmarking with let PostRAScheduler = 0; ?

I think I've commented on this before

Oh? I don't remember any such feedback.. Good to know i guess.

  • but why did you pre-commit the test/tools/llvm-mca/X86/Barcelona/* files since adding this model completely overwrites them?

Let me rephraze, should have i been aware of any such feedback i certainly would not have precommitted them,
but kept up with those struggles. Just for the sake of completeness, where did i miss it?

lib/Target/X86/X86ScheduleBarcelona.td
286 ↗(On Diff #206044)

This one is just a leftover.

305 ↗(On Diff #206044)

It certainly does as per Agner's instruction_tables.pdf
There is certainly a hit of a 1 on retired_uops.
There isn't any IssueCounters for integer pipelines,
so there is no direct way to check integer unit consumption i'm afraid :(
That leaves us with trying to manually measure throughput
of some high-throughput instruction that can go to any int pipe,
in the presence of NOP, i suppose (i actually suggested that as way for
exegesis to indirectly measure unit consumption). I did not do that yet.

Thanks for taking a look!

Out of curiosity, did you investigate on why three benchmarks show a 6% slowdown?

Actually no, i didn't look at that as of this moment.
It doesn't appear to be noise, but those 4 tests are essentially covering
the same codepath, so the reason will be the same for all of them.

No problem, it was just a curiosity of mine.

It's certainly not a FPU cluster problem since all those tests are integer (like i said, i didn't do through benchmark...)
What's the suggestion there, rebenchmarking with let PostRAScheduler = 0; ?

It would be an interesting follow-up investigation to do. However, it shouldn't be a blocker for this patch.

lib/Target/X86/X86ScheduleBarcelona.td
305 ↗(On Diff #206044)

Thanks for the info. If Agner says that it consumes 1cy of ALU pipe then it is fine.

Err, i'm getting increasingly confused by the day here :))
It might be best to list what *is* a blocker, not what isn't.

lebedev.ri marked 3 inline comments as done and an inline comment as not done.Jun 29 2019, 2:58 PM
lebedev.ri added inline comments.
lib/Target/X86/X86ScheduleBarcelona.td
24 ↗(On Diff #206044)

Isn't it better to be specific, since we know the answer (no loop buffer), instead of relying on whatever default there is currently?

45–48 ↗(On Diff #206044)

Okay, thank you for a second opinion. I do think it's okay-ish for it's current purposes.
I'll change it to NOTE, since i'd think it would be good to still record such info.

Now that 9 is out of the way, any new comments?

RKSimon added inline comments.Aug 4 2019, 6:41 AM
include/llvm/TableGen/Record.h
1271 ↗(On Diff #205908)

Split this off? Also, the llvm_unreachable message needs improvement to be useful.

lib/Target/X86/X86.td
805 ↗(On Diff #206044)

Can this feature list refactor be split off as well?

lebedev.ri updated this revision to Diff 213535.Aug 6 2019, 1:43 AM

Rebased, split off D65790/D65791.
Thanks for taking a look!

@lebedev.ri Did you ever get anywhere nailing down the benchmark regressions? Do you have any exegesis latency/uop/rtp reports you can share?

@lebedev.ri Did you ever get anywhere nailing down the benchmark regressions?

Uhm, i got nowhere with that because i didn't begin looking since in
previous comments i was asking what is a blocker and what's not to no avail :)
I have an unfounded suspicion that it will be code alignment.

Do you have any exegesis latency/uop/rtp reports you can share?

Sure, will post.