This is an archive of the discontinued LLVM Phabricator instance.

[MCA][WIP] Modelling CPU front-end: Fetch stage/Instruction Byte Buffer unit/Decoder stage (PR42202)
Changes Planned · Public

Authored by lebedev.ri on Feb 26 2020, 2:47 PM.

Details

Reviewers
andreadb
RKSimon
Summary

Not sure how much further I will take this, but I was bored and thought I'd take a stab...

References:

https://bugs.llvm.org/show_bug.cgi?id=42202

Diff Detail

Unit Tests: Failed

Event Timeline

lebedev.ri created this revision. Feb 26 2020, 2:47 PM

Hi Roman,

I think that we should further discuss this design in an RFC or on the bugzilla.

For now, I consider this patch an interesting prototype (which presumably works for bdver2). However, a proper design will have to be more generic, and it would require more details. How many more details are required really depends on how accurate the simulation should be.

In my opinion, processor models should be able to describe how decoders work via tablegen.
For example, a target should be able to declare (see the sketch after this list):

  • the number of available decoders
  • the features of each decoder
    • The "maximum number of bytes" that a decoder can peek from a byte window during a cycle).
    • How many uOp can be generated in a cycle; etc.
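To make that concrete, here is a minimal sketch of what such declarations could boil down to, written as plain C++ rather than the eventual tablegen classes. DecoderDescriptor and FrontendDescriptor are hypothetical names; nothing like this exists in MCA today.

    #include "llvm/ADT/SmallVector.h"

    // Hypothetical sketch: the kind of per-decoder information a processor
    // model could declare (shown in C++ purely for illustration).
    struct DecoderDescriptor {
      unsigned MaxBytesPerCycle; // bytes the decoder can peek from a byte window per cycle
      unsigned MaxUOpsPerCycle;  // micro-ops the decoder can generate per cycle
    };

    struct FrontendDescriptor {
      llvm::SmallVector<DecoderDescriptor, 4> Decoders; // number and features of the decoders
    };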

Depending on how accurate we want to be, we may also need to model some properties of (what AMD calls) the "Instruction Byte Buffer" (IBB).
An accurate simulation requires that the decoder stage keeps track of which instruction byte window is active during a cycle, and which byte offset should be used by the decoders (that is, the offset from the last successfully decoded instruction). Without that knowledge we lose some accuracy (i.e. we don't accurately model the throughput from the decoders).
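A rough sketch of that bookkeeping, assuming a fixed-size fetch window; ByteWindowState is a hypothetical helper written for this comment, not code from the patch.

    #include <cstdint>

    // Hypothetical IBB-style bookkeeping: remember which byte window is
    // active and the offset just past the last successfully decoded
    // instruction within it.
    struct ByteWindowState {
      uint64_t WindowStart = 0;  // address of the active byte window
      unsigned WindowSize = 16;  // bytes per window (model-dependent)
      unsigned DecodeOffset = 0; // offset of the next byte to decode

      // Try to decode an instruction of InstrBytes bytes from the current
      // window. Returns false if it does not fit, i.e. decoding stalls
      // until the next window becomes available.
      bool consume(unsigned InstrBytes) {
        if (DecodeOffset + InstrBytes > WindowSize)
          return false;
        DecodeOffset += InstrBytes;
        return true;
      }

      // Move to the next byte window (e.g. at the start of the next cycle).
      void advanceWindow() {
        WindowStart += WindowSize;
        DecodeOffset = 0;
      }
    };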

If we decide that we don't want to go to that level of detail, we still need to take into account that processors may implement loop caches.
MCA should allow users to specify whether they want to simulate fetches from the instruction cache or from a hardware loop buffer (if one is available at the decode stage). The latter would provide a different throughput, and it would also be subject to different limitations than the decoders. I understand that this may not be useful for bdver2 (or btver2 FWIW). However, it would be useful for pretty much all modern Intel processors, and for Zen.
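To illustrate the distinction, a toy sketch; the names and fields are made up for this comment and are not taken from MCA or any target model.

    // Hypothetical description of the two fetch paths: the legacy decoders
    // fed from the instruction cache, and an optional hardware loop buffer
    // that bypasses the decoders with its own throughput and size limits.
    enum class FetchSource { InstructionCache, LoopBuffer };

    struct FetchPathLimits {
      unsigned BytesPerCycle;   // fetch bandwidth of this path
      unsigned MaxUOpsPerCycle; // micro-ops this path can deliver per cycle
      unsigned Capacity;        // e.g. loop buffer size in micro-ops (0 = unlimited)
    };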

The assumption that microcoded instructions always decode to more than 2 uOPs is a reasonable default. However, it would be nicer if processor models were able to override that quantity.
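For instance (hypothetical field name, only to show where the override would live):

    // A processor model could override the threshold instead of relying on
    // the hard-coded assumption that "microcoded" means more than 2 uOPs.
    struct FrontendTuning {
      unsigned MicrocodedUOpThreshold = 2; // instructions above this take the microcode path
    };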

P.s.: if you want to accurately model frontend stalls caused by backpressure, then you need to use your pass in conjunction with the "MicroOpQueueStage" stage.
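For example, the new frontend stages could be appended in front of the existing MicroOpQueueStage roughly like this. MyDecodeStage is a placeholder for the stage added by this patch, and the constructor arguments are simplified, so treat it as a sketch rather than final wiring.

    #include "llvm/MCA/Pipeline.h"
    #include "llvm/MCA/Stages/MicroOpQueueStage.h"
    #include <memory>

    // Sketch: place the proposed fetch/decode stages in front of the
    // micro-op queue so that a full queue back-pressures the simulated
    // frontend.
    void addFrontendStages(llvm::mca::Pipeline &P, unsigned MicroOpQueueSize) {
      // P.appendStage(std::make_unique<MyDecodeStage>(/*...*/)); // proposed stage
      P.appendStage(std::make_unique<llvm::mca::MicroOpQueueStage>(MicroOpQueueSize));
    }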

As a side note (not related to this patch), in terms of overall simulation: if we start adding more stages, then we should consider at some point whether to increase the default number of iterations.

Thanks for taking a look.
Indeed, this is nowhere near review/integration ready,
i was hoping the patch's laid-back description
and the sheer amount of TODO/FIXME comments in the code made that obvious :)

lebedev.ri planned changes to this revision. Feb 27 2020, 9:29 AM
lebedev.ri updated this revision to Diff 248407. Mar 5 2020, 1:18 AM
lebedev.ri retitled this revision from [MCA][WIP] Decoder stage (PR42202) to [MCA][WIP] Modelling CPU front-end: Fetch stage/Instruction Byte Buffer unit/Decoder stage (PR42202).
lebedev.ri planned changes to this revision. Mar 5 2020, 1:18 AM

we still need to take into account that processors may implement loop caches.

I agree this may be useful, but I currently don't believe it to be a blocker here.
We currently don't model that, and since we don't model loops at all,
it would be a whole new user-activatable mode.
I'm not sure it should be implemented in this very patch.

andreadb added a comment (edited). Mar 8 2020, 5:10 AM

we still need to take into account that processors may implement loop caches.

I agree this may be useful, but I currently don't believe it to be a blocker here.

It may not be a blocker for your prototype. However, a proper design should allow the definition of a loop buffer.

We currently don't model that, and since we don't model loops at all,
it would be a whole new user-activatable mode.

The idea is to let users decide whether they want to simulate fetches from the loop cache or not. A new pipeline option (for example --simulate-loop-buffer, or something similar) could be implemented to enable that simulation. By default, i.e. in the absence of that option, the normal legacy decoders path would be used for the entire simulation.
So, I am not sure I understand what you mean by "we don't model loops at all".
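Back to the option itself: to make the proposal concrete, it could be declared in the llvm-mca driver along these lines. The flag does not exist yet; the name is only the one suggested above.

    #include "llvm/Support/CommandLine.h"

    // Proposed (not yet existing) llvm-mca option: when set, simulate fetches
    // from the hardware loop buffer instead of the legacy decoders path.
    static llvm::cl::opt<bool> SimulateLoopBuffer(
        "simulate-loop-buffer",
        llvm::cl::desc("Simulate fetches from the hardware loop buffer "
                       "(if the target model defines one)"),
        llvm::cl::init(false));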

I'm not sure it should be implemented in this very patch.

No problem.
Personally, I still consider this patch "something not for review". I still want to see a proper RFC for this, where we can discuss requirements, etc.

-Andrea