Not sure how much further I will take this, but I was bored and thought I'd take a stab.
References:
- https://www.realworldtech.com/bulldozer/5/
- https://www.agner.org/optimize/microarchitecture.pdf, sections 19.3 (bdver2), 18.4 (K10), 20.3 (Ryzen)
| Differential D75214
[MCA][WIP] Modelling CPU front-ent: Fetch stage/Instruction Byte Buffer unit/Decoder stage (PR42202)
Status: Changes Planned, Public
Authored by lebedev.ri on Feb 26 2020, 2:47 PM.
Unit Tests: Failed
Event Timeline

Hi Roman,

I think that we should further discuss this design in an RFC or on the bugzilla. For now, I consider this patch an interesting prototype (which presumably works for bdver2). However, a proper design will have to be more generic, and it would require more details. How much more detail is required really depends on how accurate the simulation should be.

In my opinion, processor models should be able to describe how decoders work via tablegen.
Depending on how accurate we want to be, we may also need to model some properties of (what AMD calls) the "Instruction Byte Buffer" (IBB). If we decide that we don't want to go to that level of detail, we still need to take into account that processors may implement loop caches.

The assumption that microcoded instructions always decode to more than 2 uOPs is a good default assumption. However, it would be nicer if processor models were able to override that quantity.

P.S.: if you want to accurately model frontend stalls caused by backpressure, then you need to use your pass in conjunction with the "MicroOpQueueStage" stage.

As a side note (not related to this patch), in terms of overall simulation: if we start adding more stages, then we should consider at some point whether to increase the number of default iterations.

Thanks for taking a look.

lebedev.ri retitled this revision from [MCA][WIP] Decoder stage (PR42202) to [MCA][WIP] Modelling CPU front-ent: Fetch stage/Instruction Byte Buffer unit/Decoder stage (PR42202).

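The point above about the microcode threshold can be illustrated with a minimal sketch. It assumes a hypothetical per-processor DecoderModel description; DecoderModel, MicrocodeUopThreshold, DecodePath and classifyDecodePath are made-up names for illustration and are not part of llvm-mca or of this patch. The default of 2 mirrors the "more than 2 uOPs" assumption, and a processor model could pass a different value.

```cpp
// Hypothetical sketch: none of these names exist in llvm-mca.
enum class DecodePath { FastPath, Microcoded };

struct DecoderModel {
  // Instructions that decode to MORE than this many micro-ops are treated as
  // microcoded. The default of 2 is the assumption discussed in the review.
  unsigned MicrocodeUopThreshold;

  constexpr explicit DecoderModel(unsigned Threshold = 2)
      : MicrocodeUopThreshold(Threshold) {}
};

// Pick the decode path for an instruction from its micro-op count.
constexpr DecodePath classifyDecodePath(const DecoderModel &Model,
                                        unsigned NumMicroOps) {
  return NumMicroOps > Model.MicrocodeUopThreshold ? DecodePath::Microcoded
                                                   : DecodePath::FastPath;
}
```

With the default model, a 3-uOP instruction is classified as microcoded; a model constructed as DecoderModel(4) would keep it on the fast path. In a real design this value would presumably come from the processor's tablegen description rather than from a constructor argument.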
I agree this may be useful, but I currently don't believe that to be a blocker here.

It may not be a blocker for your prototype. However, a proper design should allow the definition of a loop buffer.
The idea is to let users decide whether they want to simulate fetches from the loop cache or not. A new pipeline option (for example --simulate-loop-buffer, or something similar) could be implemented to enable that simulation. By default, the absence of that option would imply that the normal legacy decoders path is enabled during the entire simulation.
No problem. -Andrea
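
The opt-in behaviour described in the comment above could be sketched roughly as follows. This is only an illustration under stated assumptions: FrontendOptions, FetchSource and selectFetchSource are hypothetical names, not llvm-mca APIs, and the rules that the first iteration always goes through the decoders and that the loop body must fit in the buffer are assumptions of this sketch, not something the review specifies.

```cpp
// Hypothetical sketch: none of these names exist in llvm-mca.
enum class FetchSource { LegacyDecoders, LoopBuffer };

struct FrontendOptions {
  // Would be set by the proposed pipeline option (e.g. --simulate-loop-buffer).
  bool SimulateLoopBuffer;
  // Assumed loop buffer capacity in micro-ops; 0 means "no loop buffer".
  unsigned LoopBufferUops;

  constexpr FrontendOptions(bool Simulate = false, unsigned Capacity = 0)
      : SimulateLoopBuffer(Simulate), LoopBufferUops(Capacity) {}
};

// Decide where micro-ops for a given iteration of the simulated loop come from.
// Assumption: iteration 0 must use the decoders to fill the buffer, and only
// loop bodies that fit in the buffer can be replayed from it.
constexpr FetchSource selectFetchSource(const FrontendOptions &Opts,
                                        unsigned LoopBodyUops,
                                        unsigned Iteration) {
  return (Opts.SimulateLoopBuffer && Iteration > 0 &&
          LoopBodyUops <= Opts.LoopBufferUops)
             ? FetchSource::LoopBuffer
             : FetchSource::LegacyDecoders;
}
```

With default-constructed options the function always returns LegacyDecoders, which matches the suggestion that the legacy decoders path stays enabled for the entire simulation unless the user asks otherwise.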
Revision Contents
Diff 248407
llvm/include/llvm/MCA/CodeEmitter.h
llvm/include/llvm/MCA/HardwareUnits/InstructionBuffer.h
llvm/include/llvm/MCA/InstrBuilder.h
llvm/include/llvm/MCA/Instruction.h
llvm/include/llvm/MCA/Stages/DecodeStage.h
llvm/include/llvm/MCA/Stages/EntryStage.h
llvm/include/llvm/MCA/Stages/FetchStage.h
llvm/lib/MCA/CMakeLists.txt
llvm/lib/MCA/Context.cpp
llvm/lib/MCA/HardwareUnits/InstructionBuffer.cpp
llvm/lib/MCA/InstrBuilder.cpp
llvm/lib/MCA/Pipeline.cpp
llvm/lib/MCA/Stages/DecodeStage.cpp
llvm/lib/MCA/Stages/EntryStage.cpp
llvm/lib/MCA/Stages/FetchStage.cpp
llvm/test/tools/llvm-mca/X86/BdVer2/add-sequence.s
llvm/test/tools/llvm-mca/X86/BdVer2/clear-super-register-1.s
llvm/test/tools/llvm-mca/X86/BdVer2/clear-super-register-2.s
llvm/test/tools/llvm-mca/X86/BdVer2/clear-super-register-3.s
llvm/test/tools/llvm-mca/X86/BdVer2/dependency-breaking-cmp.s
llvm/test/tools/llvm-mca/X86/BdVer2/dependency-breaking-pcmpeq.s
llvm/test/tools/llvm-mca/X86/BdVer2/dependency-breaking-pcmpgt.s
llvm/test/tools/llvm-mca/X86/BdVer2/dependency-breaking-sbb-1.s
llvm/test/tools/llvm-mca/X86/BdVer2/dependency-breaking-sbb-2.s
llvm/test/tools/llvm-mca/X86/BdVer2/dependent-pmuld-paddd.s
llvm/test/tools/llvm-mca/X86/BdVer2/dot-product.s
llvm/test/tools/llvm-mca/X86/BdVer2/hadd-read-after-ld-1.s
llvm/test/tools/llvm-mca/X86/BdVer2/hadd-read-after-ld-2.s
llvm/test/tools/llvm-mca/X86/BdVer2/int-to-fpu-forwarding-1.s
llvm/test/tools/llvm-mca/X86/BdVer2/int-to-fpu-forwarding-2.s
llvm/test/tools/llvm-mca/X86/BdVer2/int-to-fpu-forwarding-3.s
llvm/test/tools/llvm-mca/X86/BdVer2/load-store-alias.s
llvm/test/tools/llvm-mca/X86/BdVer2/load-store-throughput.s
llvm/test/tools/llvm-mca/X86/BdVer2/load-throughput.s
llvm/test/tools/llvm-mca/X86/BdVer2/memcpy-like-test.s
llvm/test/tools/llvm-mca/X86/BdVer2/one-idioms.s
llvm/test/tools/llvm-mca/X86/BdVer2/partial-reg-update-2.s
llvm/test/tools/llvm-mca/X86/BdVer2/partial-reg-update-3.s
llvm/test/tools/llvm-mca/X86/BdVer2/partial-reg-update-4.s
llvm/test/tools/llvm-mca/X86/BdVer2/partial-reg-update-5.s
llvm/test/tools/llvm-mca/X86/BdVer2/partial-reg-update-6.s
llvm/test/tools/llvm-mca/X86/BdVer2/partial-reg-update.s
llvm/test/tools/llvm-mca/X86/BdVer2/pipes-fpu.s
llvm/test/tools/llvm-mca/X86/BdVer2/pr37790.s
llvm/test/tools/llvm-mca/X86/BdVer2/rank.s
llvm/test/tools/llvm-mca/X86/BdVer2/rcu-statistics.s
llvm/test/tools/llvm-mca/X86/BdVer2/read-advance-1.s
llvm/test/tools/llvm-mca/X86/BdVer2/read-advance-2.s
llvm/test/tools/llvm-mca/X86/BdVer2/read-advance-3.s
llvm/test/tools/llvm-mca/X86/BdVer2/reg-move-elimination-1.s
llvm/test/tools/llvm-mca/X86/BdVer2/reg-move-elimination-2.s
llvm/test/tools/llvm-mca/X86/BdVer2/reg-move-elimination-3.s
llvm/test/tools/llvm-mca/X86/BdVer2/reg-move-elimination-4.s
llvm/test/tools/llvm-mca/X86/BdVer2/reg-move-elimination-5.s
llvm/test/tools/llvm-mca/X86/BdVer2/register-files-1.s
llvm/test/tools/llvm-mca/X86/BdVer2/register-files-2.s
llvm/test/tools/llvm-mca/X86/BdVer2/register-files-3.s
llvm/test/tools/llvm-mca/X86/BdVer2/register-files-4.s
llvm/test/tools/llvm-mca/X86/BdVer2/register-files-5.s
llvm/test/tools/llvm-mca/X86/BdVer2/scheduler-queue-usage.s
llvm/test/tools/llvm-mca/X86/BdVer2/simple-test.s
llvm/test/tools/llvm-mca/X86/BdVer2/store-throughput.s
llvm/test/tools/llvm-mca/X86/BdVer2/vbroadcast-operand-latency.s
llvm/test/tools/llvm-mca/X86/BdVer2/vec-logic-read-after-ld-1.s
llvm/test/tools/llvm-mca/X86/BdVer2/vec-logic-read-after-ld-2.s
llvm/test/tools/llvm-mca/X86/BdVer2/xop-super-registers-1.s
llvm/test/tools/llvm-mca/X86/BdVer2/xop-super-registers-2.s
llvm/test/tools/llvm-mca/X86/BdVer2/zero-idioms-avx-256.s
llvm/test/tools/llvm-mca/X86/BdVer2/zero-idioms.s
llvm/test/tools/llvm-mca/X86/bextr-read-after-ld.s
llvm/test/tools/llvm-mca/X86/cpus.s
llvm/test/tools/llvm-mca/X86/read-after-ld-1.s
llvm/test/tools/llvm-mca/X86/scheduler-queue-usage.s
llvm/test/tools/llvm-mca/X86/sqrt-rsqrt-rcp-memop.s
llvm/test/tools/llvm-mca/X86/variable-blend-read-after-ld-1.s
llvm/test/tools/llvm-mca/X86/variable-blend-read-after-ld-2.s
llvm/tools/llvm-mca/llvm-mca.cpp
|