This is an archive of the discontinued LLVM Phabricator instance.

[MCA] Use LSU for the in-order pipeline
ClosedPublic

Authored by asavonic on Jun 9 2021, 4:52 AM.

Details

Summary

The load/store unit (LSU) is used to enforce the order of loads and stores
if they may alias (controlled by the --noalias=false option).

This model is not very accurate though - Cortex-A55 hardware still
shows quite different results in comparison with MCA.

See PR50483 - [MCA] In-order pipeline doesn't track memory load/store dependencies.

Event Timeline

asavonic created this revision. Jun 9 2021, 4:52 AM
asavonic requested review of this revision. Jun 9 2021, 4:52 AM
Herald added a project: Restricted Project. Jun 9 2021, 4:52 AM
asavonic edited the summary of this revision. Jun 9 2021, 4:53 AM

Thanks Andrew.

llvm/lib/MCA/Stages/InOrderIssueStage.cpp
53–54

This check is harmless, but completely redundant for an in-order processor.
It should never fail in practice because load/store queues are not really modelled by in-order processors.
In an in-order processor, the dispatch event coincides with the issue event, so there is no need for queueing loads/stores.
So, we should always ignore the presence of queues and I think you can safely get rid of that check.

167–171

I don't think that this should be an else-if.
It is better to always test this condition at the end of the if-then-else chain if StallCycles is still zero.

I suspect this is the reason why you get some out-of-order execution in the test that you have added.
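
To illustrate the concern, here is a minimal sketch of how an else-if can hide the memory check. It assumes that an earlier branch of the chain can be entered without producing a stall. The struct, its fields and both functions are hypothetical and are not the code in InOrderIssueStage.cpp.

```cpp
// Hypothetical, simplified issue state; not the real llvm-mca data structures.
struct IssueState {
  bool RegisterHazard;         // a register operand is not ready yet
  bool ResourcesBusy;          // required pipeline resources are unavailable
  unsigned LastWriteBackCycle; // nonzero while an older write is in flight
  unsigned NextWriteBackCycle; // when the candidate would write back
  bool MemoryNotReady;         // stand-in for the LSU ordering check
};

// Variant A: the memory check is the last else-if of the chain.
unsigned stallWithElseIf(const IssueState &S) {
  unsigned StallCycles = 0;
  if (S.RegisterHazard) {
    StallCycles = 1;
  } else if (S.ResourcesBusy) {
    StallCycles = 1;
  } else if (S.LastWriteBackCycle) {
    // This branch can be taken without setting StallCycles.
    if (S.NextWriteBackCycle < S.LastWriteBackCycle)
      StallCycles = S.LastWriteBackCycle - S.NextWriteBackCycle;
  } else if (S.MemoryNotReady) {
    StallCycles = 1; // skipped whenever LastWriteBackCycle is nonzero
  }
  return StallCycles;
}

// Variant B: the memory check always runs if nothing stalled so far.
unsigned stallWithFinalCheck(const IssueState &S) {
  unsigned StallCycles = 0;
  if (S.RegisterHazard) {
    StallCycles = 1;
  } else if (S.ResourcesBusy) {
    StallCycles = 1;
  } else if (S.LastWriteBackCycle) {
    if (S.NextWriteBackCycle < S.LastWriteBackCycle)
      StallCycles = S.LastWriteBackCycle - S.NextWriteBackCycle;
  }
  if (StallCycles == 0 && S.MemoryNotReady)
    StallCycles = 1;
  return StallCycles;
}
```

With an older write in flight that does not force a delay (for example LastWriteBackCycle = 5 and NextWriteBackCycle = 7) and a pending memory dependency, variant A reports no stall while variant B stalls.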

llvm/test/tools/llvm-mca/AArch64/Cortex/A55-load-store-noalias.s
97–98

This doesn't look correct.
Any idea why these instructions are executed out of order?
Edit: I think it might be due to the LSU check you have added (see my other comment).

asavonic added inline comments. Jun 9 2021, 6:59 AM
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-load-store-noalias.s
97–98

This is because the store instruction has no writes (Instruction::getDefs returns an empty list), so findLastWriteBackCycle returns 0.
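
A rough sketch of why an empty list of defs gives 0, assuming the write-back cycle is computed as a maximum over the register writes of an instruction. The types and the function name below are hypothetical stand-ins, not the actual findLastWriteBackCycle implementation.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for a register write of an instruction.
struct WriteSketch {
  unsigned CyclesLeft; // cycles until this write is written back
};

// Latest write-back cycle over all writes; 0 if there are no writes at all.
unsigned lastWriteBackCycleSketch(const std::vector<WriteSketch> &Defs) {
  unsigned Last = 0;
  for (const WriteSketch &W : Defs)
    Last = std::max(Last, W.CyclesLeft);
  return Last;
}
```

Under this assumption, a store with no defs never enters the loop, so it looks as if it writes back immediately and is not held back by the in-order write-back check.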

This model is not very accurate though - Cortex-A55 hardware still
shows quite different results in comparison with MCA.

Accurate static simulation of memory operations is very difficult to achieve in practice.
Predicting whether a load would effectively alias another store, or even predicting whether a memory access would
hit a specific cache level is hard to do ahead of time. Sometimes, it is just not possible due to the lack of
information (which can only be obtained at runtime). So, the inability to accurately predict the latency and aliasing of
memory accesses will always be a big source of inaccuracy in general.

In the case of llvm-mca, there are several limiting factors. Most of those fall under the following two categories:

  1. The llvm scheduling model simply doesn't provide enough information for llvm-mca to simulate the memory subsystem.
  2. MCInst is a (too) flat/simple representation, and it doesn't provide enough information about memory operations.

About 1.
There is no knowledge about which caches are available in hardware (i.e. memory cache hierarchy, store buffers, TLB caches, etc.).
Since there is no cache (at least, from the llvm-mca point of view), there is only one possible "latency" value for every write.
For loads, most models tend to encode an optimistic "load-to-use latency" in the write latency itself.
There is no way to use different latency values if the value is believed to miss the L1. Most of the time, the
"optimistic load-to-use latency" assumes a HIT in the L1.

We could introduce special annotations (like metadata, or llvm-mca comments) to describe the
"probability of hitting a different cache level". We could then use that knowledge in conjunction with a more accurate tablegen
description of the memory hierarchy.
This is just an idea: it might improve the simulation, at the cost of adding more complex abstractions. There may already be a PR for this.
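
As a purely illustrative sketch of that idea: one possible use of such annotations would be to replace the single optimistic latency with an expected latency weighted by hit probabilities. Nothing like this exists in llvm-mca or in the scheduling models today; the struct, the function and the numbers are all hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical description of one level of the memory hierarchy.
struct CacheLevelSketch {
  double HitProbability; // assumed to come from a user annotation
  unsigned Latency;      // load-to-use latency when the access hits here
};

// Expected load-to-use latency over all levels.
// The probabilities are required to sum to 1.
double expectedLoadLatency(const std::vector<CacheLevelSketch> &Levels) {
  double Total = 0.0;
  double Expected = 0.0;
  for (const CacheLevelSketch &L : Levels) {
    Total += L.HitProbability;
    Expected += L.HitProbability * L.Latency;
  }
  assert(std::fabs(Total - 1.0) < 1e-9 && "probabilities must sum to 1");
  return Expected;
}
```

For example, with a made-up 90% chance of a 3-cycle L1 hit and a 10% chance of a 13-cycle L2 hit, the expected latency is about 4 cycles instead of the optimistic 3.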

More generally: llvm-mca doesn't know about memory types. It assumes that all memory is cacheable. The LSU rules work quite well for WB
(and even write-through) memory. Non-cacheable memory would be subject to different latencies, and stores might be
subject to so-called "write combining". For simplicity, llvm-mca assumes that all stores are cacheable, so there is no
attempt at modelling the WC logic in HW.

For in-order processors, not being able to model store buffers may still be fine.
After all (at least in theory) there is no reason why stores should be delayed. I expect stores to be immediately committed.
It also means that we don't need to worry about modelling things like STLF (store-to-load forwarding).
The lack of STLF prediction is one of the bigger sources of inaccuracy when simulating memory intensive kernels
on OoO processors.

About 2.
One big difference between MCInst and MachineInstr, is that MCInst doesn't carry any information about memory accesses.
MCInst was designed as a simpler intermediate for integrated assemblers and disassemblers. It was not meant to be used
to implement complex data-flow analysis. So its structure is pretty flat by design.

For MachineInstr, MachineMemOperand instances can be used to infer aliasing properties on loads/stores etc.
We don't have those operands for MCInst, so - even if we wanted to - we cannot implement a greedy symbolic alias analysis to infer
which loads may-alias which stores.

Depending on the value of the --noalias flag, we either always assume "may-alias" or "no-alias".
The default (i.e. --noalias=true) is what is optimistically used by llvm-mca. It may also be the main reason why you see a lot
of errors in your measurements. Keep in mind, though, that this is just one of the (many) sources of
inaccuracy (as already described in my point 1.).
Let's say that --noalias is a good "default" for things like memcpy-like patterns.

Hi Andrew,
are there any updates on this code review?

Thanks,
-Andrea

Sorry for the delay. I plan to update the patch later this week.

No problem, take your time :-)

dmgreen added a subscriber: NickGuy. Jul 1 2021, 3:06 AM
dmgreen added inline comments.
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-load-store-noalias.s
95–96

I think I would expect most CPUs to work like this, whether the addresses alias or not :)

andreadb added inline comments. Jul 1 2021, 3:11 AM
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-load-store-noalias.s
95–96

You mean the store sequence. Of course.

My concern was related to instructions that appear to commit out of order like the load and the nop after it.
We have flag RetireOOO for cases where we want to allow it.

andreadb added inline comments. Jul 1 2021, 3:21 AM
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-load-store-noalias.s
95–96

If instead you are concerned about whether this patch might end up delaying the second store, then don't worry. That's not how flag -noalias should work: it only affects interactions between loads and stores. It is about whether a younger load is allowed to pass an older store. It should not affect pairs of adjacent stores.
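
To make the scope of the flag concrete, here is a small sketch of the rule as described in this comment. The enum and both functions are hypothetical and do not mirror the LSUnit API; they only state which pairs the flag applies to, not the full ordering rules of the load/store unit.

```cpp
// Hypothetical classification of a memory operation.
enum class MemOp { Load, Store };

// The flag only matters for a younger load that follows an older store.
bool noAliasFlagApplies(MemOp Older, MemOp Younger) {
  return Older == MemOp::Store && Younger == MemOp::Load;
}

// With --noalias=true the younger load may pass the older store;
// with --noalias=false it is conservatively assumed to alias and must wait.
bool youngerLoadMayPassOlderStore(bool AssumeNoAlias) {
  return AssumeNoAlias;
}
```

In particular, noAliasFlagApplies(MemOp::Store, MemOp::Store) is false, which is the point made above about pairs of adjacent stores.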

dmgreen added inline comments. Jul 7 2021, 2:51 AM
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-load-store-noalias.s
95–96

Sorry, I was hoping to look into the schedule over the weekend to see what is going on, but didn't get the chance to look into the correct bit yet.

I believe there are 2 different optimizations that can happen here:

  • Do two stores to the same address have some penalty.
  • Do loads from the same address as a load have a penalty.

The first sounds to me like it should almost always be no, and the second requires store->load forwarding which I believe is very common in most cpus of sufficient complexity.

It comes down to what the latency of a store means. I was under the impression that it didn't mean anything in normal llvm scheduling, but it appears that it does have some effect on the latency from a store to the end of the block (I think). In llvm-mca it means the latency of the write into L1 cache?
The Cortex-A55 optimization guide specifies the latency of stores as 1, and that would probably be a better value to use in the A55 schedule model. I've put together a patch to do that in D105541.

andreadb added inline comments. Jul 7 2021, 4:25 AM
llvm/test/tools/llvm-mca/AArch64/Cortex/A55-load-store-noalias.s
95–96

Just to be clear: the noalias flag does NOT affect store pairs.

Regarding your point 2.
I guess you wanted to say: "Do loads from the same address as a STORE have a penalty.".

STLF assumes the presence of a store buffer, and that memory stores are not immediately propagated to the underlying caches. I don't know how common this is for in-order processors. However, I take from your comment that modern in-order processors may do a lot of out-of-order commit for store operations too.

That being said, llvm-mca doesn't know if the simulated target implements a store buffer, nor does it know how to predict if a younger load would alias an older store. Without that knowledge, it is not possible to correctly predict which loads are valid STLF candidates.

STLF also assumes the presence of a store buffer, and that values are not immediately committed in cache (which I honestly don't know how common it is for in-order processors).
When "noalias=true", we assume that there is no aliasing at all for loads and stores. There is no need to model STLF for this case, because - under that assumption - younger loads will never alias older stores.

When "noalias=false", we conservatively assume that younger loads may alias older stores. However, we don't know if they would partially overlap, or if operations are for misaligned addresses.
So we cannot always optimistically assume that STLF will eventually occur. STLF is subject to a number of constraints in hardware, and different subtargets might impose different restrictions.

In future, we could introduce code annotations/metadata to pass "hints" to llvm-mca. Something like: "assume no-alias"/ "assume perfect-alias" / "assume aligned"; etc. We could then extend the scheduling model in order to provide extra information about store buffers. That would allow us to model STLF.

For now, noalias=false is just a "worst-case scenario" where aliasing always occurs between loads and stores, and STLF is not simulated (so, it implicitly fails for "reasons" that we don't provide).

About the store latency:

In the presence of a store buffer, I'd expect the latency of a store to be 1.
It is literally the cost of placing the value in the store buffer (which I expect to be 1 for most targets).

Strictly speaking, llvm-mca doesn't specially handle latency of loads and stores.
llvm-mca literally ONLY uses whatever latency value is declared by each write.

In all upstream scheduling models, the latency of loads is often defined according to the "load-to-use latency" defined by the vendor. But that's it. There is no special handling in llvm-mca. In future (at least for addressing modes that allow folded loads), it would be nice to distinguish the load contribution (i.e. load-to-use latency) from the total latency.

asavonic updated this revision to Diff 361740. Jul 26 2021, 11:59 AM
  • Rebased the patch and fixed CR comments.

Note that stores retire out-of-order on Cortex-A55 after D105541.

LGTM

Thanks Andrew!

andreadb accepted this revision. Jul 27 2021, 5:30 AM
This revision is now accepted and ready to land. Jul 27 2021, 5:30 AM
This revision was landed with ongoing or failed builds. Jul 29 2021, 4:42 AM
This revision was automatically updated to reflect the committed changes.