This is an archive of the discontinued LLVM Phabricator instance.

[MCA][LSUnit] Track loads and stores until retirement.
Closed · Public

Authored by andreadb on Oct 1 2019, 5:07 AM.

Details

Summary

Before this patch, loads and stores were tracked by their corresponding queues in the LSUnit only from dispatch until the execute stage. In practice we should be more conservative and assume that memory opcodes leave their queues at the retirement stage.

Basically, loads should leave the load queue only when they have completed and delivered their data. We conservatively assume that a load is completed when it is retired. Stores should be tracked by the store queue from dispatch until retirement. In practice, stores can only leave the store queue if their data can be written to the data cache.

This is mostly a mechanical change. With this patch, the retire stage notifies the LSUnit when a memory instruction has retired.
That triggers the release of LDQ/STQ entries.
The only visible change is in memory tests for the bdver2 model. That is because bdver2 is the only model that defines the load/store queue size.
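
As a rough illustration of the lifetime described above, here is a minimal, self-contained sketch. It is not the llvm-mca LSUnit code: `ToyLSUnit`, `dispatch`, `onExecuted` and `onRetired` are hypothetical names chosen for this example, and a queue size of 0 is assumed to mean "unbounded".

```cpp
#include <cassert>

// Toy model only. ToyLSUnit and its members are hypothetical names for
// illustration; they are not the llvm-mca LSUnit interface.
struct ToyLSUnit {
  unsigned LQSize;     // Load queue capacity (assumption: 0 means unbounded).
  unsigned SQSize;     // Store queue capacity (assumption: 0 means unbounded).
  unsigned UsedLQ = 0; // Load queue entries currently in use.
  unsigned UsedSQ = 0; // Store queue entries currently in use.

  ToyLSUnit(unsigned LQ, unsigned SQ) : LQSize(LQ), SQSize(SQ) {}

  bool isLQFull() const { return LQSize != 0 && UsedLQ == LQSize; }
  bool isSQFull() const { return SQSize != 0 && UsedSQ == SQSize; }

  // Dispatch: reserve queue entries. Returns false (a dispatch stall) if a
  // required queue has no free entry.
  bool dispatch(bool MayLoad, bool MayStore) {
    if ((MayLoad && isLQFull()) || (MayStore && isSQFull()))
      return false;
    if (MayLoad)
      ++UsedLQ;
    if (MayStore)
      ++UsedSQ;
    return true;
  }

  // Execute: intentionally a no-op in this model. Completing execution does
  // not free any queue entry.
  void onExecuted(bool /*MayLoad*/, bool /*MayStore*/) {}

  // Retire: the retire stage notifies the unit, which releases the entries.
  void onRetired(bool MayLoad, bool MayStore) {
    if (MayLoad) {
      assert(UsedLQ > 0 && "Retired a load that was never dispatched");
      --UsedLQ;
    }
    if (MayStore) {
      assert(UsedSQ > 0 && "Retired a store that was never dispatched");
      --UsedSQ;
    }
  }
};
```

With a two-entry load queue, a third load stalls at dispatch until one of the first two has retired; calling `onExecuted` in between makes no difference in this model.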

This patch partially addresses PR39830.

Event Timeline

andreadb created this revision. Oct 1 2019, 5:07 AM

Thank you for working on this!
Seems ok to me.

include/llvm/MCA/HardwareUnits/LSUnit.h
298–299 ↗(On Diff #222592)

// By default we conservatively assume that the LDQ receives a load at dispatch.

I think this may explain some of the weird throughput numbers I was seeing
for load-folded instructions (as compared with llvm-exegesis measurements).
Is there a bug that tracks this? I wonder if the correct choice would be
to make it wait for L1 latency here.

andreadb marked 2 inline comments as done. Oct 4 2019, 3:50 AM

Thanks Roman,

include/llvm/MCA/HardwareUnits/LSUnit.h
298–299 ↗(On Diff #222592)

It would be interesting to see what code is compiled and run by exegesis to obtain the latency/throughput of those load-folded instructions. Not knowing what kernel is run by exegesis makes it hard for me to understand your last comment. Could you please post an example in PR39830 (or raise a separate bug)? That would be very useful. Thanks.

298–299 ↗(On Diff #222592)

To further clarify this: the LDQ does receive load opcodes at dispatch.
The 'conservative assumption' here is that loads leave at retire rather than at the end of execution.
Stores are always tracked by the STQ from dispatch until retire.
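
To make the effect of that assumption concrete, the sketch below compares peak load-queue occupancy under the two release policies. It is a toy calculation, not llvm-mca code: `ToyLoad`, `peakOccupancy` and the cycle numbers are made up for illustration, and an entry is assumed to be held over the half-open cycle range from dispatch up to (but not including) the release cycle.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical per-load timestamps (in cycles); not an llvm-mca type.
struct ToyLoad {
  unsigned Dispatch; // Cycle at which the load enters the load queue.
  unsigned Executed; // Cycle at which the load finishes executing.
  unsigned Retired;  // Cycle at which the load retires.
};

// Returns the peak number of loads holding a queue entry at the same time.
// An entry is held during cycles [Dispatch, Release), where Release is the
// retire cycle if ReleaseAtRetire is true and the executed cycle otherwise.
// The peak of overlapping intervals is always reached at some dispatch
// cycle, so it is enough to count the live loads at each dispatch.
unsigned peakOccupancy(const std::vector<ToyLoad> &Loads,
                       bool ReleaseAtRetire) {
  unsigned Peak = 0;
  for (const ToyLoad &A : Loads) {
    unsigned Live = 0;
    for (const ToyLoad &B : Loads) {
      unsigned Release = ReleaseAtRetire ? B.Retired : B.Executed;
      if (B.Dispatch <= A.Dispatch && A.Dispatch < Release)
        ++Live;
    }
    Peak = std::max(Peak, Live);
  }
  return Peak;
}
```

For three loads with (dispatch, executed, retired) cycles of (0, 3, 10), (1, 4, 11) and (4, 7, 12), releasing at execute gives a peak of 2 entries, while releasing at retire gives a peak of 3. On a model with a small load queue, that extra occupancy can show up as additional dispatch stalls.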

lebedev.ri accepted this revision. Oct 6 2019, 10:58 AM

I think we should proceed with this.
The question I raised can be addressed later.
It basically is: "if the L1 latency is <n> cycles, and we start executing these
load-folded instructions <n> cycles earlier, then where are those cycles
*actually* spent, if we're in and out of the load queue without spending any cycles there?"

This revision is now accepted and ready to land. Oct 6 2019, 10:58 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. Oct 8 2019, 3:46 AM
Herald added a subscriber: hiraditya.

The question I raised earlier in this review (where the L1 latency cycles are actually spent for load-folded instructions) has finally been moved to https://bugs.llvm.org/show_bug.cgi?id=39830#c4