This is an archive of the discontinued LLVM Phabricator instance.

[MCA] Introducing incremental SourceMgr and resumable pipeline
ClosedPublic

Authored by myhsu on Jun 5 2022, 9:18 PM.

Details

Summary

The new resumable mca::Pipeline capability introduced in this patch allows users to save the current state of pipeline and resume from the very checkpoint.
It is better (but not require) to use with the new IncrementalSourceMgr, where users can add mca::Instruction incrementally rather than having a fixed number of instructions ahead-of-time.

Note that we're using unit tests to test these new features. Because integrating them into the llvm-mca tool will make too many churns.

This is patch 1/2 of MCA Daemon's upstreaming efforts.

Diff Detail

Event Timeline

myhsu created this revision.Jun 5 2022, 9:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 5 2022, 9:18 PM
myhsu requested review of this revision.Jun 5 2022, 9:18 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 5 2022, 9:18 PM

Hi Min,

Thanks for this patch. I have just started looking at it. I left a few comments inline.

Thanks!

llvm/include/llvm/MCA/Pipeline.h
54

Can you use an enum class here?

llvm/include/llvm/MCA/SourceMgr.h
26–59

This is not ideal, but I can't think of any other obvious way to avoid this refactoring.

Did you notice a perf regression after this change on normal llvm-mca runs with several iterations?

I wonder if you could make SourceMgr a template class, and have it specialised based on whether you want a "circular buffer" behaviour or not.

Ideally, class SourceMgr should only be visible to the so-called pipeline entry stage(s). In llvm-mca there are two default entry stages: EntryStage (for the out-of-order pipeline) and InOrderIssueStage (for the in-order pipeline). Any other module should not use SourceMgr; for those modules, the input assembly sequence can often be easily abstracted using an ArrayRef<MCInst>. This is for example what we do in some Views.

That being said, I am seeing uses of SourceMgr in the CustomBehaviour API. For example, AMDGPU uses a CustomBehaviour to get access to the immutable MCInst sequence and simply iterate over the MCInst. This might have been an oversight/poor design choice from our end, since we don't seem to really need any of the extra SourceMgr functionalities in CustomBehaviour. In fact, passing an ArrayRef<MCInst> would be enough (at least for the AMDGPU).
'
In retrospect, I wonder whether there was a reason for that choice.
@holland11 Would it cause problems if we changed the CustomBehaviour constructor so that it accepts an ArrayRef<MCInst> instead of a SourceMgr? Do you remember why we needed it?

32–34

Is there a reason why you want to add this method?

andreadb added inline comments.Jun 6 2022, 11:20 AM
llvm/include/llvm/MCA/Stages/Stage.h
21

Is there a reason why you use <system_error> for your new Error type?

Out of curiosity, what is one of the motivating use cases for this?

As far as reviewing goes, I haven't been actively involved in mca enough lately to consider my opinions or input of much value, but I appreciate being notified of these PRs so that I can learn from them :).

myhsu updated this revision to Diff 434616.Jun 6 2022, 2:35 PM
myhsu marked 2 inline comments as done.

Thank you for your comments. The SourceMgr performance regression is pretty interesting, I'll look deep into that.
Meanwhile, I just fixed some minor issues first.

llvm/include/llvm/MCA/SourceMgr.h
26–59

This is not ideal, but I can't think of any other obvious way to avoid this refactoring.

Did you notice a perf regression after this change on normal llvm-mca runs with several iterations?

Interesting...I haven't done such measurement and I didn't know vtable dispatching in this case would cause noticeable regression. I'll look into it.

I wonder if you could make SourceMgr a template class, and have it specialised based on whether you want a "circular buffer" behaviour or not.

Right, if vtable is the bottleneck here, another similar solution can be:

template <class SubClass> SourceMgr {
  bool hasNext() const {
    return static_cast<const SubClass *>(this)->hasNext();
  }
};

class MySourceMgr : public SourceMgr<MySourceMgr> {
  bool hasNext() const { /*not a virtual function*/ }
};

Then consumers of SourceMgr need to accept something with a template. Which is not a big problem if we only use it in few places (i.e. less fiddling with template arguments) so I'm with the idea you put below to limit as many SourceMgr usages as possible, for instance in CustomBehavior. That said, for EntryStage and InOrderIssueStage, we need to move lots of code from .cpp file to the header file since we're taking template version of SourceMgr, I'm wondering what's your thoughts about this?

Ideally, class SourceMgr should only be visible to the so-called pipeline entry stage(s). In llvm-mca there are two default entry stages: EntryStage (for the out-of-order pipeline) and InOrderIssueStage (for the in-order pipeline). Any other module should not use SourceMgr; for those modules, the input assembly sequence can often be easily abstracted using an ArrayRef<MCInst>. This is for example what we do in some Views.

Sounds reasonable to me.

That being said, I am seeing uses of SourceMgr in the CustomBehaviour API. For example, AMDGPU uses a CustomBehaviour to get access to the immutable MCInst sequence and simply iterate over the MCInst. This might have been an oversight/poor design choice from our end, since we don't seem to really need any of the extra SourceMgr functionalities in CustomBehaviour. In fact, passing an ArrayRef<MCInst> would be enough (at least for the AMDGPU).
'
In retrospect, I wonder whether there was a reason for that choice.
@holland11 Would it cause problems if we changed the CustomBehaviour constructor so that it accepts an ArrayRef<MCInst> instead of a SourceMgr? Do you remember why we needed it?

32–34

Originally we were using begin/end methods here, but with the current scheme (i.e. SourceMgr as an abstract class) it's easier to have such getInstructions rather than worry about types for the iterator (returning from begin/end) -- It's not impossible to use begin/end, it's just easier to returning an ArrayRef. Also, it works just fine with current usages. Do you think it's better to use begin/end instead?

llvm/include/llvm/MCA/Stages/Stage.h
21

I didn't know there is llvm::inconvertibleErrorCode() when I first wrote this code. I'll change to use inconvertibleErrorCode() instead.

myhsu added a comment.EditedJun 6 2022, 2:50 PM

Out of curiosity, what is one of the motivating use cases for this?

These changes are originated from a downstream project which tries use MCA in a larger scale but observed some difficulties on flexibility as well as increasing memory footprint when the input size is big. This patch (and the following one) upstream some of the improvements on those issues because we think making MCA more scalable and versatile is generally a good idea. There might be some community members with a similar need, too. Here is the full story: https://youtu.be/ZGEP7JEIKNo

As far as reviewing goes, I haven't been actively involved in mca enough lately to consider my opinions or input of much value, but I appreciate being notified of these PRs so that I can learn from them :).

myhsu added a comment.Jun 6 2022, 3:46 PM

Did you notice a perf regression after this change on normal llvm-mca runs with several iterations?

Following up on the potential performance regression caused by abstracting away SourceMgr, here is my experiment:

  • Baseline SHA: 8f7b14898fe32f9c41059517a5a3872ef089174b
  • Host arch / OS: Core i7-8700K / Ubuntu 20.04
  • MCA command: llvm-mca -mcpu=btver2 -mtriple=x86_64-unknown-linux /path/to/llvm-project/llvm/test/tools/llvm-mca/X86/BtVer2/resources-sse2.s -o /dev/null
  • Perf command: perf stat --repeat=1000 -- <MCA command>

Here are the baseline numbers:

      48.43 msec task-clock                #    0.997 CPUs utilized            ( +-  0.08% )
          0      context-switches          #    0.008 K/sec                    ( +-  7.25% )
          0      cpu-migrations            #    0.000 K/sec                  
        767      page-faults               #    0.016 M/sec                    ( +-  0.01% )
220,767,126      cycles                    #    4.558 GHz                      ( +-  0.01% )
477,655,890      instructions              #    2.16  insn per cycle           ( +-  0.00% )
112,417,237      branches                  # 2321.072 M/sec                    ( +-  0.00% )
  1,266,955      branch-misses             #    1.13% of all branches          ( +-  0.02% )

  0.0485851 +- 0.0000370 seconds time elapsed  ( +-  0.08% )

Here are the numbers after this particular patch is applied:

      47.90 msec task-clock                #    0.997 CPUs utilized            ( +-  0.07% )
          0      context-switches          #    0.003 K/sec                    ( +-  7.75% )
          0      cpu-migrations            #    0.000 K/sec                  
        762      page-faults               #    0.016 M/sec                    ( +-  0.01% )
220,706,375      cycles                    #    4.608 GHz                      ( +-  0.01% )
479,569,839      instructions              #    2.17  insn per cycle           ( +-  0.00% )
112,961,518      branches                  # 2358.488 M/sec                    ( +-  0.00% )
  1,261,302      branch-misses             #    1.12% of all branches          ( +-  0.02% )

  0.0480387 +- 0.0000322 seconds time elapsed  ( +-  0.07% )

If we look into the number of instructions, a more stable metric, there is an increase on that, but it only accounts about 0.4%.

Did you notice a perf regression after this change on normal llvm-mca runs with several iterations?

Following up on the potential performance regression caused by abstracting away SourceMgr, here is my experiment:
<snip>

If we look into the number of instructions, a more stable metric, there is an increase on that, but it only accounts about 0.4%.

That's great thanks.

If that's the case, then I am less worried about using a base class with just pure virtual methods.

llvm/include/llvm/MCA/SourceMgr.h
32–34

In principle, I prefer using ArrayRef instead using begin/end iterators.

The only problem with ArrayRef (at least in this context) is that it assumes that the underlying data structure in SourceMgr can be seen as an array-like struct.

In your patch, the std::deque breaks that assumption. As the docs say, std::deque doesn't guarantee that elements are stored contiguously.

For that reason, having that method returning an ArrayRef is a bit limiting.
Ideally, what you really want is the ability to return an "iterator range".

I noticed that llvm provides a class iterator_range. Maybe using an llvm::iterator_range could be a compromise?

myhsu marked 2 inline comments as done.Jun 8 2022, 11:24 AM
myhsu added inline comments.
llvm/include/llvm/MCA/SourceMgr.h
32–34

In your patch, the std::deque breaks that assumption. As the docs say, std::deque doesn't guarantee that elements are stored contiguously.

Correct, though IncrementalSourceMgr::getInstructions will eventually be marked as unimplemented / not applicable. Because when the users decide to recycle instructions, it's really hard to return a meaningful sequence of instructions. That said, we can still return the sequence of instructions stored in InstStorage anyway as a best-effort result.

Ideally, what you really want is the ability to return an "iterator range".

Agree, though this goes back to a previous concern where we need to abstract away the type of iterator (as iterator_range still takes the type of iterator as its template argument). I will do some explorations on this topic.

andreadb accepted this revision.Jun 9 2022, 5:51 AM

Hi Min,

llvm/include/llvm/MCA/SourceMgr.h
32–34

Right. I see the problem.

If you can't find anything better, then your current approach (i.e. marking getInstructions as unimplemented) looks good to me.

I don't have any other requests.

Thanks!

This revision is now accepted and ready to land.Jun 9 2022, 5:51 AM
myhsu marked an inline comment as done.EditedJun 10 2022, 9:15 AM

Thank you @andreadb !
It will be great if you can help me review the follow-up patch, D127084, which implement the instruction recycling feature.

Thank you @andreadb !
It will be great if you can help me review the follow-up patch, D127084, which implement the instruction recycling feature.

Sure,

I'll review that patch too.

-Andrea

This revision was landed with ongoing or failed builds.Jun 24 2022, 3:41 PM
This revision was automatically updated to reflect the committed changes.
hubert.reinterpretcast added inline comments.
llvm/unittests/tools/llvm-mca/MCATestBase.cpp
85

This is causing failures:
https://lab.llvm.org/buildbot/#/builders/57/builds/19267

/home/buildbots/docker-RHEL-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/llvm/unittests/tools/llvm-mca/MCATestBase.cpp:85:16: error: moving a local object in a return statement prevents copy elision [-Werror,-Wpessimizing-move]
        return std::move(NewE);
               ^
/home/buildbots/docker-RHEL-buildbot/SetupBot/worker_env/ppc64le-clang-rhel-test/clang-ppc64le-rhel/llvm-project/llvm/unittests/tools/llvm-mca/MCATestBase.cpp:85:16: note: remove std::move call here
        return std::move(NewE);
               ^~~~~~~~~~    ~
1 error generated.
thakis added a subscriber: thakis.Jun 24 2022, 5:36 PM
thakis added inline comments.
llvm/unittests/tools/llvm-mca/CMakeLists.txt
26

Why is this file so complicated? All other CMakeLists.txt in llvm/unittests are very simple.

It kind of looks like this links in some, but not all, code in tools/llvm-mca/Views in to the test, instead of doing a library-level dependency. That seems
a) strange from a layering PoV
b) ODR-violation-prone in the future

Why is this designed the way it is, instead of more like all the other tests?

thakis added inline comments.Jun 24 2022, 5:55 PM
llvm/unittests/tools/llvm-mca/CMakeLists.txt
26

For example, the llvm/unittests/tools/llvm-exegesis/* does something somewhat similar (…I think?) with much less machinery, by having a test binary per target. Could that design work here?

thakis added inline comments.Jun 24 2022, 5:57 PM
llvm/unittests/tools/llvm-mca/X86/CMakeLists.txt
3

(But, similar to the llvm-exegesis tests, it's weird that this depends on internals of the X86 target. llvm-exegesis "at least" makes this choice at the tool level, but here it's only the test that does this. Is there a way to instead use the public API instead of depending on internal headers of lib/Target/X86?)

myhsu added inline comments.Jun 24 2022, 6:01 PM
llvm/unittests/tools/llvm-mca/CMakeLists.txt
26

This file is taking references from unittest/tools/llvm-exegesis/CMakeLists.txt (which is actually more complicated than this one) because llvm-mca has a similar need with llvm-exegesis.

The rationale behind this is that we want to build in-memory MCInst objects as the test input (which requires target specific header files) for a simpler test harness.
That said, I think moving files under tools/llvm-mca/Views into include/llvm/MCA/Views (and its lib/MCA counterpart) is a good future idea @andreadb what do you think?

llvm/unittests/tools/llvm-mca/MCATestBase.cpp
85

Sorry for that, will be fixed shortly...

myhsu added inline comments.Jun 24 2022, 6:10 PM
llvm/unittests/tools/llvm-mca/CMakeLists.txt
26

For example, the llvm/unittests/tools/llvm-exegesis/* does something somewhat similar (…I think?) with much less machinery, by having a test binary per target. Could that design work here?

I don't think so, there is only a single test binary, LLVMExegesisTests, same as us. We also assigned each target a folder (currently there is only X86).

thakis added inline comments.Jun 24 2022, 6:16 PM
llvm/unittests/tools/llvm-mca/CMakeLists.txt
26

Oh I see, I hadn't seen D106464. I agree that this is similar to what llvm-exegesis's tests now do, thanks.

(It'd still be nice to randomly pull in files from llvm-mca/Views.)

myhsu marked an inline comment as done.Jun 24 2022, 6:36 PM
myhsu added inline comments.
llvm/unittests/tools/llvm-mca/MCATestBase.cpp
85