This is an archive of the discontinued LLVM Phabricator instance.

[MCA] Adding the CustomBehaviour class to llvm-mca
ClosedPublic

Authored by holland11 on Jun 11 2021, 1:23 PM.

Details

Summary

TLDR:
Some instructions are not defined well enough within the target’s scheduling model for llvm-mca to be able to properly simulate its behaviour. The ideal solution to this situation is to modify the scheduling model, but that’s not always a viable strategy. Maybe other parts of the backend depend on that instruction being modelled the way that it is. Or maybe the instruction is quite complex and it’s difficult to fully capture its behaviour with tablegen. The CustomBehaviour class (which I will refer to as CB frequently) is designed to provide intuitive scaffolding for developers to implement the correct modelling for these instructions.

Some quick notes before I get into some of the implementation details:
I built this class specifically for a downstream architecture, but I tried to design it in a way that would be useful for all targets. I wanted to be able to submit these changes upstream so I found an instruction from the AMDGPU instruction set that is used quite frequently, but isn’t currently modelled properly (resulting in significantly inaccurate analysis generated by llvm-mca). My knowledge of AMDGPU is limited and there is also an issue with flags not being set correctly on many of the relevant AMDGPU instructions so my implementation isn’t perfect. However, it should still provide an example use case for CB and also an example of how someone could use the class. If any AMDGPU developers would like to improve my implementation, I encourage you to do so. And on that note, if the AMDGPU folks don’t want this at all, we can definitely remove it from the patch and stick to pushing CB without any upstream examples.

This patch only has CB integrated into the in-order pipeline, however it was designed to fit into both pipelines. I’m much more familiar with the in-order pipeline (and CB fits into it a bit more intuitively), but I plan on adding it to the out-of-order pipeline in the future. I would like some feedback on what would be the best way to do that, but for now I want to keep the scope of this patch as limited as possible so I will make another post soon asking for feedback and to open up a discussion towards integrating CB within the out-of-order pipeline.

Also, I'm Canadian so I hope the 'u' in behaviour doesn't bother anyone :).

Implementation details:

llvm-mca does its best to extract relevant register, resource, and memory information from every MCInst when lowering them to an mca::Instruction. It then uses this information to detect dependencies and simulate stalls within the pipeline. For some instructions, the information that gets captured within the mca::Instruction is not enough for mca to simulate them properly. In these cases, there are two main possibilities:

  1. The instruction has a dependency that isn’t detected by mca.
  2. mca is incorrectly enforcing a dependency that shouldn’t exist.

For the rest of this discussion, I will be focusing on (1), but I have put some thought into (2) and I may revisit it in the future.

So we have an instruction that has dependencies that aren’t picked up by mca. The basic idea for both pipelines in mca is that when an instruction wants to be dispatched, we first check for register hazards and then we check for resource hazards. This is where CB is injected. If no register or resource hazards have been detected, we make a call to CustomBehaviour::checkCustomHazard() to give the target specific CB the chance to detect and enforce any custom dependencies.

The return value for checkCustomHazaard() is an unsigned int representing the (minimum) number of cycles that the instruction needs to stall for. It’s fine to underestimate this value because when StallCycles gets down to 0, we’ll end up checking for all the hazards again before the instruction is actually dispatched. However, it’s important not to overestimate the value and the more accurate your estimate is, the more efficient mca’s execution can be.

In general, for checkCustomHazard() to be able to detect these custom dependencies, it needs information about the current instruction and also all of the instructions that are still executing within the pipeline. The mca pipeline uses mca::Instruction rather than MCInst and the current information encoded within each mca::Instruction isn’t sufficient for my use cases. I had to add a few extra attributes to the mca::Instruction class and have them get set by the MCInst during instruction building. For example, the current mca::Instruction doesn’t know its opcode, and it also doesn’t know anything about its immediate operands (both of which I had to add to the class).

With information about the current instruction, a list of all currently executing instructions, and some target specific objects (MCSubtargetInfo and MCInstrInfo which the base CB class has references to), developers should be able to detect and enforce most custom dependencies within checkCustomHazard. If you need more information than is present in the mca::Instruction, feel free to add attributes to that class and have them set during the lowering sequence from MCInst.

Fortunately, in the in-order pipeline, it’s very convenient for us to pass these arguments to checkCustomHazard. The hazard checking is taken care of within InOrderIssueStage::canExecute(). This function takes a const InstRef as a parameter (representing the instruction that currently wants to be dispatched) and the InOrderIssueStage class maintains a SmallVector<InstRef, 4> which holds all of the currently executing instructions. For the out-of-order pipeline, it’s a bit trickier to get the list of executing instructions and this is why I have held off on implementing it myself. This is the main topic I will bring up when I eventually make a post to discuss and ask for feedback.

CB is a base class where targets implement their own derived classes. If a target specific CB does not exist (or we pass in the -disable-cb flag), the base class is used. This base class trivially returns 0 from its checkCustomHazard() implementation (meaning that the current instruction needs to stall for 0 cycles aka no hazard is detected). For this reason, targets or users who choose not to use CB shouldn’t see any negative impacts to accuracy or performance (in comparison to pre-patch llvm-mca).

AMDGPU Example:

In AMDGPU, there is a family of s_waitcnt instructions. The purpose of these instructions is to force a stall until certain memory operations have been completed. For example, if we are loading a value to gpr5 and then using gpr5, one of the backend passes will detect this and place an s_waitcnt instruction before the use to ensure that gpr5 has been written to. These instructions interact with 4 different ‘counters’ (vscnt, expcnt, lgkmcnt, and vscnt) and s_waitcnt can force a wait on any number of them. Each memory instruction will also interact with 1 or more of these counters and (this is an oversimplification) the instruction will increment those counters when it begins executing, and then decrement the counters once it has written its result.

From a high level, when CB sees an s_waitcnt instruction, it will first decode its operand to determine what it’s waiting for. CB then needs to look at all of the currently executing instructions to determine if we need to stall or not. This is where things get a little tricky with respect to AMDGPU.

For each instruction within the pipeline, we need to determine if that instruction interacts with any of the counters (and which ones specifically). The backend pass that I mentioned earlier (that adds s_waitcnt instructions during codegen) has a function that does just that. However, this pass is dealing with ‘psuedo’ instructions, whereas in mca, we are dealing with ‘real’ instructions. This is relevant because many of the flags that the logic depends on (such as mayLoad, mayStore, and even some target specific flags) are not always shared between the ‘pseudo’ and ‘real’ instructions. This makes it really difficult to determine with certainty which instructions interact with which counters.

To combat this, I made some conservative assumptions (such as assuming mayLoad and mayStore will always be true). These assumption result in some instructions being flagged as interacting with more CNTs than they actually do. I do not make any assumptions about the target specific flags though and this results in certain instructions (such as ds_read and ds_write variants) not being detected at all. I highly suggest that if the AMDGPU folks would like to use llvm-mca, they look into having the flags properly applied to the ‘real’ instructions and then modifying the logic from this stage to be copied from the SIInsertWaitcnts::updateEventWaitcntAfter() function.

Another issue that I had to face was that mca (properly) enforces in-order writes on all instructions (even ones that don’t write anything such as s_waitcnt). This means that if there is a 300 cycle memory instruction currently executing, and then an s_waitcnt comes along (that isn’t meant for the 300 cycle instruction), we’ll still end up stalling for those ~300 cycles (because otherwise, the s_waitcnt instruction would finish before the memory instruction). However, mca uses a RetireOOO flag that can be set on scheduling classes within tablegen. For this reason, I modified the AMDGPU tablegen slightly to set that flag for the s_waitcnt variants. The flag is only used within mca, so (in theory) this shouldn’t have any side-effects. As mentioned earlier, if the AMDGPU folks do not want these changes, I’m happy to remove this example from the patch.

With all that background information out of the way, let’s take a look at a few examples.

ds_read_u8 v1, v0
s_waitcnt expcnt(0)
v_mul_lo_u32 v5, v3, v4
s_waitcnt lgkmcnt(0)

ds_read_u8 doesn’t interact with expcnt at all so the first s_waitcnt shouldn’t force a stall. The behaviour of the v_mul instruction is one that I’m not positive on though as far as in-order writes are concerned (in AMDGPU, should that first v_mul be allowed to complete before the ds_read?). Either way, the first wait should be dispatched and executed as soon as it’s seen.

Here’s the mca timeline BEFORE this patch:

Timeline view:
                    0123456789          
Index     0123456789          0123456789
[0,0]     DeeeeeeeeeeeeeeeeeeeE    .      ds_read_u8 v1, v0
[0,1]     .    .    .    .  DeE    .      s_waitcnt expcnt(0)
[0,2]     .    .    .    .   DeeeeeeeE    v_mul_lo_u32 v5, v3, v4
[0,3]     .    .    .    .    .    DeE    s_waitcnt lgkmcnt(0)

Notice how mca’s enforcement of the in-order writes causes the first s_waitcnt to stall.

Now here’s the mca timeline without CB, but with the RetireOOO flag set for all s_waitcnt instructions:

Timeline view:
                    0123456789 
Index     0123456789          0
[0,0]     DeeeeeeeeeeeeeeeeeeeE   ds_read_u8 v1, v0
[0,1]     .DeE .    .    .    .   s_waitcnt expcnt(0)
[0,2]     .    .    . DeeeeeeeE   v_mul_lo_u32 v5, v3, v4
[0,3]     .    .    .  DeE    .   s_waitcnt lgkmcnt(0)

And here’s with CB:

Timeline view:
                    0123456789   
Index     0123456789          012
[0,0]     DeeeeeeeeeeeeeeeeeeeE .   ds_read_u8 v1, v0
[0,1]     .DeE .    .    .    . .   s_waitcnt expcnt(0)
[0,2]     .    .    . DeeeeeeeE .   v_mul_lo_u32 v5, v3, v4
[0,3]     .    .    .    .    DeE   s_waitcnt lgkmcnt(0)

The first s_waitcnt correctly doesn’t get stalled. The second s_waitcnt correctly does get stalled.

Here’s a different example (the formatting breaks in this post due to the long sequence of characters so I use screenshots for the timelines):

s_load_dwordx4 s[0:3], s[4:5], 0x0
s_waitcnt vmcnt(0) lgkmcnt(0)
s_waitcnt_vscnt null, 0x0
global_load_dword v1, v0, s[0:1] glc dlc
s_waitcnt vmcnt(0)

Here is the timeline BEFORE this patch has been applied:
https://i.imgur.com/OZQtrE1.png

You’ll notice that the simulation is actually fairly accurate for this example, but it’s accurate for the wrong reasons. The s_waitcnt instructions aren’t waiting because they know they have to wait for the specific s_load_dword and global_load_dword instructions. They are waiting because mca is enforcing the in-order write-backs. While this may produce a reasonable output in this case, it’s not good in general.

Here is the timeline AFTER this patch has been applied, but with the -disable-cb command (so basically what it would look like to run the example through mca with s_waitcnt instructions having the RetireOOO flag):
https://i.imgur.com/s62cKYJ.png

You can see now that the s_waitcnt instructions are allowed to finish out-of-order, and without CB to enforce their behaviour, they don’t force a stall.

Now here’s the patch with CB enabled:
https://i.imgur.com/VMlG0DO.png

CB is able to detect that s_waitcnt vmcnt(0) lgkmcnt(0) needs to wait for s_load_dwordx4 to finish. And s_waitcnt vmcnt(0) needs to wait for global_load_dword to finish.

Overall, due to the flags not being reliable, if the AMDGPU developers are not interested in improving / correcting my implementation, it would probably be best to remove it from the patch. I really just wanted an upstream example to show how CB could be used, but I didn’t anticipate all of the complications that arose from this specific example. llvm-mca can be a really helpful tool for many different purposes, but especially for finding mistakes in the scheduling models. For example, I noticed that a sequence of ds_read instructions do not get simulated properly:

ds_read_u8 v1, v0
ds_read_u8 v2, v0 offset:1
ds_read_u8 v3, v0 offset:2
ds_read_u8 v4, v0 offset:3
ds_read_u8 v5, v0 offset:4
Timeline view:
                    0123456789          0123456789          0123456789          0123456789          0123456789 
Index     0123456789          0123456789          0123456789          0123456789          0123456789          0
[0,0]     DeeeeeeeeeeeeeeeeeeeE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .   ds_read_u8 v1, v0
[0,1]     .    .    .    .    DeeeeeeeeeeeeeeeeeeeE    .    .    .    .    .    .    .    .    .    .    .    .   ds_read_u8 v2, v0 offset:1
[0,2]     .    .    .    .    .    .    .    .    DeeeeeeeeeeeeeeeeeeeE    .    .    .    .    .    .    .    .   ds_read_u8 v3, v0 offset:2
[0,3]     .    .    .    .    .    .    .    .    .    .    .    .    DeeeeeeeeeeeeeeeeeeeE    .    .    .    .   ds_read_u8 v4, v0 offset:3
[0,4]     .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DeeeeeeeeeeeeeeeeeeeE   ds_read_u8 v5, v0 offset:4

As far as I understand, these instructions shouldn’t have to wait for each other. However, ds_read_u8 is modeled as having both a read and a write to the M0 register which mca detects as a register hazard. I looked in the ISA and I’m fairly confident that this isn’t accurate. If you start using mca, you will likely be able to find a lot of other discrepancies such as this one. And if you do want to start using the tool, the CB class can offer you the scaffolding to implement any instruction behaviour that you can’t / don’t want to model in the tablegen itself.

Diff Detail

Event Timeline

holland11 created this revision.Jun 11 2021, 1:23 PM
holland11 requested review of this revision.Jun 11 2021, 1:23 PM
holland11 updated this revision to Diff 351556.Jun 11 2021, 1:53 PM

Removed some unnecessary includes and fixed header formatting for a few files.

holland11 updated this revision to Diff 351610.Jun 11 2021, 5:54 PM

Made some modifications to please clang-tidy.

holland11 updated this revision to Diff 351619.Jun 11 2021, 7:08 PM

More clang-tidy warning fixes.

holland11 updated this revision to Diff 351647.Jun 12 2021, 3:10 AM

clang-format minor tweak.

Thanks for the patch.

I like the idea of adding a CustomBehaviour class in llvm-mca to customise/post-process dependencies and model extra delays due to unsimulated structural hazards.

I had a quick look at the patch, and so far, I like the general idea/design. There are a couple of things which must be done differently, or that needs to be improved. But I think that hopefully all those points can be easily addressed.

Speaking about the AMDGPU specific changes. It may be worthy to split this patch into two patches at least.
Specifically, the scheduling model changes related to the RetireOOO flags (as well as the new tests) should be committed as a separate patch, and possibly reviewed by @foad (and/or whoever else is interested in supporting llvm-mca for AMDGPU).

Thanks,
Andrea

llvm/docs/CommandGuide/llvm-mca.rst
989–994

I think this paragraph should be more explicit on what is the goal of the CustomBehaviour class.

The first two sentences should probably be merged together.
The focus on those two sentences is the "lack of expressiveness" (of scheduling models and machine instructions). That lack of expressivenes is what eventually leads to poor simulations.

Starting the entire paragraph with that first sentence is not great; in the absence of context, it sounds more like there are bugs in llvm-mca itself.

I also noticed that you don't explain what is the main goal of this new CustomBehaviour class. You have done a much better job at introducing it in the summary of this bug. At this point, you should at least drop a hint about what is the most common goal for it:
"it allows targets to customise data dependencies, and model delays which cannot be normally predicted just by simply evaluating register defs/uses". Something like that... Maybe you can be more accurate, but keep in mind that the description should not be too low level in this document.

llvm/include/llvm/MCA/Instruction.h
512

Can this vector be fully optional?

I am thinking about using a simple std::vector here, rather than a SmallVector, which would force us to consume space for a default number of inline elements (8 in this case).

For the (majority) of targets that don't require/implement a custom behaviour class, that field is literally wasted space. In general, I am already concerned about the sizeof mca::instruction. Those objects tend to be often too big, and this change has the potential of making them even bigger.

llvm/lib/MCA/InstrBuilder.cpp
619–621

Three questions:

  1. Is it necessary to add MCAOperands to every single instruction?
  2. Is it possible to do this as a post-processing step, if requested by the simulator?
  3. Can we limit the number of MCAOperand objects that we store in an mca::Instruction?

Ideally, this logic should be split from the normal InstrBuilder::createInstruction and moved to a separate InstrBuilder::postProcessInstruction step. That post-processing step would then be performed only if the so-called "custom behaviour" is requested.

In order to address points 1. and 3., it may be useful to implement a new "MCA Lowering Context" class, which knows about:
a) which instructions require a list of MCAOperand, and
b) which operands must be lowered to MCAOperand for the custom hazard check to work.

The MCA lowering context would know about which instructions/operands require a custom post-processing step, and skip the post-processing step for those instructions that can be safely ignored.

Essentially, have a context class that filters opcodes and rules out which operands are important to keep as MCAOperand, and which are not important for the custom behaviour class.

To give you an idea: if we know that the custom behaviour class is only interested in the first operand of a specific load opcode, then we should only create MCAOperand objects for that load, and just for its first operand.

In most cases (correct me if I am wrong), I suspect that we may not need to store MCAOperands for every single instruction out there. So, blindly lowering every MCOperand into MCAOperand - while it work in general - is a bit extreme in my opinion. Keep in mind that most targets won't implement any custom behaviour for their subtargets, and for those targets, translating to MCAOperand would be unnecessary, and it would simply waste space.

Essentially, if a custom behaviour is not requested/implemented, then we should ignore this extra post-processing step entirely, and not populate the vector of MCAOperands (this is also why it may be worthy to make the new MCAOperand vector field a std::vector).

I hope it makes sense.

llvm/tools/llvm-mca/Views/DispatchStatistics.cpp
80

Please just use something like "Uncategorised Structural Hazards stalls" (or something like that).

llvm/tools/llvm-mca/lib/AMDGPU/AMDGPUCustomBehaviour.cpp
44–45

I am not familiar with those pseudos. However, I would be careful about asserting here. One day, people might want to drive an mca pipeline from a backend pass, and even after regalloc there might still be MachineInstr pseudos around.
Just merge these with the "default" case for now (and maybe add a small code comment).

48–52

I don't think it is a good idea to generate a warning here. Instead, just raise a bug for it once this patch is committed in main. For now, please convert it into a TODO comment.

andreadb added inline comments.Jun 12 2021, 6:37 AM
llvm/lib/MCA/InstrBuilder.cpp
619–621

I forgot to mention that, if ordering of operands is important for the custom behaviour check to work, and you are concerned about having gaps in the MCAOperand sequence, then you could use an extra field to store the original operand index.

holland11 added a comment.EditedJun 12 2021, 7:47 PM

@andreadb I appreciate all of the suggestions. They all make sense to me and I'll get started on implementing them.

  1. Once I've made the changes, would it be better for me to update the diff within this post, or create a new post with the same title/body (but with a new diff)?
  1. Your concerns about the MCAOperand list are completely reasonable. I'll get started on making the list optional and only storing the operands that the target's CB requests. What do you think would be the best way to make it 'optional' (with respect to the actual datatype of the MCAOperand list within the mca::Instruction class)? Would making it a std::vector that defaults to a size of 0 be smart or can you think of a better way?

What I'm thinking is that within the CustomBehaviour.h file, there will be another class InstrPostProcess (open to better names). This class will have a public method processInstruction() (modify may be a better word). The generic class's method won't do anything, but targets who implement their own CB will also implement their own InstrPostProcess where they can choose to override the processInstruction() method. After the MCInst is lowered to mca::Instruction, but before we add it to the LoweredSequence, we will call InstrPostProcess::processInstruction(). This method will take both the MCInst and mca::Instruction and will make any modifications / additions to the mca::Instruction.

This way, for targets that don't have a CB or don't need any modifications to the mca::Instruction (such as storing specific operands), the processInstruction() method can trivially return without doing anything. And for targets who do want to make modifications to the mca::Instruction, they can have a switch statement within the processInstruction() to only modify the desired instructions.

I mentioned in the original post that CB is currently only designed to handle adding dependencies, but not removing them. As a bonus with this design, we'd get an easy way to remove undesired dependencies (on top of the original intention to only store what we need).

Edit: I just did a quick, rough, and untested re-design for the MCAOperand issue if you want a more explicit look at what this plan looks like (don't feel any pressure to look at it though, I'm just sharing in case you're curious) https://pastebin.com/rL7w0WC4

andreadb added a comment.EditedJun 13 2021, 3:50 AM

@andreadb I appreciate all of the suggestions. They all make sense to me and I'll get started on implementing them.

  1. Once I've made the changes, would it be better for me to update the diff within this post, or create a new post with the same title/body (but with a new diff)?

You don't need to create a new review. You can always update the review description if you want, and keep everything as part of this same review.

  1. Your concerns about the MCAOperand list are completely reasonable. I'll get started on making the list optional and only storing the operands that the target's CB requests. What do you think would be the best way to make it 'optional' (with respect to the actual datatype of the MCAOperand list within the mca::Instruction class)? Would making it a std::vector that defaults to a size of 0 be smart or can you think of a better way?

I would simply use a std::vector. That way, it would start empty (which is what we want most of the times).

What I'm thinking is that within the CustomBehaviour.h file, there will be another class InstrPostProcess (open to better names). This class will have a public method processInstruction() (modify may be a better word). The generic class's method won't do anything, but targets who implement their own CB will also implement their own InstrPostProcess where they can choose to override the processInstruction() method. After the MCInst is lowered to mca::Instruction, but before we add it to the LoweredSequence, we will call InstrPostProcess::processInstruction(). This method will take both the MCInst and mca::Instruction and will make any modifications / additions to the mca::Instruction.

Sound good.

This way, for targets that don't have a CB or don't need any modifications to the mca::Instruction (such as storing specific operands), the processInstruction() method can trivially return without doing anything. And for targets who do want to make modifications to the mca::Instruction, they can have a switch statement within the processInstruction() to only modify the desired instructions.

Awesome.

I mentioned in the original post that CB is currently only designed to handle adding dependencies, but not removing them. As a bonus with this design, we'd get an easy way to remove undesired dependencies (on top of the original intention to only store what we need).

Great.
However, let's just pretend for now that this is all about adding new dependencies.
We can re-evaluate the importance of this particular use case in future if we find that it is required by some code.

Edit: I just did a quick, rough, and untested re-design for the MCAOperand issue if you want a more explicit look at what this plan looks like (don't feel any pressure to look at it though, I'm just sharing in case you're curious) https://pastebin.com/rL7w0WC4

For what I can see, it looks good!

Only a couple of minor nits:

  • Protected fields of InstrPostProcess should be private (even if at the cost of adding const& getters).
  • Do we actually even need to add a derived class for x86? For X86 the base class seems already enough.

Thanks
-Andrea

holland11 added a comment.EditedJun 13 2021, 11:21 AM

Protected fields of InstrPostProcess should be private (even if at the cost of adding const& getters).

Does this hold true in general for objects that get stored in base classes, but used in derived classes? I also have the MCSubtargetInfo, SrcManager, and MCInstrInfo objects within the base CB as protected. Should I change them to private and then use getters for the derived classes to access them?

Do we actually even need to add a derived class for x86? For X86 the base class seems already enough.

We probably don't need it. I was originally going to make an empty derived class for every upstream target just so they'd have all their cmake stuff setup already for when/if people wanted to add to them. But then I just decided to keep two of them so that it would be easy for people to refer to how they are setup. Though, keeping the X86 one seems a bit misleading considering that CB doesn't even play a part in the out-of-order pipeline yet. Do you think I should remove it (the X86 derived class)?

However, let's just pretend for now that this is all about adding new dependencies.
We can re-evaluate the importance of this particular use case in future if we find that it is required by some code.

I'm really tempted to use this altered design to fix the ds_read issue that I described at the end of the original post. It would be a much cleaner example use case than the s_waitcnt 'fix' (although it wouldn't use the actual CB class at all so it wouldn't be an example of CB directly).

From a high level, all I need to do is remove the WriteState that refers to the M0 register. From a lower level, this WriteState exists in the mca::Instruction as part of SmallVector<WriteState, 2> Defs;. From looking at the way the WriteState objects interact with the RegisterFile (most notably, what happens in both pipelines when an instruction gets dispatched and needs to add its defs to the register file), I could either (1) remove the WriteState from the SmallVector, (2) set the RegisterID of the WriteState to 0, or (3) set the Latency of the WriteState to 0.

  1. As far as I can tell, there is no way to remove an element from a SmallVector. I could maybe build a new SmallVector without the specific WriteState, but this seems less than ideal.
  1. This would cause an assertion to fail within RegisterFile::addRegisterWrite. Would it be alright to change this to an if statement that exits the function instead? Edit Actually I just tried this, and there are other assertions that would need to be changed as well (one from RegisterFile::removeRegisterWrite and one from RegisterFile::onInstructionExecuted) so maybe this isn't a great option.
void RegisterFile::addRegisterWrite(WriteRef Write,
                                    MutableArrayRef<unsigned> UsedPhysRegs) {
  WriteState &WS = *Write.getWriteState();
  MCPhysReg RegID = WS.getRegisterID();
  assert(RegID && "Adding an invalid register definition?");

To

void RegisterFile::addRegisterWrite(WriteRef Write,
                                    MutableArrayRef<unsigned> UsedPhysRegs) {
  WriteState &WS = *Write.getWriteState();
  MCPhysReg RegID = WS.getRegisterID();
  if (!RegID)  return;
  1. This is my least desired route since I don't fully understand the ramifications that it would have and doesn't seem as 'explicit' as far as removing a def from an instruction goes.

If none of these options are desirable, another option could be to have a new method InstrPostProcess::skipRegDef() that gets called from InstrBuilder::createInstruction():

// Initialize writes.
unsigned WriteIndex = 0;
for (const WriteDescriptor &WD : D.Writes) {
  RegID = WD.isImplicitWrite() ? WD.RegisterID
                               : MCI.getOperand(WD.OpIndex).getReg();
  // Check if this is a optional definition that references NoReg.
  if (WD.IsOptionalDef && !RegID) {
    ++WriteIndex;
    continue;
  }

  assert(RegID && "Expected a valid register ID!");
  NewIS->getDefs().emplace_back(WD, RegID,
                                /* ClearsSuperRegs */ WriteMask[WriteIndex],
                                /* WritesZero */ IsZeroIdiom);
  ++WriteIndex;
}

To

// Initialize writes.
unsigned WriteIndex = 0;
for (const WriteDescriptor &WD : D.Writes) {
  RegID = WD.isImplicitWrite() ? WD.RegisterID
                               : MCI.getOperand(WD.OpIndex).getReg();
  // Check if this is a optional definition that references NoReg.
  if ((WD.IsOptionalDef && !RegID) || IPP->skipRegDef(WD, MCI)) {
    ++WriteIndex;
    continue;
  }

  assert(RegID && "Expected a valid register ID!");
  NewIS->getDefs().emplace_back(WD, RegID,
                                /* ClearsSuperRegs */ WriteMask[WriteIndex],
                                /* WritesZero */ IsZeroIdiom);
  ++WriteIndex;
}

The more I think about it, the more I like the last option best. What do you think?

foad added a comment.Jun 14 2021, 2:48 AM

However, this pass is dealing with ‘psuedo’ instructions, whereas in mca, we are dealing with ‘real’ instructions. This is relevant because many of the flags that the logic depends on (such as mayLoad, mayStore, and even some target specific flags) are not always shared between the ‘pseudo’ and ‘real’ instructions. This makes it really difficult to determine with certainty which instructions interact with which counters.

In most cases the reason that some flags are not set accurately on Real instructions is just that there has been no need for them. Copying them across from the corresponding Pseudo instruction is almost a no-brainer. I have done some of this before in D99187 and am happy to do more.

However, ds_read_u8 is modeled as having both a read and a write to the M0 register

I'm surprised by this. I don't see anything in the tablegen files to say that DS instructions write to M0.

Protected fields of InstrPostProcess should be private (even if at the cost of adding const& getters).

Does this hold true in general for objects that get stored in base classes, but used in derived classes? I also have the MCSubtargetInfo, SrcManager, and MCInstrInfo objects within the base CB as protected. Should I change them to private and then use getters for the derived classes to access them?

In my code, I tend to use private instead of protected whenever possible. This is just my coding style though. I don't think that LLVM imposes a particular coding style for it. Since mca has been mainly written by me, you'll find out that protected is almost never used in practice.

That being said, it is fine for you to use protected for your new class. It is your code, and -strictly speaking - there is nothing fundamentally wrong with it.

Do we actually even need to add a derived class for x86? For X86 the base class seems already enough.

We probably don't need it. I was originally going to make an empty derived class for every upstream target just so they'd have all their cmake stuff setup already for when/if people wanted to add to them. But then I just decided to keep two of them so that it would be easy for people to refer to how they are setup. Though, keeping the X86 one seems a bit misleading considering that CB doesn't even play a part in the out-of-order pipeline yet. Do you think I should remove it (the X86 derived class)?

Yes please.

In case, any X86 specific changes should go into a separate patch.

However, let's just pretend for now that this is all about adding new dependencies.
We can re-evaluate the importance of this particular use case in future if we find that it is required by some code.

I'm really tempted to use this altered design to fix the ds_read issue that I described at the end of the original post. It would be a much cleaner example use case than the s_waitcnt 'fix' (although it wouldn't use the actual CB class at all so it wouldn't be an example of CB directly).

From a high level, all I need to do is remove the WriteState that refers to the M0 register. From a lower level, this WriteState exists in the mca::Instruction as part of SmallVector<WriteState, 2> Defs;. From looking at the way the WriteState objects interact with the RegisterFile (most notably, what happens in both pipelines when an instruction gets dispatched and needs to add its defs to the register file), I could either (1) remove the WriteState from the SmallVector, (2) set the RegisterID of the WriteState to 0, or (3) set the Latency of the WriteState to 0.

I don't like the idea of using your new framework to workaround problems/bugs that are not in llvm-mca.

If you think that there is a genuine bug in their instruction definitions, then please raise a bug for it.

We should never "fix" non-mca issues within mca. This is true in general, and it doesn't only apply to mca. If there is a bug in someone else's code, we create a bug report for it, and optionally we send a patch for it. We don't pretend that the issue doesn't exist, and (worse) hack some workarounds for those bugs in our code.

So, I strongly recommend to raise bugs for all these problems you have encountered when testing the AMDGPU models.
If there is a reason why there is a write to M0, then we can consider implementing a solution based on your new class/post-processing step.
If instead it that was an oversight, then it will be fixed outside of mca without the need for a custom hack in mca.

  1. As far as I can tell, there is no way to remove an element from a SmallVector. I could maybe build a new SmallVector without the specific WriteState, but this seems less than ideal.
  1. This would cause an assertion to fail within RegisterFile::addRegisterWrite. Would it be alright to change this to an if statement that exits the function instead? Edit Actually I just tried this, and there are other assertions that would need to be changed as well (one from RegisterFile::removeRegisterWrite and one from RegisterFile::onInstructionExecuted) so maybe this isn't a great option.
void RegisterFile::addRegisterWrite(WriteRef Write,
                                    MutableArrayRef<unsigned> UsedPhysRegs) {
  WriteState &WS = *Write.getWriteState();
  MCPhysReg RegID = WS.getRegisterID();
  assert(RegID && "Adding an invalid register definition?");

To

void RegisterFile::addRegisterWrite(WriteRef Write,
                                    MutableArrayRef<unsigned> UsedPhysRegs) {
  WriteState &WS = *Write.getWriteState();
  MCPhysReg RegID = WS.getRegisterID();
  if (!RegID)  return;
  1. This is my least desired route since I don't fully understand the ramifications that it would have and doesn't seem as 'explicit' as far as removing a def from an instruction goes.

If none of these options are desirable, another option could be to have a new method InstrPostProcess::skipRegDef() that gets called from InstrBuilder::createInstruction():

// Initialize writes.
unsigned WriteIndex = 0;
for (const WriteDescriptor &WD : D.Writes) {
  RegID = WD.isImplicitWrite() ? WD.RegisterID
                               : MCI.getOperand(WD.OpIndex).getReg();
  // Check if this is a optional definition that references NoReg.
  if (WD.IsOptionalDef && !RegID) {
    ++WriteIndex;
    continue;
  }

  assert(RegID && "Expected a valid register ID!");
  NewIS->getDefs().emplace_back(WD, RegID,
                                /* ClearsSuperRegs */ WriteMask[WriteIndex],
                                /* WritesZero */ IsZeroIdiom);
  ++WriteIndex;
}

To

// Initialize writes.
unsigned WriteIndex = 0;
for (const WriteDescriptor &WD : D.Writes) {
  RegID = WD.isImplicitWrite() ? WD.RegisterID
                               : MCI.getOperand(WD.OpIndex).getReg();
  // Check if this is a optional definition that references NoReg.
  if ((WD.IsOptionalDef && !RegID) || IPP->skipRegDef(WD, MCI)) {
    ++WriteIndex;
    continue;
  }

  assert(RegID && "Expected a valid register ID!");
  NewIS->getDefs().emplace_back(WD, RegID,
                                /* ClearsSuperRegs */ WriteMask[WriteIndex],
                                /* WritesZero */ IsZeroIdiom);
  ++WriteIndex;
}

The more I think about it, the more I like the last option best. What do you think?

Assuming that it always works, Option 2. seems like the right way to go. A Write to an invalid registers should always be ignored.
That being said, let's wait to see if the issue with the write to M0 can be fixed in a different way.
If not, then it is OK to add a post-processing step for it in mca. But that would go into a separate patch anyway.

-Andrea

foad added a comment.Jun 14 2021, 7:11 AM

Another issue that I had to face was that mca (properly) enforces in-order writes on all instructions (even ones that don’t write anything such as s_waitcnt). This means that if there is a 300 cycle memory instruction currently executing, and then an s_waitcnt comes along (that isn’t meant for the 300 cycle instruction), we’ll still end up stalling for those ~300 cycles (because otherwise, the s_waitcnt instruction would finish before the memory instruction). However, mca uses a RetireOOO flag that can be set on scheduling classes within tablegen. For this reason, I modified the AMDGPU tablegen slightly to set that flag for the s_waitcnt variants. The flag is only used within mca, so (in theory) this shouldn’t have any side-effects. As mentioned earlier, if the AMDGPU folks do not want these changes, I’m happy to remove this example from the patch.

The rules are quite complicated for which types of instruction are guaranteed to complete in order with respect to which other types of instruction. The whole point of s_waitcnt is to enforce that ordering in cases where it would otherwise be unspecified. I'm not sure that a single RetireOOO flag is enough to model that complexity. It is certainly true that in most cases a simple ALU instruction can complete while there are still outstanding memory instructions.

I don't like the idea of using your new framework to workaround problems/bugs that are not in llvm-mca.

If you think that there is a genuine bug in their instruction definitions, then please raise a bug for it.

We should never "fix" non-mca issues within mca. This is true in general, and it doesn't only apply to mca. If there is a bug in someone else's code, we create a bug report for it, and optionally we send a patch for it. We don't pretend that the issue doesn't exist, and (worse) hack some workarounds for those bugs in our code.

Yeah this makes sense. I suppose I'm just trying to find different use cases where this patch could be useful in the future. That way, I can make sure that the framework is already setup in a way that facilitates that use case rather than myself or someone else needing to tweak it to do what they want. I think there are legitimate cases where we'd want to use this pattern (removing a Def from an instruction) without wanting to touch the tablegen. But this probably isn't one of those cases so I should just be using it to make sure the design is reasonable rather than trying to submit the example upstream.

In the event that I do completely remove the AMDGPU example from the patch (and then submit bug reports for all the issues I found), should I still keep an empty AMDGPUCustomBehaviour class within the patch so there is at least one example implementation? That way, if anyone wants to implement their own CB for their target, they have an example to look at for setting it up (mainly with respect to the directory structure and cmake files).

andreadb added a comment.EditedJun 14 2021, 12:14 PM

I don't like the idea of using your new framework to workaround problems/bugs that are not in llvm-mca.

If you think that there is a genuine bug in their instruction definitions, then please raise a bug for it.

We should never "fix" non-mca issues within mca. This is true in general, and it doesn't only apply to mca. If there is a bug in someone else's code, we create a bug report for it, and optionally we send a patch for it. We don't pretend that the issue doesn't exist, and (worse) hack some workarounds for those bugs in our code.

Yeah this makes sense. I suppose I'm just trying to find different use cases where this patch could be useful in the future. That way, I can make sure that the framework is already setup in a way that facilitates that use case rather than myself or someone else needing to tweak it to do what they want. I think there are legitimate cases where we'd want to use this pattern (removing a Def from an instruction) without wanting to touch the tablegen. But this probably isn't one of those cases so I should just be using it to make sure the design is reasonable rather than trying to submit the example upstream.

In the event that I do completely remove the AMDGPU example from the patch (and then submit bug reports for all the issues I found), should I still keep an empty AMDGPUCustomBehaviour class within the patch so there is at least one example implementation? That way, if anyone wants to implement their own CB for their target, they have an example to look at for setting it up (mainly with respect to the directory structure and cmake files).

As far as I understand, your CustomBehaviour plus the RetireOOO change does improve the AMDGPU simulation.
@foad, what do you think about it?

Basically, if the AMDGPU changes are a good improvement, then I still think that it makes sense to keep them as part of the patch.

About future developments:
I'd be happy to review a patch that checks for the presence of invalid writes. I have been thinking about it, and I believe that there is no harm in accepting a patch like that one. I also agree that there may be legitimate use cases for removing register definitions. So, I am fine with it. The only thing is: please, send it as a follow-up patch :)

The more I think about it, the more I believe that your framework is powerful enough to unblock a number of interesting use cases. For example, we could use it to annotate instructions based on external data generated - for example - by a profile. That way, we could refine latencies for loads; branch probabilities (if we decide to improve the simulator and enable different traces of executions); etc. These are just ideas...

Back to the patch: let see what @foad says about the new tests. If he is happy with the new numbers, then I have no further objections, and the patch can go in with the existing tests / changes to AMDGPU. If worse comes to worse, I am willing to accept your patch anyway (even if with no tests), because I think this framework is very useful in general.

-Andrea

I just spoke with Quentin, and we both agree with you. If it's alright with @foad, we'd like to push the current patch with the AMDGPU example (after I finish making the modifications Andrea suggested in his first comment). Afterwards, I will submit a bug report regarding the relevant flags not being transferred from pseudo -> real instructions. This will allow me to clean up the cnt detection logic and make it a lot more similar to the logic found in the si-insert-waitcnts pass (which will make the s_waitcnt modelling more accurate).

I'm surprised by this. I don't see anything in the tablegen files to say that DS instructions write to M0.

This is interesting. The MCInst for DS_READ_U8_gfx10 definitely has a Def for M0.

		Opcode Name= DS_READ_U8_gfx10
		SchedClassID=4
		Resource Mask=0x00000000000004, Reserved=0, #Units=1, cy=1
		Resource Mask=0x00000000000008, Reserved=0, #Units=1, cy=1
		Buffer Mask=0x00000000000004
		Buffer Mask=0x00000000000008
		 Used Units=0x0000000000000c
		Used Groups=0x00000000000000
		[Def]    OpIdx=0, Latency=20, WriteResourceID=0
		[Def][V] OpIdx=4, Latency=20, WriteResourceID=0
		[Use]    OpIdx=1, UseIndex=0
		[Use][V] OpIdx=4, UseIndex=3
		MaxLatency=20

The M0 def corresponds to [Def][V] OpIdx=4, Latency=20, WriteResourceID=0 where the [V] signifies that it's a 'variadic' operand (I'm not entirely sure what this means).

I can submit a separate bug report for this if you'd like. It's not really related to the CustomBehaviour class or the s_waitcnt instruction.

I just spoke with Quentin, and we both agree with you. If it's alright with @foad, we'd like to push the current patch with the AMDGPU example (after I finish making the modifications Andrea suggested in his first comment). Afterwards, I will submit a bug report regarding the relevant flags not being transferred from pseudo -> real instructions. This will allow me to clean up the cnt detection logic and make it a lot more similar to the logic found in the si-insert-waitcnts pass (which will make the s_waitcnt modelling more accurate).

I'm surprised by this. I don't see anything in the tablegen files to say that DS instructions write to M0.

This is interesting. The MCInst for DS_READ_U8_gfx10 definitely has a Def for M0.

		Opcode Name= DS_READ_U8_gfx10
		SchedClassID=4
		Resource Mask=0x00000000000004, Reserved=0, #Units=1, cy=1
		Resource Mask=0x00000000000008, Reserved=0, #Units=1, cy=1
		Buffer Mask=0x00000000000004
		Buffer Mask=0x00000000000008
		 Used Units=0x0000000000000c
		Used Groups=0x00000000000000
		[Def]    OpIdx=0, Latency=20, WriteResourceID=0
		[Def][V] OpIdx=4, Latency=20, WriteResourceID=0
		[Use]    OpIdx=1, UseIndex=0
		[Use][V] OpIdx=4, UseIndex=3
		MaxLatency=20

The M0 def corresponds to [Def][V] OpIdx=4, Latency=20, WriteResourceID=0 where the [V] signifies that it's a 'variadic' operand (I'm not entirely sure what this means).

I can submit a separate bug report for this if you'd like. It's not really related to the CustomBehaviour class or the s_waitcnt instruction.

I think this is probably a bug in mca.

According to AMDGPUGenInstrInfo.inc, DS_READ_U8_gfx10 has four operands (of which, the first one is a definition).

This is the output from llvm-mc:

ds_read_u8 v1, v0                       ; <MCInst #6688 DS_READ_U8_gfx10
                                ;  <MCOperand Reg:468>
                                ;  <MCOperand Reg:467>
                                ;  <MCOperand Imm:0>
                                ;  <MCOperand Imm:0>
                                ;  <MCOperand Reg:314>>

That extra register operand at the end (i.e. Reg:314) is a variadic operand.

For some reasons, mca::InstrBuilder conservatively decides to assume that Reg:314 is both defined and used.
That's a bug. It should simply check if method MCInstrDesc::variadicOpsAreDefs() returns true. Otherwise, it should assume that variadic operands are not definitions.
I'll see if I can fix it tomorrow.

holland11 updated this revision to Diff 352014.Jun 14 2021, 4:28 PM

Made some changes suggested by Andrea.

  • Removed the X86CustomBehaviour class
  • Modified the MCAOperands to only be stored for specific instructions (and only specific operands from those instructions)
  • Added the InstrPostProcess class for modifying the mca::Instruction objects after they are created, but before they are added to the LoweredSequence
  • Modified the CustomBehaviour section of the docs
  • Changed some assert(RegID) lines into if (!RegID) lines that either exit the current function or continue the current loop (this will make it so that InstrPostProcess can remove instruction Defs by setting that Def's RegisterID to 0)
  • And some other small changes.
holland11 updated this revision to Diff 352021.Jun 14 2021, 5:24 PM

clang-format fix.

The patch looks good to me.
I have requested to commit the changes to the RegisterFile as a follow-up. I'd like to keep those changes separate from this initial patch (see my comment below).

Also, please wait for the feedback from @foad about the changes/improvements to the AMDGPU model.

P.s.: The issue with the extra definition of register M0 has been fixed by commit beb5213a2ee5: [MCA][InstrBuilder] Check for the presence of flag VariadicOpsAreDefs..

Thanks
-Andrea

llvm/lib/MCA/HardwareUnits/RegisterFile.cpp
325–328 ↗(On Diff #352021)

In one of my previous messages, I've explicitly asked to send this particular change in a follow-up patch.

If possible, I'd like to keep this change separate from the rest of the work on the CustomBehaviour class. It is fine if there are no tests for it. Also, you don't need to send another patch for review just for this extra register checks. You can simply commit it straight away in main as soon as the rest of this patch is in main too.

foad added a comment.Jun 15 2021, 2:56 AM

As far as I understand, your CustomBehaviour plus the RetireOOO change does improve the AMDGPU simulation.
@foad, what do you think about it?

Basically, if the AMDGPU changes are a good improvement, then I still think that it makes sense to keep them as part of the patch.

The AMDGPU improvements are nice (thanks!) but there is a fair bit of code to review in AMDGPUCustomBehaviour.cpp, so I would prefer that all the AMDGPU parts were split out into a separate patch for ease of review. In particular I don't like working around the lack of flags on Real DS instructions; I would much rather fix that properly.

But if you really want to commit this patch as-is and then review and fix the AMDGPU parts later then I won't block it.

Made the AMDGPUCustomBehaviour class empty so that we can commit the implementation in a separate patch. Reverted the if (!RegID) changes back to their original asserts (so that they can be changed in a separate patch as well).

Forgot to revert the AMDGPU scheduling model changes in the previous diff. AMDGPU should be unchanged now and should have an empty AMDGPUCustomBehaviour implementation.

Rushed my previous updated diff and it had a mistake.

Okay, the AMDGPU code should be trivial now. The AMDGPUCB should be an empty implementation and the changes to the scheduling model should all have been reverted. I will submit a new patch later on once we get the flags sorted out.

I also reverted the RegisterFile changes (changed the ifs back to asserts). This should be ready to be committed if it still looks good to everyone.

I still don't have commit access, so could you do the honours again Andrea? I will request commit access after this patch and then hopefully be able to commit the small "assert -> if" changes within the RegisterFile as part of their own patch by myself.

Patrick Holland patrickeholland@gmail.com

Thanks a lot for all your help and feedback with this.

foad added a comment.Jun 15 2021, 11:13 AM

Made the AMDGPUCustomBehaviour class empty so that we can commit the implementation in a separate patch.

Thanks. Adding the empty AMDGPU class is fine, of course.

llvm/lib/MCA/CustomBehaviour.cpp
21

Could the first argument be ArrayRef<InstRef> IssuedInst?

Could the first argument be ArrayRef<InstRef> IssuedInst?

The place that this argument comes from (in the in-order pipeline) is

class InOrderIssueStage final : public Stage {
  ...
  /// Instructions that were issued, but not executed yet.
  SmallVector<InstRef, 4> IssuedInst;
  ...
  ...
};

I didn't setup that SmallVector and I don't know enough about how those data structures work to know if it would be smart to change it to an ArrayRef or have it get casted into an ArrayRef on its way into the checkCustomHazard function (if that's even a possible cast).

I would guess the answer to your question is no, but you probably have more familiarity with these data structures than I do so you probably know better.

andreadb accepted this revision.Jun 15 2021, 11:45 AM

I am happy to commit this patch for you once you address the minor comment from @foad about the SmallVector.

You can then send the other change to the RegisterFile for review (that would be straightforward), and the rest of the changes to AMDGPU.

Thanks for working on this!

P.s.: if you plan to do more work on llvm then it might be worthy to request commit access.

llvm/lib/MCA/CustomBehaviour.cpp
21

Yeah, it should be immutable.

This revision is now accepted and ready to land.Jun 15 2021, 11:45 AM

Yeah, it should be immutable.

Does this mean that I should make that change? So I can pass a SmallVector into a function that expects an ArrayRef? (And this just makes it so that the SmallVector is now immutable?)

Could the first argument be ArrayRef<InstRef> IssuedInst?

The place that this argument comes from (in the in-order pipeline) is

class InOrderIssueStage final : public Stage {
  ...
  /// Instructions that were issued, but not executed yet.
  SmallVector<InstRef, 4> IssuedInst;
  ...
  ...
};

I didn't setup that SmallVector and I don't know enough about how those data structures work to know if it would be smart to change it to an ArrayRef or have it get casted into an ArrayRef on its way into the checkCustomHazard function (if that's even a possible cast).

If method checkCustomHazard is not expected to add/remove and/or modify any elements in IssuedInst, then that SmallVector can simply be passed as an ArrayRef.
You don't need to worry about changing how method checkCustomHazard is called, because there is a very convenient constructor in ArrayRef which accepts a SmallVector (https://llvm.org/doxygen/ArrayRef_8h_source.html#l00090).
So, you simply change the signature of your method, and everything should work fine.

Note that by passing a SmallVector as an ArrayRef, you prevent functions from modifying its content. The internal array-like structure is effectively immutable (there are only getters).

There are variations to the concept of "array reference". For example, class MutableArrayRef lets you mutate elements (but not add/remove existing elements). So the array size would still be fixed.

For maximum flexibility, a SmallVector can always be passed around as a SmallVectorImpl. The main difference between SmallVector and SmallVectorImpl, is that SmallVectorImpl doesn't require that you specify the default (template) number of inline elements from the declaration.

I suggest to have a quick look at the docs when you have time. Some of those classes are used very often in llvm. ArrayRef in particular is a very very useful class.

I hope this makes sense.

That makes a lot of sense, thank you. Submitting new diff now.

Changed function signature for checkCustomHazard to take ArrayRef<InstRef> rather than const SmallVector<InstRef, 4> &.

This revision was landed with ongoing or failed builds.Jun 15 2021, 1:32 PM
This revision was automatically updated to reflect the committed changes.

I am seeing failures on buildbots that don't build the AMDGPU backend.
So I am going to revert it for now.

This is one of the errors I am seeing:

FAILED: bin/llvm-mca 
: && /usr/bin/g++ -fPIC -fno-semantic-interposition -fvisibility-inlines-hidden -Werror=date-time -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wno-missing-field-initializers -pedantic -Wno-long-long -Wimplicit-fallthrough -Wno-maybe-uninitialized -Wno-class-memaccess -Wno-redundant-move -Wno-pessimizing-move -Wno-noexcept-type -Wdelete-non-virtual-dtor -Wsuggest-override -Wno-comment -Wmisleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -O3 -DNDEBUG -fuse-ld=gold    -Wl,-O3 -Wl,--gc-sections tools/llvm-mca/CMakeFiles/llvm-mca.dir/llvm-mca.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/CodeRegion.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/CodeRegionGenerator.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/PipelinePrinter.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/BottleneckAnalysis.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/DispatchStatistics.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/InstructionInfoView.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/InstructionView.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/RegisterFileStatistics.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/ResourcePressureView.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/RetireControlUnitStatistics.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/SchedulerStatistics.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/SummaryView.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/TimelineView.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/View.cpp.o -o bin/llvm-mca  -Wl,-rpath,"\$ORIGIN/../lib"  lib/libLLVMX86AsmParser.a  lib/libLLVMX86Desc.a  lib/libLLVMX86Disassembler.a  lib/libLLVMX86Info.a  lib/libLLVMMCA.a  lib/libLLVMMC.a  lib/libLLVMMCParser.a  lib/libLLVMSupport.a  -lpthread  lib/libLLVMMCDisassembler.a  lib/libLLVMMC.a  lib/libLLVMBinaryFormat.a  lib/libLLVMDebugInfoCodeView.a  lib/libLLVMSupport.a  -lrt  -ldl  -lpthread  -lm  /usr/lib/x86_64-linux-gnu/libz.so  lib/libLLVMDemangle.a && :
tools/llvm-mca/CMakeFiles/llvm-mca.dir/llvm-mca.cpp.o:llvm-mca.cpp:function createCustomBehaviour(llvm::Triple const&, llvm::MCSubtargetInfo const&, llvm::mca::SourceMgr const&, llvm::MCInstrInfo const&) [clone .localalias]: error: undefined reference to 'llvm::mca::AMDGPUCustomBehaviour::AMDGPUCustomBehaviour(llvm::MCSubtargetInfo const&, llvm::mca::SourceMgr const&, llvm::MCInstrInfo const&)'
collect2: error: ld returned 1 exit status
FAILED: bin/llvm-mca 
: && /usr/local/bin/c++  -fPIC -fvisibility-inlines-hidden -Werror=date-time -Werror=unguarded-availability-new -Wall -Wextra -Wno-unused-parameter -Wwrite-strings -Wcast-qual -Wmissing-field-initializers -pedantic -Wno-long-long -Wc++98-compat-extra-semi -Wimplicit-fallthrough -Wcovered-switch-default -Wno-noexcept-type -Wnon-virtual-dtor -Wdelete-non-virtual-dtor -Wsuggest-override -Wstring-conversion -Wmisleading-indentation -fdiagnostics-color -ffunction-sections -fdata-sections -O3 -DNDEBUG  -Wl,-rpath-link,/home/tcwg-buildslave/worker/clang-cmake-armv7-quick/stage1/./lib  -Wl,-O3 -Wl,--gc-sections tools/llvm-mca/CMakeFiles/llvm-mca.dir/llvm-mca.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/CodeRegion.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/CodeRegionGenerator.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/PipelinePrinter.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/BottleneckAnalysis.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/DispatchStatistics.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/InstructionInfoView.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/InstructionView.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/RegisterFileStatistics.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/ResourcePressureView.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/RetireControlUnitStatistics.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/SchedulerStatistics.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/SummaryView.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/TimelineView.cpp.o tools/llvm-mca/CMakeFiles/llvm-mca.dir/Views/View.cpp.o  -o bin/llvm-mca  -Wl,-rpath,"\$ORIGIN/../lib"  lib/libLLVMARMAsmParser.a  lib/libLLVMAArch64AsmParser.a  lib/libLLVMARMDesc.a  lib/libLLVMAArch64Desc.a  lib/libLLVMARMDisassembler.a  lib/libLLVMAArch64Disassembler.a  lib/libLLVMARMInfo.a  lib/libLLVMAArch64Info.a  lib/libLLVMMCA.a  lib/libLLVMMC.a  lib/libLLVMMCParser.a  lib/libLLVMSupport.a  -lpthread  lib/libLLVMARMDesc.a  lib/libLLVMARMInfo.a  lib/libLLVMARMUtils.a  lib/libLLVMAArch64Desc.a  lib/libLLVMAArch64Info.a  lib/libLLVMAArch64Utils.a  lib/libLLVMMCDisassembler.a  lib/libLLVMMC.a  lib/libLLVMBinaryFormat.a  lib/libLLVMDebugInfoCodeView.a  lib/libLLVMSupport.a  -lrt  -ldl  -lpthread  -lm  /usr/lib/arm-linux-gnueabihf/libz.so  /usr/lib/arm-linux-gnueabihf/libtinfo.so  lib/libLLVMDemangle.a && :
tools/llvm-mca/CMakeFiles/llvm-mca.dir/llvm-mca.cpp.o: In function `main':
llvm-mca.cpp:(.text.main+0x2a40): undefined reference to `llvm::mca::AMDGPUCustomBehaviour::AMDGPUCustomBehaviour(llvm::MCSubtargetInfo const&, llvm::mca::SourceMgr const&, llvm::MCInstrInfo const&)'
clang-12: error: linker command failed with exit code 1 (use -v to see invocation)

The problem is that llvm-mca.cpp includes "lib/AMDGPU/AMDGPUCustomBehaviour.h".
However, according to the new CMakeLists.txt, that directory is not build if this check fails:

if (LLVM_TARGETS_TO_BUILD MATCHES "AMDGPU")```.

So that explains why we are only seeing failures on buildbots that don't build the AMDGPU target.

I am afraid that you need a different mechanism for getting target specific info...
This comment was removed by holland11.
holland11 added a comment.EditedJun 15 2021, 1:56 PM

Good call. I'll work something out and then update the diff once I've tested it properly.

holland11 updated this revision to Diff 352274.EditedJun 15 2021, 3:46 PM

Quentin came up with a solution to the build problem and I implemented it. We define macros within /llvm-mca/lib/CMakeLists.txt based on which targets are being built. Then we use those macros within llvm-mca.cpp to guard the includes and references to target specific CBs.

Tested by building for all targets and by building for only x86.

Should be good for another commit attempt if you're alright with that change.

holland11 updated this revision to Diff 352374.Jun 16 2021, 1:55 AM

Restructured the way the target macros were being set in the cmake files so that they are properly scoped.

Thanks!
It must have been pretty annoying to have to address that issue. I am glad that you were able to find a workaround for it.

Going back to the AMDGPU changes (sorry if I reopen the discussion).
I noticed that @foad has committed a sequence of patches aiming at improving llvm-mca simulation (thanks @foad !).
Do you know if in future we will still need a CustomBehaviour class for AMDGPU?

In general, it think that it is valuable to have all your initial patch in tree.
People that want to add CustomBehaviour classes for their own targets would encounter the same problems that you had to solve with those macro definitions. Having an example that shows how to address that issue is very useful in general.

foad added a comment.Jun 16 2021, 7:25 AM

Going back to the AMDGPU changes (sorry if I reopen the discussion).
I noticed that @foad has committed a sequence of patches aiming at improving llvm-mca simulation (thanks @foad !).
Do you know if in future we will still need a CustomBehaviour class for AMDGPU?

I think we will still need the class. My patches only remove the need for AMDGPUCustomBehaviour::manuallySetWaitCntInfo with its long list of DS instruction opcodes.

Going back to the AMDGPU changes (sorry if I reopen the discussion).
I noticed that @foad has committed a sequence of patches aiming at improving llvm-mca simulation (thanks @foad !).
Do you know if in future we will still need a CustomBehaviour class for AMDGPU?

I think we will still need the class. My patches only remove the need for AMDGPUCustomBehaviour::manuallySetWaitCntInfo with its long list of DS instruction opcodes.

Thanks for clarifying it.

In which case, the patch still LGTM.
Hopefully this time I won't need to revert it :-p

Cheers
-Andrea

This revision was landed with ongoing or failed builds.Jun 16 2021, 8:57 AM