TLDR:
llvm-mca’s simulations are only as good as the information available to it: some instructions cannot be expressed well enough within the target’s scheduling model (or the MCInst layer) for llvm-mca to simulate their behaviour properly, and that lack of expressiveness eventually leads to inaccurate analysis. The ideal solution is to modify the scheduling model, but that’s not always a viable strategy. Maybe other parts of the backend depend on that instruction being modelled the way that it is. Or maybe the instruction is quite complex and it’s difficult to fully capture its behaviour with tablegen. The CustomBehaviour class (which I will refer to as CB frequently) is designed to provide intuitive scaffolding for developers to implement the correct modelling for these instructions; most commonly, it allows targets to customise data dependencies and model stalls that cannot be predicted simply by evaluating register defs and uses.
Some quick notes before I get into some of the implementation details:
I built this class specifically for a downstream architecture, but I tried to design it in a way that would be useful for all targets. Since I wanted to be able to submit these changes upstream, I found an instruction from the AMDGPU instruction set that is used quite frequently but isn’t currently modelled properly (resulting in significantly inaccurate analysis from llvm-mca). My knowledge of AMDGPU is limited, and there is also an issue with flags not being set correctly on many of the relevant AMDGPU instructions, so my implementation isn’t perfect. However, it should still provide an example use case for CB and also an example of how someone could use the class. If any AMDGPU developers would like to improve my implementation, I encourage you to do so. And on that note, if the AMDGPU folks don’t want this at all, we can definitely remove it from the patch and stick to pushing CB without any upstream examples.
This patch only integrates CB into the in-order pipeline; however, it was designed to fit into both pipelines. I’m much more familiar with the in-order pipeline (and CB fits into it a bit more intuitively), but I plan on adding it to the out-of-order pipeline in the future. For now I want to keep the scope of this patch as limited as possible, so I will make another post soon to ask for feedback and open up a discussion on integrating CB within the out-of-order pipeline.
Also, I'm Canadian so I hope the 'u' in behaviour doesn't bother anyone :).
Implementation details:
llvm-mca does its best to extract relevant register, resource, and memory information from every MCInst when lowering them to an mca::Instruction. It then uses this information to detect dependencies and simulate stalls within the pipeline. For some instructions, the information that gets captured within the mca::Instruction is not enough for mca to simulate them properly. In these cases, there are two main possibilities:
1. The instruction has a dependency that isn’t detected by mca.
2. mca is incorrectly enforcing a dependency that shouldn’t exist.
For the rest of this discussion, I will be focusing on (1), but I have put some thought into (2) and I may revisit it in the future.
So we have an instruction that has dependencies that aren’t picked up by mca. The basic idea for both pipelines in mca is that when an instruction wants to be dispatched, we first check for register hazards and then we check for resource hazards. This is where CB is injected. If no register or resource hazards have been detected, we make a call to CustomBehaviour::checkCustomHazard() to give the target specific CB the chance to detect and enforce any custom dependencies.
The return value of checkCustomHazard() is an unsigned int representing the (minimum) number of cycles that the instruction needs to stall for. It’s fine to underestimate this value because, when StallCycles gets down to 0, we’ll end up checking for all the hazards again before the instruction is actually dispatched. However, it’s important not to overestimate the value, and the more accurate your estimate is, the more efficient mca’s execution can be.
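To make this concrete, here is a rough sketch of what a target-specific CB might look like. The class name, the MyTarget::FOO_* opcodes, the constructor shape, and the exact checkCustomHazard() signature are all illustrative (they mirror the description above rather than quoting the patch verbatim):

class MyTargetCustomBehaviour : public CustomBehaviour {
public:
  // The base class holds references to the target-specific objects
  // (MCSubtargetInfo and MCInstrInfo) mentioned below.
  MyTargetCustomBehaviour(const MCSubtargetInfo &STI,
                          const MCInstrInfo &MCII)
      : CustomBehaviour(STI, MCII) {}

  unsigned checkCustomHazard(ArrayRef<InstRef> IssuedInst,
                             const InstRef &IR) override {
    // Hypothetical rule: a FOO_BARRIER cannot dispatch while any
    // FOO_OP is still executing.
    if (IR.getInstruction()->getOpcode() != MyTarget::FOO_BARRIER)
      return 0;
    for (const InstRef &E : IssuedInst)
      if (E.getInstruction()->getOpcode() == MyTarget::FOO_OP)
        return 1; // Underestimating is safe; hazards get re-checked.
    return 0; // Stall for 0 cycles, i.e. no hazard detected.
  }
};

Returning 1 rather than computing the exact remaining latency leans on the underestimation property described above: once StallCycles reaches 0, all of the hazard checks run again anyway.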
In general, for checkCustomHazard() to be able to detect these custom dependencies, it needs information about the current instruction and also about all of the instructions that are still executing within the pipeline. The mca pipeline uses mca::Instruction rather than MCInst, and the information currently encoded within each mca::Instruction isn’t sufficient for my use cases. I had to add a few extra attributes to the mca::Instruction class and have them set from the MCInst during instruction building. For example, the current mca::Instruction doesn’t know its opcode, and it also doesn’t know anything about its immediate operands (both of which I had to add to the class).
With information about the current instruction, a list of all currently executing instructions, and some target specific objects (MCSubtargetInfo and MCInstrInfo which the base CB class has references to), developers should be able to detect and enforce most custom dependencies within checkCustomHazard. If you need more information than is present in the mca::Instruction, feel free to add attributes to that class and have them set during the lowering sequence from MCInst.
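As an illustration of the kind of additions I mean (the field and accessor names here are mine, shown only to illustrate the pattern), the extra state on mca::Instruction could look like this, populated during the lowering from MCInst:

class Instruction : public InstructionBase {
  // ...existing register/resource/memory state...
  unsigned Opcode = 0;                // Set from MCInst::getOpcode().
  SmallVector<int64_t, 4> Immediates; // Values of immediate operands.

public:
  void setOpcode(unsigned Op) { Opcode = Op; }
  unsigned getOpcode() const { return Opcode; }

  void addImmediate(int64_t Imm) { Immediates.push_back(Imm); }
  ArrayRef<int64_t> getImmediates() const { return Immediates; }
};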
Fortunately, in the in-order pipeline, it’s very convenient for us to pass these arguments to checkCustomHazard. The hazard checking is taken care of within InOrderIssueStage::canExecute(). This function takes a const InstRef as a parameter (representing the instruction that currently wants to be dispatched) and the InOrderIssueStage class maintains a SmallVector<InstRef, 4> which holds all of the currently executing instructions. For the out-of-order pipeline, it’s a bit trickier to get the list of executing instructions and this is why I have held off on implementing it myself. This is the main topic I will bring up when I eventually make a post to discuss and ask for feedback.
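So the call site ends up being straightforward; something along these lines (simplified, with the register/resource checks reduced to placeholder helpers and StallCyclesLeft standing in for the stage’s stall counter):

bool InOrderIssueStage::canExecute(const InstRef &IR) {
  // Register and resource hazards are checked first, as before.
  if (hasRegisterHazard(IR) || hasResourceHazard(IR))
    return false;

  // Only if both are clear does CB get a chance to report a stall.
  // IssuedInst is the SmallVector<InstRef, 4> of executing instructions.
  StallCyclesLeft = CB.checkCustomHazard(IssuedInst, IR);
  return StallCyclesLeft == 0;
}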
CB is a base class where targets implement their own derived classes. If a target specific CB does not exist (or we pass in the -disable-cb flag), the base class is used. This base class trivially returns 0 from its checkCustomHazard() implementation (meaning that the current instruction needs to stall for 0 cycles aka no hazard is detected). For this reason, targets or users who choose not to use CB shouldn’t see any negative impacts to accuracy or performance (in comparison to pre-patch llvm-mca).
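For reference, the base-class implementation really is as trivial as it sounds (modulo the exact signature):

unsigned CustomBehaviour::checkCustomHazard(ArrayRef<InstRef> IssuedInst,
                                            const InstRef &IR) {
  return 0; // Never reports a custom hazard.
}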
AMDGPU Example:
In AMDGPU, there is a family of s_waitcnt instructions. The purpose of these instructions is to force a stall until certain memory operations have completed. For example, if we are loading a value into gpr5 and then using gpr5, one of the backend passes will detect this and place an s_waitcnt instruction before the use to ensure that gpr5 has been written to. These instructions interact with 4 different ‘counters’ (vmcnt, expcnt, lgkmcnt, and vscnt), and s_waitcnt can force a wait on any number of them. Each memory instruction will also interact with 1 or more of these counters: (this is an oversimplification) the instruction increments those counters when it begins executing, and then decrements them once it has written its result.
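As a (deliberately over-simplified, matching the description above) mental model of the counter mechanics:

enum Counter { VMCNT, EXPCNT, LGKMCNT, VSCNT, NUM_COUNTERS };

struct CounterState {
  unsigned Count[NUM_COUNTERS] = {0, 0, 0, 0};

  // A memory instruction bumps the counters it uses when it starts
  // executing, and drops them once its result has been written.
  void onIssue(Counter C) { ++Count[C]; }
  void onWriteback(Counter C) { --Count[C]; }

  // s_waitcnt encodes a threshold per counter; e.g. lgkmcnt(0)
  // means "stall until the lgkm counter has drained to 0".
  bool mustStall(Counter C, unsigned Threshold) const {
    return Count[C] > Threshold;
  }
};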
From a high level, when CB sees an s_waitcnt instruction, it will first decode its operand to determine what it’s waiting for. CB then needs to look at all of the currently executing instructions to determine if we need to stall or not. This is where things get a little tricky with respect to AMDGPU.
For each instruction within the pipeline, we need to determine whether that instruction interacts with any of the counters (and which ones specifically). The backend pass that I mentioned earlier (the one that adds s_waitcnt instructions during codegen) has a function that does just that. However, that pass deals with ‘pseudo’ instructions, whereas in mca we are dealing with ‘real’ instructions. This is relevant because many of the flags that the logic depends on (such as mayLoad, mayStore, and even some target specific flags) are not always shared between the ‘pseudo’ and ‘real’ instructions. This makes it really difficult to determine with certainty which instructions interact with which counters.
To combat this, I made some conservative assumptions (such as assuming mayLoad and mayStore will always be true). These assumptions result in some instructions being flagged as interacting with more counters than they actually do. I do not make any assumptions about the target specific flags though, and this results in certain instructions (such as the ds_read and ds_write variants) not being detected at all. If the AMDGPU folks would like to use llvm-mca, I highly suggest they look into having the flags properly applied to the ‘real’ instructions and then basing this stage’s logic on the SIInsertWaitcnts::updateEventWaitcntAfter() function.
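Putting those pieces together, the shape of the AMDGPU hazard check is roughly the following. decodeWaitcnt(), instrCounterUsage(), isWaitcnt(), and the CounterMask type are stand-ins for the decoding and the (conservative) flag-based classification described above:

unsigned
AMDGPUCustomBehaviour::checkCustomHazard(ArrayRef<InstRef> IssuedInst,
                                         const InstRef &IR) {
  const Instruction &Inst = *IR.getInstruction();
  if (!isWaitcnt(Inst.getOpcode()))
    return 0;

  // Which counters is this s_waitcnt waiting on? Decoded from the
  // immediate operand captured on the mca::Instruction.
  CounterMask Wait = decodeWaitcnt(Inst);

  // Stall while any executing instruction still holds one of them.
  for (const InstRef &E : IssuedInst) {
    CounterMask Used = instrCounterUsage(*E.getInstruction());
    if (Wait.overlaps(Used))
      return 1; // Minimum stall; we re-check before dispatching.
  }
  return 0;
}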
Another issue that I had to face was that mca (properly) enforces in-order writes on all instructions (even ones that don’t write anything, such as s_waitcnt). This means that if there is a 300-cycle memory instruction currently executing and then an s_waitcnt comes along (one that isn’t meant for the 300-cycle instruction), we’ll still end up stalling for those ~300 cycles (because otherwise, the s_waitcnt instruction would finish before the memory instruction). However, mca uses a RetireOOO flag that can be set on scheduling classes within tablegen. For this reason, I modified the AMDGPU tablegen slightly to set that flag for the s_waitcnt variants. The flag is only used within mca, so (in theory) this shouldn’t have any side-effects. As mentioned earlier, if the AMDGPU folks do not want these changes, I’m happy to remove this example from the patch.
With all that background information out of the way, let’s take a look at a few examples.
ds_read_u8 v1, v0
s_waitcnt expcnt(0)
v_mul_lo_u32 v5, v3, v4
s_waitcnt lgkmcnt(0)
ds_read_u8 doesn’t interact with expcnt at all, so the first s_waitcnt shouldn’t force a stall. I’m not certain how the v_mul instruction should behave as far as in-order writes are concerned, though (in AMDGPU, should that first v_mul be allowed to complete before the ds_read?). Either way, the first wait should be dispatched and executed as soon as it’s seen.
Here’s the mca timeline BEFORE this patch:
Timeline view:
                    0123456789
Index     0123456789          0123456789

[0,0]     DeeeeeeeeeeeeeeeeeeeE    .        ds_read_u8 v1, v0
[0,1]     .    .    .    .    DeE  .        s_waitcnt expcnt(0)
[0,2]     .    .    .    .    DeeeeeeeE     v_mul_lo_u32 v5, v3, v4
[0,3]     .    .    .    .    .    DeE      s_waitcnt lgkmcnt(0)
Notice how mca’s enforcement of the in-order writes causes the first s_waitcnt to stall.
Now here’s the mca timeline without CB, but with the RetireOOO flag set for all s_waitcnt instructions:
Timeline view:
                    0123456789
Index     0123456789          0

[0,0]     DeeeeeeeeeeeeeeeeeeeE    ds_read_u8 v1, v0
[0,1]     .DeE .    .    .    .    s_waitcnt expcnt(0)
[0,2]     .    .    .  DeeeeeeeE   v_mul_lo_u32 v5, v3, v4
[0,3]     .    .    . DeE  .       s_waitcnt lgkmcnt(0)
And here’s with CB:
Timeline view:
                    0123456789
Index     0123456789          012

[0,0]     DeeeeeeeeeeeeeeeeeeeE  .      ds_read_u8 v1, v0
[0,1]     .DeE .    .    .    .  .      s_waitcnt expcnt(0)
[0,2]     .    .    .  DeeeeeeeE .      v_mul_lo_u32 v5, v3, v4
[0,3]     .    .    .    .    DeE       s_waitcnt lgkmcnt(0)
The first s_waitcnt correctly doesn’t get stalled. The second s_waitcnt correctly does get stalled.
Here’s a different example (the formatting breaks in this post due to the long sequence of characters so I use screenshots for the timelines):
s_load_dwordx4 s[0:3], s[4:5], 0x0
s_waitcnt vmcnt(0) lgkmcnt(0)
s_waitcnt_vscnt null, 0x0
global_load_dword v1, v0, s[0:1] glc dlc
s_waitcnt vmcnt(0)
Here is the timeline BEFORE this patch has been applied:
https://i.imgur.com/OZQtrE1.png
You’ll notice that the simulation is actually fairly accurate for this example, but it’s accurate for the wrong reasons. The s_waitcnt instructions aren’t waiting because they know they have to wait for the specific s_load_dwordx4 and global_load_dword instructions; they are waiting because mca is enforcing the in-order write-backs. While this may produce a reasonable output in this case, it’s not good in general.
Here is the timeline AFTER this patch has been applied, but with the -disable-cb flag (so essentially what it would look like to run the example through mca with the s_waitcnt instructions having the RetireOOO flag):
https://i.imgur.com/s62cKYJ.png
You can see now that the s_waitcnt instructions are allowed to finish out-of-order, and without CB to enforce their behaviour, they don’t force a stall.
Now here’s the patch with CB enabled:
https://i.imgur.com/VMlG0DO.png
CB is able to detect that s_waitcnt vmcnt(0) lgkmcnt(0) needs to wait for s_load_dwordx4 to finish, and that s_waitcnt vmcnt(0) needs to wait for global_load_dword to finish.
Overall, due to the flags not being reliable, if the AMDGPU developers are not interested in improving / correcting my implementation, it would probably be best to remove it from the patch. I really just wanted an upstream example to show how CB could be used, and I didn’t anticipate all of the complications that arose from this specific example. llvm-mca can be a really helpful tool for many different purposes, but especially for finding mistakes in scheduling models. For example, I noticed that a sequence of ds_read instructions does not get simulated properly:
ds_read_u8 v1, v0
ds_read_u8 v2, v0 offset:1
ds_read_u8 v3, v0 offset:2
ds_read_u8 v4, v0 offset:3
ds_read_u8 v5, v0 offset:4
Timeline view:
                    0123456789          0123456789          0123456789          0123456789          0123456789
Index     0123456789          0123456789          0123456789          0123456789          0123456789          0

[0,0]     DeeeeeeeeeeeeeeeeeeeE    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    ds_read_u8 v1, v0
[0,1]     .    .    .    .    DeeeeeeeeeeeeeeeeeeeE    .    .    .    .    .    .    .    .    .    .    .    .    ds_read_u8 v2, v0 offset:1
[0,2]     .    .    .    .    .    .    .    .    DeeeeeeeeeeeeeeeeeeeE    .    .    .    .    .    .    .    .    ds_read_u8 v3, v0 offset:2
[0,3]     .    .    .    .    .    .    .    .    .    .    .    .    DeeeeeeeeeeeeeeeeeeeE    .    .    .    .    ds_read_u8 v4, v0 offset:3
[0,4]     .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    DeeeeeeeeeeeeeeeeeeeE    ds_read_u8 v5, v0 offset:4
As far as I understand, these instructions shouldn’t have to wait for each other. However, ds_read_u8 is modelled as having both a read and a write to the M0 register, which mca detects as a register hazard. I looked in the ISA and I’m fairly confident that this isn’t accurate. If you start using mca, you will likely be able to find a lot of other discrepancies such as this one. And if you do want to start using the tool, the CB class can offer you the scaffolding to implement any instruction behaviour that you can’t / don’t want to model in the tablegen itself.