This is a pretty huge patch, with no details in the commit log.
One hour between sending the patch out and landing it is not sufficient for anyone to meaningfully
review the patch and there are no mentions of the review done anywhere else.
While the code only changes the AMDGPU back-end, that does not mean the patch should just be rubber-stamped.
It's a year of work, necessarily done downstream. Every line there was reviewed and tested downstream. I understand no one can reasonably review something this big, but I cannot break it into small patches after a year of changes and fixes. Not that I had much choice.
This is a problem because I've removed forAllLanes.
This is a hack; we should be using a different register class for cases that don't support a given subregister index, not scanning for an example non-reserved register.
To be fair to @rampitec, it was not his desire to push this up in one big patch. We needed this upstreamed, and no time was given to him to break it up into reasonably sized pieces. If it appears to be his doing/his intent, well, it should not. There have been a couple of comments; I believe most have been addressed, and comments will continue to be addressed.
Who landed the change is not particularly important.
What does matter is to make sure that shortcutting the review does not become a regular occurrence.
Stuff happens, and sometimes the standard rules may need to be bent. However, it should not be a unilateral decision by the committing side.
A better way to handle that would be to send the patch for review a few days early (you presumably had most/all of these changes made by then), provide details describing the changes (a single subject line falls a bit short), and outline your situation, explaining why the patch can't be split and reviewed pre-commit. If the changes are indeed well reviewed downstream, it would probably not pose much of a challenge to get the patch approved. If the patch does need further cleanups, at least we would have a reasonable idea how invasive they would be and could make an informed call. "Commit now, ask for forgiveness later" is not among the LLVM contribution guidelines, not for large patches. At the very minimum it should have been publicly discussed before the fact.
We have BoolFOption to generate -fsomething and -fno-something
Nit: this looks odd. GFX90A does not need to be in the middle of the list; it makes it somewhat confusing to tell which ID is really the last. The _LAST enum says it's GFX90A, but it's not the last item of the list. There are already out-of-name-order GPUs at the end of the list, so putting GFX90A at the end would probably be a better choice; at least we'd still have the numeric values in order. Right now the list is ordered neither by name nor by value.
There's also a question of whether something needs to be done about the missing values 0x3c..0x3e. Presumably the _FIRST.._LAST enums specify the range we'll use to iterate over the GPU IDs. Do we handle the missing values correctly?
Looks like it's benign at the moment, as we're only using it to return the amdgcn triple in ELFObjectFile.h. I'd add placeholder enums for the reserved/unused values within the range.
"we needed this upstream" is a business issue on AMD's side, not an issue for the llvm project. In general the expectation is that code is reviewed according to the guidelines and a single reviewer with one (small) patch that wasn't a revert doesn't feel like sufficient review for something of this size. For something this size I'd have expected Matt to at least be on the reviewer line and that also wasn't done. This feels like an abuse of the review system and probably should be reverted.
The reason we use m_amdgpu_Features_Group is that it gets transformed to target features (-mtgsplit to +tgsplit, -mno-tgsplit to -tgsplit). For example, there is a tgsplit target feature in the AMDGPU backend.
Does BoolFOption get translated to target features as well?
But it will still need to be in m_amdgpu_Features_Group because of https://github.com/llvm/llvm-project/blob/main/clang/lib/Driver/ToolChains/AMDGPU.cpp#L403, unless we want to switch away from that. So maybe make the group an optional last template parameter to BoolMOption?
I'd appreciate it if you could find a solution that does not involve reverting and reapplying later, as this will triple the amount of churn we get downstream. (I realise LLVM policy is not to care about downstream but I thought I'd plead my case anyway!)
Ideally yes. We can still use register classes for this, with special care to make sure we never end up with the unaligned virtual registers in the wrong contexts.
The less that's tracked by the instruction definitions, the more special-case code we have to write. I've been thinking of swapping out the entire MCInstrDesc table per subtarget to make this easier, although that may be a painful change.
I do not see how it can be less code. You would need to duplicate all VALU pseudos, not just real instructions, which means every time you write something like AMDGPU::FLAT_LOAD_DWORDX2 in the source you would have to write an if. For every VALU instruction.
It's less code because the code that's already there is supposed to rely on the static operand definitions. Every time we want to deviate from those, we end up writing manual code in the verifier and fixup things here and there that differ.
The point of swapping out the table would be to eliminate all the VALU pseudos. We would just have the same enum values referring to different physical instruction definitions
This makes sense, although as you said it is also quite painful, and to me it also sounds like a hack. There is still a lot of legalization needed even with this approach: every time you hit an instruction not supported by a target, you will need to do something about it, in the worst case expanding it. Sounds like another year of work, especially when you look at highly specialized ASICs which can do this but cannot do that, and you have a lot of them.
BTW, it will not help anyway, not for this problem. You may create an operand of a different RC, or you may just reserve every other register like I did; the net result will be the same: you will end up using a prohibited register. Imagine you are using an RC where only even tuples are added, and then you use the sub1_sub2 subreg of it. RA will happily allocate a forbidden register, just like it does now. To me it is an RA bug in the first place to allocate a reserved register.
The only thing which could help is another register info without odd wide subregs, but you cannot achieve that just by duplicating instruction definitions; for that you would need to duplicate the register info as well. This is almost a new back-end.
It's not a hack; this is how operand classes are intended to work. You wouldn't be producing these instructions on targets that don't support them (ideally we would also have a verifier for this, which is another area where subtarget handling is weak).
The point is to not reserve them. References to unaligned registers can exist, they just can't be used in the context of a real machine operand. D97316 switches to using dedicated classes for alignment (the further cleanup would be to have this come directly from the instruction definitions instead of fixing them up after isel).
I have replied in D97316, but I do not believe it will help as is. It will run into exactly the same issue as the reserved-registers approach and their subregs.
I have checked the commit. This code is not even related to VGPR alignment; it was about SGPRs in fact:
If a stack is present, the 4 low SGPRs are reserved. The pass was
The testcase is mem_clause_sreg256_used_stack from memory_clause.mir.