Pack the address registers in the legalizer. Avoid introducing a huge
family of new intermediate operations by filling dead operands with
noreg.
Details
Diff Detail
Event Timeline
Why is this conceptually a legalization rather than part of the instruction selection?
The register layout packing/unpacking code should be exposed to the legalizer and combines. Doing it later means it could possibly be in a waterfall loop, and won't benefit from the combines on the pack/unpack code.
More abstractly, selection should operate on the types expected for the final instruction. It's the legalizer's job to get registers that are the right type/size
The part that makes me uncomfortable here and I couldn't quite put a finger on initially is that it implies a change in the semantics of the intrinsic. What this and related changes are doing is implicitly changing the design such that the intrinsics mean something different before vs. after legalization. That is not what legalization is usually supposed to do.
I see your point about combiner passes -- so could we perhaps just select the image instructions early instead? It's not like we have very interesting things happening for image instructions in the ISel patterns anyway, and one of the benefits of the GlobalISel infrastructure is that it's supposed to be flexible enough for stuff like that...
Ideally we should have separate intermediate image opcodes. However, I have no interest in creating another giant set of image opcodes that will need more searchable tables. I don't think it's really a semantic change, and it should be possible to go backwards from the legalized intrinsic to what it was originally. What I am worried about is what happens if you run the legalizer twice, which should be OK but I'm ignoring this problem for now until I have the full selection working. Two options I've considered are re-using some of the extra bits in one of the immediate arguments to encode how it's been changed, or to use a variadic wrapper instruction which will just capture the intrinsic ID and how it was legalized.
I see your point about combiner passes -- so could we perhaps just select the image instructions early instead? It's not like we have very interesting things happening for image instructions in the ISel patterns anyway, and one of the benefits of the GlobalISel infrastructure is that it's supposed to be flexible enough for stuff like that...
I think the only reasonable place outside of selection to do this would be in RegBankSelect/applyMappingImpl, which I did consider before going with the legalizer. We still need to RegBankSelect the intrinsics, and possibly move them into a waterfall loop, and I think introducing real register class constraints earlier is generally undesirable. Eventually, we should run some combiner pass after which should take care of packing code. We could theoretically trick RegBankSelect into handling the selected instructions, but that would also be pretty disgusting. Another small issue with this is it sort of implies making the full set of legalization artifacts legal for <3 x f16> cases. I started working towards this in D72639, but I don't really like it and would prefer if these were all legalized to <4 x s16> during legalization. I think we could use the artifact combiner to eliminate the illegal register types in RegBankSelect, but I don't think that should be mandatory. We claim some of these these are legal today, but that's mostly a hack to deal with missing features in the legalizer.
Big "no" on another set of image opcodes from my side as well.
I see your point about combiner passes -- so could we perhaps just select the image instructions early instead? It's not like we have very interesting things happening for image instructions in the ISel patterns anyway, and one of the benefits of the GlobalISel infrastructure is that it's supposed to be flexible enough for stuff like that...
I think the only reasonable place outside of selection to do this would be in RegBankSelect/applyMappingImpl, which I did consider before going with the legalizer. We still need to RegBankSelect the intrinsics, and possibly move them into a waterfall loop, and I think introducing real register class constraints earlier is generally undesirable. Eventually, we should run some combiner pass after which should take care of packing code. We could theoretically trick RegBankSelect into handling the selected instructions, but that would also be pretty disgusting. Another small issue with this is it sort of implies making the full set of legalization artifacts legal for <3 x f16> cases. I started working towards this in D72639, but I don't really like it and would prefer if these were all legalized to <4 x s16> during legalization. I think we could use the artifact combiner to eliminate the illegal register types in RegBankSelect, but I don't think that should be mandatory. We claim some of these these are legal today, but that's mostly a hack to deal with missing features in the legalizer.
Part of the problem here is that the final machine instructions have register class constraints in the first place. I've been wondering for some time now what register classes even buy us in the end. It seems to me that they're largely useless, and almost everything we need from a conceptual point of view is contained in the register banks. The few complications around M0, SCC, VCC, could be dealt with explicitly since we largely shouldn't allocate them using a generic approach anyway.
Is there a way to just relax those constraints for a select subset of opcodes, like the image opcodes?
The register classes aren't really useless, and we do have a variety of more exotic operand constraints to deal with. We need them to represent cases like operands that don't support M0/exec/whatever, and cases like SReg_96 only supporting one 64-bit subregister. VCC is also a completely normal, allocatable register. The class constraints don't even necessarily matter when allocating, but when folding copies between classes.
Is there a way to just relax those constraints for a select subset of opcodes, like the image opcodes?
I don't know what this really would mean
I see your point about combiner passes -- so could we perhaps just select the image instructions early instead? It's not like we have very interesting things happening for image instructions in the ISel patterns anyway, and one of the benefits of the GlobalISel infrastructure is that it's supposed to be flexible enough for stuff like that...
I think the only reasonable place outside of selection to do this would be in RegBankSelect/applyMappingImpl, which I did consider before going with the legalizer. We still need to RegBankSelect the intrinsics, and possibly move them into a waterfall loop, and I think introducing real register class constraints earlier is generally undesirable. Eventually, we should run some combiner pass after which should take care of packing code. We could theoretically trick RegBankSelect into handling the selected instructions, but that would also be pretty disgusting. Another small issue with this is it sort of implies making the full set of legalization artifacts legal for <3 x f16> cases. I started working towards this in D72639, but I don't really like it and would prefer if these were all legalized to <4 x s16> during legalization. I think we could use the artifact combiner to eliminate the illegal register types in RegBankSelect, but I don't think that should be mandatory. We claim some of these these are legal today, but that's mostly a hack to deal with missing features in the legalizer.
Part of the problem here is that the final machine instructions have register class constraints in the first place. I've been wondering for some time now what register classes even buy us in the end. It seems to me that they're largely useless, and almost everything we need from a conceptual point of view is contained in the register banks. The few complications around M0, SCC, VCC, could be dealt with explicitly since we largely shouldn't allocate them using a generic approach anyway.
The register classes aren't really useless, and we do have a variety of more exotic operand constraints to deal with. We need them to represent cases like operands that don't support M0/exec/whatever, and cases like SReg_96 only supporting one 64-bit subregister. VCC is also a completely normal, allocatable register. The class constraints don't even necessarily matter when allocating, but when folding copies between classes.
Conversely, the straightjacket of register classes causes a lot of pain around image instructions and indirect register indexing. They also complicate every single instance of checking whether a register is SGPR or VGPR, which is something that we do quite a lot.
I don't think it's entirely honest to pretend that VCC is a completely normal, allocatable register. VCC use affects code size, which should be taken into account when allocating it. It's also special due to the interaction with VCCZ. Finally, IIRC the scoreboard in gfx10 treats VCC specially, which may also have implications (I haven't fully thought those through though).
The SReg_96 comment is interesting. Where do we end up with SReg_96 in the first place after legalization? I can only think of indirect indexing. Also, how is it different from SReg_128 not allowing you to take sub1_sub2 as a subregister?
Is there a way to just relax those constraints for a select subset of opcodes, like the image opcodes?
I don't know what this really would mean
The thing under discussion here from my perspective is that it's awkward to overload the semantics of image intrinsics in the way that this and related changes are doing, and the question was why we can't just directly go to the final image instructions. One aspect of this is that you'd have a non-generic machine instruction refering to register that don't have a register class, for a couple of passes at least. That doesn't seem too crazy to me.
These aren't really constraints, and are merely optimization hints. Trying to treat VCC as different (or only as a physical register) can only penalize code in all of these situations. VCCZ is effectively an alias, and we don't try to make use of it currently. None of these issues impact instruction selection.
The SReg_96 comment is interesting. Where do we end up with SReg_96 in the first place after legalization? I can only think of indirect indexing. Also, how is it different from SReg_128 not allowing you to take sub1_sub2 as a subregister?
Copies involving physical registers, and also inline asm. It's a question of composing subregisters. If you want a sub0_sub1 of an SReg_96, you may have to copy to get a properly aligned register pair. The way we do calling convention lowering today happens to avoid this in normal cases, but I don't want to rely on this kind of behavior. We'll currently get SReg_96 from 96-bit phis, but we could also start legalizing these into 32-bit pieces. The register class constraints aren't directly relevant to this specific problem, as the main reason I want to defer selection from here in the first place is we don't even have the register bank yet. I could start directly selecting in RegBankSelect, but I don't think that's optimal either.
Is there a way to just relax those constraints for a select subset of opcodes, like the image opcodes?
I don't know what this really would mean
The thing under discussion here from my perspective is that it's awkward to overload the semantics of image intrinsics in the way that this and related changes are doing, and the question was why we can't just directly go to the final image instructions. One aspect of this is that you'd have a non-generic machine instruction refering to register that don't have a register class, for a couple of passes at least. That doesn't seem too crazy to me.
I'm leaning towards inventing what is essentially a custom G_INTRINSIC type to track the legalization of the awkward cases. The important information will still be tracked by preserving the intrinsic ID operand, but the operands will be changed as here. I think this only requires a small number of wrapper operations (I think 1, but maybe 4 at most). The current intermediate DAG nodes seem to get away with just _d16 variants for dealing with the annoying unpacked register layout case.
Rebase and use observer. I think I can put off a wrapper op a bit longer until the selection patch
Fix producing G_CONCAT_VECTORS with single source, which should probably be illegal but ends up selecting just fine
The thing under discussion here from my perspective is that it's awkward to overload the semantics of image intrinsics in the way that this and related changes are doing, and the question was why we can't just directly go to the final image instructions. One aspect of this is that you'd have a non-generic machine instruction refering to register that don't have a register class, for a couple of passes at least. That doesn't seem too crazy to me.
I'm leaning towards inventing what is essentially a custom G_INTRINSIC type to track the legalization of the awkward cases. The important information will still be tracked by preserving the intrinsic ID operand, but the operands will be changed as here. I think this only requires a small number of wrapper operations (I think 1, but maybe 4 at most). The current intermediate DAG nodes seem to get away with just _d16 variants for dealing with the annoying unpacked register layout case.
I like this idea. I can see how this could be considered a change that is separate from this change, so this one LGTM.
llvm/lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | ||
---|---|---|
3126 | Yes, store instructions don't support TFE to the best of my knowledge. Store instructions can still be used on images that are partially resident, but they simply become no-ops if the destination address isn't mapped. |
Yes, store instructions don't support TFE to the best of my knowledge.
Store instructions can still be used on images that are partially resident, but they simply become no-ops if the destination address isn't mapped.