Diff Detail
Event Timeline
Hi Matt,
thank you for looking into this!
Please take a look at a few comments below, the ones starting with "A nitpick:" are just notes, we don't have to discuss or fix them. The others are more interesting.
Thanks,
Roman
lib/CodeGen/GlobalISel/LegalizerHelper.cpp | ||
---|---|---|
767 | A nitpick: at the callee side MoreTy name is used, which seems fitting. Maybe sync the implementation with it? | |
792 | A nitpick: .getReg(0) here as well may be? | |
793 | I'm looking at test_and_v3s8 below, and thinking, maybe if we do this via an unmerge / merge pair, the results would be better, as these ops appear to be more likely to disappear as artifacts than such an insert, that splices a vector into another vector. The intermediate "scalar" size doesn't have to be scalar, but could be a vector with a number of lanes being GCD of the original and target number of lines, like v2s16 = G_IMPLICIT_DEF If it is a scalar, the "merge" component will be a G_BUILD_VECTOR, of course. What do you think? | |
lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | ||
157 | The largest scalar size of a vector this rule could possibly fire on is 10 bits, or if we stick to the powers of 2, 8 bits (1 vectors aren't supported, 2 vector is even => 3 is the smallest number of lanes. To be small, has to be 31 bits wide or less => 10 bit is the largest scalar size). So it would turn V3S8 to V4S8 for instance. Which is not legal, not scalar, not wider than 32 bits, not odd -> it will be scalarized, and then the scalars widened to S32. So, the resulting ops if the type we start with is V3S8 are 4 32-bit ops (4 x S32), one of which is on undef, unmerge / merge pair, 4 32-bit exts, and 4 32-bit truncs. It's either worse or the same as it would be if we just remove this moreElementsIf rule completely, I suspect. Maybe you have intended making V4S8 legal? After all, these are bitwise ops with lanes being independent. | |
test/CodeGen/AMDGPU/GlobalISel/legalize-and.mir | ||
220–221 | This test wasn't legalized at all. I suspect this is because fewerElementsVectorBasic, being called to narrow 5 vector to 3 vector here, does this: const unsigned NarrowSize = NarrowTy.getSizeInBits(); // 96 const unsigned Size = DstTy.getSizeInBits(); // 160 const int NumParts = Size / NarrowSize; // 1 const LLT EltTy = DstTy.getElementType(); // s32 const unsigned EltSize = EltTy.getSizeInBits(); // 32 const unsigned BitsForNumParts = NarrowSize * NumParts; // 96 // Check if we have any leftovers. If we do, then only handle the case where // the leftover is one element. if (BitsForNumParts != Size && BitsForNumParts + EltSize != Size) // 96 != 160 && 96 + 32 != 160 return UnableToLegalize; The above works for v3s32, but not for any more elements of an odd number. | |
346 | Same, this is not legalized at all. |
Address comments
lib/CodeGen/GlobalISel/LegalizerHelper.cpp | ||
---|---|---|
793 | I think this requires more experimentation. It creates a bunch more intermediate operations and vregs. The common case should be 3->4 elements, with matching offsets which should fold away easily. For now most of the legalizations for merge/unmerge are missing so I think this might not work around at the moment. | |
lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | ||
157 | At this point I'm not expecting any of the legalization actions for sub-32 bit types to be perfectly accurate. Most of the legalizations are missing, and I'm not sure what the optimal end state looks like yet (such as whether it's better to promote elements first or split the vector). For now I've just been putting placeholders that will make use of what's implemented. My intention was to legalize from 3->4 elements for this. When there are actual optimizations, the undef 4th component should be eliminated if this does end up promoted to 32-bits. We have legal s16 and v2s16 ops on some subtargets, so it makes sense to treat those as just rounded to integer sized vectors. I'm less sure what should happen for 8-bit elements at this point, but for now I'm trying to keep them packed and follow along with the 16-bit vectors. |
lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp | ||
---|---|---|
157 | We also have an encoding on some subtargets that allows free access to any byte offset with an implicit sext/zext and I haven't considered if we should make more s8 operations legal as a result. |
Thanks for addressing the comments.
Hopefully, we will make fewerElementsVectorBasic more solid than it is now (being able to handle its inputs when originalNumberOfElements % requestedNumberOfElements is either 0 or 1 is hard to call a consistent and expected behavior), but for now working around is probably good enough.
A nitpick: at the callee side MoreTy name is used, which seems fitting. Maybe sync the implementation with it?