This is an archive of the discontinued LLVM Phabricator instance.

GlobalISel: Implement moreElementsVector for bit ops
ClosedPublic

Authored by arsenm on Feb 12 2019, 7:51 AM.

Diff Detail

Event Timeline

arsenm created this revision.Feb 12 2019, 7:51 AM

Hi Matt,

thank you for looking into this!

Please take a look at a few comments below, the ones starting with "A nitpick:" are just notes, we don't have to discuss or fix them. The others are more interesting.

Thanks,
Roman

lib/CodeGen/GlobalISel/LegalizerHelper.cpp
767

A nitpick: at the callee side MoreTy name is used, which seems fitting. Maybe sync the implementation with it?

792

A nitpick: .getReg(0) here as well may be?

793

I'm looking at test_and_v3s8 below, and thinking, maybe if we do this via an unmerge / merge pair, the results would be better, as these ops appear to be more likely to disappear as artifacts than such an insert, that splices a vector into another vector.

The intermediate "scalar" size doesn't have to be scalar, but could be a vector with a number of lanes being GCD of the original and target number of lines, like

v2s16 = G_IMPLICIT_DEF
v2s16, v2s16, v2s16 = G_UNMERGE_VALUES v6s16
v8s16 = G_CONCAT_VECTORS v2s16, v2s16, v2s16, v2s16

If it is a scalar, the "merge" component will be a G_BUILD_VECTOR, of course.

What do you think?

lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
155

The largest scalar size of a vector this rule could possibly fire on is 10 bits, or if we stick to the powers of 2, 8 bits (1 vectors aren't supported, 2 vector is even => 3 is the smallest number of lanes. To be small, has to be 31 bits wide or less => 10 bit is the largest scalar size).

So it would turn V3S8 to V4S8 for instance. Which is not legal, not scalar, not wider than 32 bits, not odd -> it will be scalarized, and then the scalars widened to S32.

So, the resulting ops if the type we start with is V3S8 are 4 32-bit ops (4 x S32), one of which is on undef, unmerge / merge pair, 4 32-bit exts, and 4 32-bit truncs. It's either worse or the same as it would be if we just remove this moreElementsIf rule completely, I suspect.

Maybe you have intended making V4S8 legal? After all, these are bitwise ops with lanes being independent.

test/CodeGen/AMDGPU/GlobalISel/legalize-and.mir
223

This test wasn't legalized at all.

I suspect this is because fewerElementsVectorBasic, being called to narrow 5 vector to 3 vector here, does this:

const unsigned NarrowSize = NarrowTy.getSizeInBits();  // 96
const unsigned Size = DstTy.getSizeInBits();  // 160
const int NumParts = Size / NarrowSize;  // 1
const LLT EltTy = DstTy.getElementType();  // s32
const unsigned EltSize = EltTy.getSizeInBits(); // 32
const unsigned BitsForNumParts = NarrowSize * NumParts;  // 96
// Check if we have any leftovers. If we do, then only handle the case where
// the leftover is one element.
if (BitsForNumParts != Size && BitsForNumParts + EltSize != Size)  // 96 != 160 && 96 + 32 != 160
  return UnableToLegalize;

The above works for v3s32, but not for any more elements of an odd number.

341

Same, this is not legalized at all.

arsenm updated this revision to Diff 187171.Feb 17 2019, 11:38 AM
arsenm marked 4 inline comments as done.

Address comments

lib/CodeGen/GlobalISel/LegalizerHelper.cpp
793

I think this requires more experimentation. It creates a bunch more intermediate operations and vregs. The common case should be 3->4 elements, with matching offsets which should fold away easily.

For now most of the legalizations for merge/unmerge are missing so I think this might not work around at the moment.

lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
155

At this point I'm not expecting any of the legalization actions for sub-32 bit types to be perfectly accurate. Most of the legalizations are missing, and I'm not sure what the optimal end state looks like yet (such as whether it's better to promote elements first or split the vector). For now I've just been putting placeholders that will make use of what's implemented.

My intention was to legalize from 3->4 elements for this. When there are actual optimizations, the undef 4th component should be eliminated if this does end up promoted to 32-bits.

We have legal s16 and v2s16 ops on some subtargets, so it makes sense to treat those as just rounded to integer sized vectors. I'm less sure what should happen for 8-bit elements at this point, but for now I'm trying to keep them packed and follow along with the 16-bit vectors.

arsenm marked an inline comment as done.Feb 18 2019, 7:45 AM
arsenm added inline comments.
lib/Target/AMDGPU/AMDGPULegalizerInfo.cpp
155

We also have an encoding on some subtargets that allows free access to any byte offset with an implicit sext/zext and I haven't considered if we should make more s8 operations legal as a result.

rtereshin accepted this revision.Feb 18 2019, 1:51 PM

Thanks for addressing the comments.

Hopefully, we will make fewerElementsVectorBasic more solid than it is now (being able to handle its inputs when originalNumberOfElements % requestedNumberOfElements is either 0 or 1 is hard to call a consistent and expected behavior), but for now working around is probably good enough.

This revision is now accepted and ready to land.Feb 18 2019, 1:51 PM
arsenm closed this revision.Feb 19 2019, 8:29 AM

r354345