This is an archive of the discontinued LLVM Phabricator instance.

[SVE] Lower fixed length vXi8/vXi16 SDIV to scalable
ClosedPublic

Authored by cameron.mcinally on Aug 17 2020, 3:01 PM.

Details

Summary

There are no nxv16i8/nxv8i16 SDIV instructions, so these fixed width operations must be promoted to nxv4i32. This required adding an OverrideNEON flag to LowerToScalableOp(...), so that we can make use of the existing scalable lowering for the smaller vectors (e.g. v8i8).

Also notice that a new test file was needed. The existing SDIV tests live in sve-fixed-length-int-arith.ll, which uses the legalization-style FileCheck macros. But the complicated lowering of these operations really needs the "end result"-style macros. If this patch is accepted, I'll move the vXi32/vXi64 tests to this new file as well.

Diff Detail

Event Timeline

cameron.mcinally requested review of this revision.Aug 17 2020, 3:01 PM

Also notice that a new test file was needed. The existing SDIV tests live in sve-fixed-length-int-arith.ll, which uses the legalization-style FileCheck macros. But the complicated lowering of these operations really needs the "end result"-style macros. If this patch is accepted, I'll move the vXi32/vXi64 tests to this new file as well.

I don't see any obvious difference in the RUN lines?

cameron.mcinally added a comment.EditedAug 17 2020, 5:06 PM

I don't see any obvious difference in the RUN lines?

It's the VBITS_* differences. The VBITS_LE_* prefixes are better for checking the legalizations (really splitting), when the current VL > the fixed VL. The VBITS_GE_* prefixes are better for ignoring the legalizations, i.e. the current VL <= the fixed VL.

For the i8/i16 SDIV cases, the lowerings are pretty long, so it's nicer to check the cases where legalization was boring.

paulwalker-arm added inline comments.Aug 18 2020, 8:57 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8953

I just wanted to double check that you are aware this is going to result in i8/i16 fixed length vector divides being different to the i32/i64 ones. The latter being predicated with the former cases not (or rather using an "all true" predicate).

Given divides are rarely cheap I prefer the predicated route but I guess there's no reason to be consistent at this stage.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8953

Oh, for the smaller vectors that are passed in regs? Yeah, I didn't catch that. Any clue why that's happening? Is it because there are no loads/stores to lower?

cameron.mcinally marked an inline comment as not done.Aug 18 2020, 9:22 AM
cameron.mcinally added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8953

Oh, I see what you're saying now. The second predicate register is being generated all1s. That's surprising. Ultimately, both the i32/i64 and i8/i16 cases should be going through return LowerToPredicatedOp(Op, DAG, PredOpcode);. I'll see if I can find the difference.

cameron.mcinally marked an inline comment as not done.Aug 18 2020, 9:25 AM

Ok, I see it now. We have to explicitly call LowerToPredicatedOp(...) with the fixed types still intact, so that getPredicateForVector(...) will generate the correct predicate. Will update...

paulwalker-arm added inline comments.Aug 18 2020, 9:29 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
8953

I see you're finding the answers quicker than I can type them :)

For what it's worth I've nothing against the current patch since it is simple and works, I just wanted to ensure you were aware what was going on.

Updated DIV lowering to make use of correct fixed length predicates.

The lowering is somewhat convoluted, so I'd like to hear if anyone has a better solution. I've attempted to extend the operands and truncate the results as scalable, but leave the DIVs fixed until they reach vXi32. This allows the intermediate fixed DIVs to be further lowered.

There's probably some room for code reuse with other instructions that do not have i8/i16 vector support, but I don't have a good view of what those instructions look like right now (if they exist).

cameron.mcinally marked 2 inline comments as done.Aug 19 2020, 10:57 AM
efriedma accepted this revision.Aug 19 2020, 1:01 PM

LGTM.

The end result is a big uglier than I would have hoped for, but I can't think of any particularly better way given the constraints.

This revision is now accepted and ready to land.Aug 19 2020, 1:01 PM

The end result is a big uglier than I would have hoped for, but I can't think of any particularly better way given the constraints.

Agreed. I'll wait for Paul to have a look to see if he has any tricks.

We also need a peep for the smaller vectors, where the extend can fit in one register.

The end result is a big uglier than I would have hoped for, but I can't think of any particularly better way given the constraints.

Agreed. I'll wait for Paul to have a look to see if he has any tricks.

I don't think there's a very different approach because you really want custom type legalisation for a legal type, which would be nice but doesn't exists. That said the custom lowering for i8/i16 doesn't have to be SVE specific as there's the alternative approach of using normal ISD nodes to do the widening so that only the final SDIV lowering is SVE specific. I'm thinking of the following:

std::tie(Op0Lo, Op0Hi) = DAG.SplitVector(Op.getOperand(0), DL);
std::tie(Op1Lo, Op1Hi) = DAG.SplitVector(Op.getOperand(1), DL);

Op0Lo = DAG.getNode(ISD::ZERO_EXTEND, DL, FixedWidenedVT, Op0Lo);
Op1Lo = DAG.getNode(ISD::ZERO_EXTEND, DL, FixedWidenedVT, Op1Lo);
Op0Hi = DAG.getNode(ISD::ZERO_EXTEND, DL, FixedWidenedVT, Op0Hi);
Op1Hi = DAG.getNode(ISD::ZERO_EXTEND, DL, FixedWidenedVT, Op1Hi);

ResultLo = DAG.getNode(Op.getOpcode(), DL, FixedWidenedVT, Op0Lo, Op0Hi);
ResultHi = DAG.getNode(Op.getOpcode(), DL, FixedWidenedVT, Op0Lo, Op0Hi);

ResultLo = DAG.getNode(ISD::TRUNCATE, DL, HalfVT, ResultLo);
ResultHi = DAG.getNode(ISD::TRUNCATE, DL, HalfVT, ResultHi);

return DAG.getNode(ISD::CONCAT_VECTORS, DL, VT, ResultLo, ResultHi);

Which I think looks nicer and might automatically benefit from optimising the "split" away when the resulting expanded operands can fit within a single register. The big downside to this is that bigger than NEON fixed length EXTRACT_SUBVECTOR and CONCAT_VECTORS support doesn't exist so it'll either not work today or result in terrible code.

Based on this I'm happy to take the above as a refactoring exercise (which I'm happy to do if you want) and have the patch landed in its current form.

That said the custom lowering for i8/i16 doesn't have to be SVE specific as there's the alternative approach of using normal ISD nodes to do the widening so that only the final SDIV lowering is SVE specific.

I tried that at first, but the NEON assembly was pretty ugly. I may have made a mistake there, but if not, it was all scalarized. The SVE lowering is uglier, but the generated code seems cleaner.

Extending fixed width vectors
=======================

sdiv_v8i8:
	.cfi_startproc
	umov	w11, v1.b[1]
	umov	w9, v1.b[2]
	umov	w13, v0.b[1]
	fmov	s2, w11
	umov	w10, v1.b[3]
	umov	w8, v1.b[4]
	umov	w12, v0.b[2]
	umov	w14, v0.b[3]
	fmov	s3, w13
	umov	w11, v0.b[4]
	zip1	v1.8b, v1.8b, v0.8b
	zip1	v0.8b, v0.8b, v0.8b
	mov	v2.h[1], w9
	shl	v1.4h, v1.4h, #8
	shl	v0.4h, v0.4h, #8
	mov	v3.h[1], w12
	mov	v2.h[2], w10
	sshr	v1.4h, v1.4h, #8
	sshr	v0.4h, v0.4h, #8
	mov	v3.h[2], w14
	mov	v2.h[3], w8
	umov	w13, v1.h[0]
	umov	w8, v0.h[0]
	mov	v3.h[3], w11
	shl	v2.4h, v2.4h, #8
	umov	w9, v1.h[1]
	umov	w10, v1.h[2]
	umov	w12, v0.h[1]
	umov	w11, v0.h[2]
	fmov	s0, w13
	fmov	s1, w8
	shl	v3.4h, v3.4h, #8
	sshr	v2.4h, v2.4h, #8
	mov	v0.s[1], w9
	fmov	s4, w9
	mov	v1.s[1], w12
	umov	w8, v2.h[1]
	umov	w9, v2.h[2]
	umov	w13, v2.h[0]
	fmov	s2, w12
	sshr	v3.4h, v3.4h, #8
	mov	v4.s[1], w10
	mov	v2.s[1], w11
	umov	w12, v3.h[0]
	shl	v0.2s, v0.2s, #16
	shl	v1.2s, v1.2s, #16
	umov	w10, v3.h[1]
	umov	w11, v3.h[2]
	fmov	s3, w13
	sshr	v0.2s, v0.2s, #16
	fmov	s5, w12
	sshr	v1.2s, v1.2s, #16
	ptrue	p0.s, vl2
	shl	v4.2s, v4.2s, #16
	shl	v2.2s, v2.2s, #16
	mov	v3.s[1], w8
	fmov	s6, w8
	sshr	v4.2s, v4.2s, #16
	sdivr	z0.s, p0/m, z0.s, z1.s
	sshr	v1.2s, v2.2s, #16
	mov	v5.s[1], w10
	fmov	s2, w10
	sdiv	z1.s, p0/m, z1.s, z4.s
	mov	v6.s[1], w9
	mov	v2.s[1], w11
	shl	v3.2s, v3.2s, #16
	shl	v4.2s, v5.2s, #16
	shl	v5.2s, v6.2s, #16
	shl	v2.2s, v2.2s, #16
	sshr	v3.2s, v3.2s, #16
	sshr	v4.2s, v4.2s, #16
	sdivr	z3.s, p0/m, z3.s, z4.s
	sshr	v4.2s, v5.2s, #16
	sshr	v2.2s, v2.2s, #16
	sdiv	z2.s, p0/m, z2.s, z4.s
	uzp1	v2.4h, v3.4h, v2.4h
	uzp1	v0.4h, v0.4h, v1.4h
	uzp1	v0.8b, v0.8b, v2.8b
	ret

Oh, but to be fair, I didn't use DAG.SplitVector(Op.getOperand(0), DL);. So that may avoid some of the ugly expanding.

paulwalker-arm accepted this revision.Aug 20 2020, 11:25 AM

The expanding is because we don't yet attempt to lower the subvector and concat_vector operations. As I say, I think today the correct move is to take this patch and then see what the future holds when we have full support for subvec and concat.