Page MenuHomePhabricator

[SVE][WIP] Implement lowering for fixed width select
ClosedPublic

Authored by cameron.mcinally on Aug 5 2020, 2:57 PM.

Details

Summary

Posting this patch seeking guidance...

This patch is lowering code for fixed width select. There's an issue with the select's condition operand though. Fixed width vectors-of-i1s (e.g. v8i1) aren't legal. They will be promoted to a larger legal integer vector (e.g. v8i64). We run into problems when we lower the fixed width masks to their legal scalable counterparts (e.g. nxv2i1).

Here's a hard example:

In AArch64TargetLowering::useSVEForFixedLengthVectorV(...), we have this comment:

// Fixed length predicates should be promoted to i8.
// NOTE: This is consistent with how NEON (and thus 64/128bit vectors) work.

That's problematic for a select like this:

select <8 x i1> %mask, <8 x i64> %op1, <8 x i64> %op2

At VL=512, the v8i1 mask will be promoted to v8i64. In order to lower this to a scalable mask, we'd need to insert the v8i64 subvector into a nxv2i64. And then truncate that ZPR by performing a CMPNE against 0, to get the final nxv2i1 mask. Between the zero extend to promote the vXi1 mask, and the truncate to get back to a nxvXi1, there's a lot of extra instructions.

What's the best way to proceed with this? Are these extra instructions something we live with? Or should we play with making the vXi1 mask types legal when we're lowering fixed width vectors? I don't have a good feel for how the latter will fit with the existing NEON mask though.

P.S. You'll notice the included tests do not have CHECK lines yet. I didn't want to do that work until there was a clear direction forward. I will post a small example and the generated assembly for the reviewer's convenience shortly.

Diff Detail

Event Timeline

Herald added a reviewer: efriedma. · View Herald Transcript
Herald added a project: Restricted Project. · View Herald Transcript
cameron.mcinally requested review of this revision.Aug 5 2020, 2:57 PM

For this IR test and -aarch64-sve-vector-bits-min=512:

target triple = "aarch64-unknown-linux-gnu"

define void @select(<8 x double>* %a, <8 x double>* %b, <8 x i1>* %c) #0 {
; CHECK: select:
  %mask = load <8 x i1>, <8 x i1>* %c
  %op1 = load <8 x double>, <8 x double>* %a
  %op2 = load <8 x double>, <8 x double>* %b
  %sel = select <8 x i1> %mask, <8 x double> %op1, <8 x double> %op2
  store <8 x double> %sel, <8 x double>* %a, align 4 
  ret void
}       

attributes #0 = { "target-features"="+sve" }

this patch will generate:

	ldrb	w8, [x2]
	ptrue	p0.d, vl8
	mov	x9, sp
	ptrue	p1.d
	lsr	w10, w8, #7
	lsr	w11, w8, #6
	lsr	w12, w8, #5
	lsr	w13, w8, #4

        // Extend the mask: v8i1 -> v8i64
	sbfx	x10, x10, #0, #1
	sbfx	x11, x11, #0, #1
	stp	x11, x10, [sp, #48]
	sbfx	x11, x12, #0, #1
	sbfx	x12, x13, #0, #1
	lsr	w10, w8, #3
	stp	x12, x11, [sp, #32]
	lsr	w11, w8, #2
	sbfx	x10, x10, #0, #1
	sbfx	x11, x11, #0, #1
	stp	x11, x10, [sp, #16]
	sbfx	x10, x8, #0, #1
	lsr	w8, w8, #1
	sbfx	x8, x8, #0, #1
	stp	x10, x8, [sp]

        // Load extended mask into ZPR: nxv2i64
	ld1d	{ z0.d }, p0/z, [x9]
	ld1d	{ z1.d }, p0/z, [x0]
	ld1d	{ z2.d }, p0/z, [x1]

        // Truncate the mask: nxv2i64 -> nxv2i1
	and	z0.d, z0.d, #0x1
	cmpne	p2.d, p1/z, z0.d, #0

        // Combine the select mask with the fixed VL mask
        // Note: Not sure if this is *really* needed, or if we can trust the select mask,
        //       but that's a separate issue.
	and	p1.b, p1/z, p0.b, p2.b

	sel	z0.d, p1, z1.d, z2.d
	st1d	{ z0.d }, p0, [x0]

At VL=512, the v8i1 mask will be promoted to v8i64. In order to lower this to a scalable mask, we'd need to insert the v8i64 subvector into a nxv2i64. And then truncate that ZPR by performing a CMPNE against 0, to get the final nxv2i1 mask. Between the zero extend to promote the vXi1 mask, and the truncate to get back to a nxvXi1, there's a lot of extra instructions.

I had this concern when I was reviewing the code in question. @paulwalker-arm said he found the conversions were usually folded away in his prototype. Most i1 vectors will be produced by a compare that returns an nxv2i1 or something like that.

For load <8 x i1> specifically, the code is terrible because we're using the generic target-independent expansion, which goes element by element. If we cared, we could custom-lower it to something more reasonable. Nobody has looked into it because there isn't any way to generate that operation from C code.

cameron.mcinally added a comment.EditedAug 5 2020, 4:29 PM

For load <8 x i1> specifically, the code is terrible because we're using the generic target-independent expansion, which goes element by element. If we cared, we could custom-lower it to something more reasonable. Nobody has looked into it because there isn't any way to generate that operation from C code.

That's interesting. If we could load the vXi1 and then vector extend it, it might be more palatable. I haven't checked if there are instructions to support that though. And I wonder if it will get weird with a vector-of-i1s smaller than a byte...

At VL=512, the v8i1 mask will be promoted to v8i64. In order to lower this to a scalable mask, we'd need to insert the v8i64 subvector into a nxv2i64. And then truncate that ZPR by performing a CMPNE against 0, to get the final nxv2i1 mask. Between the zero extend to promote the vXi1 mask, and the truncate to get back to a nxvXi1, there's a lot of extra instructions.

I had this concern when I was reviewing the code in question. @paulwalker-arm said he found the conversions were usually folded away in his prototype. Most i1 vectors will be produced by a compare that returns an nxv2i1 or something like that.

I guess if it is amortized away, it's not a big deal. But a CMPNE is 4 cycles and a SX is 4 cycles. So we have an 8 cycle no-op. That's not great.

Also, I think we can get rid of this AND, unless I'm missing an edge case. The LSBs will still be 1 without the AND. I'm not sure if the top bits could get us in trouble though:

        // Truncate the mask: nxv2i64 -> nxv2i1
	and	z0.d, z0.d, #0x1
	cmpne	p2.d, p1/z, z0.d, #0

For load <8 x i1> specifically, the code is terrible because we're using the generic target-independent expansion, which goes element by element. If we cared, we could custom-lower it to something more reasonable. Nobody has looked into it because there isn't any way to generate that operation from C code.

That's interesting. If we could load the vXi1 and then vector extend it, it might be more palatable. I haven't checked if there are instructions to support that though. And I wonder if it will get weird with a vector-of-i1s smaller than a byte...

We could do something like svlsr(svdup(x), svindex(0,1)). But again, it's not really worth optimizing this.

At VL=512, the v8i1 mask will be promoted to v8i64. In order to lower this to a scalable mask, we'd need to insert the v8i64 subvector into a nxv2i64. And then truncate that ZPR by performing a CMPNE against 0, to get the final nxv2i1 mask. Between the zero extend to promote the vXi1 mask, and the truncate to get back to a nxvXi1, there's a lot of extra instructions.

I had this concern when I was reviewing the code in question. @paulwalker-arm said he found the conversions were usually folded away in his prototype. Most i1 vectors will be produced by a compare that returns an nxv2i1 or something like that.

I guess if it is amortized away, it's not a big deal. But a CMPNE is 4 cycles and a SX is 4 cycles. So we have an 8 cycle no-op. That's not great.

Folded away, as in, DAGCombine gets rid of the extra instructions. If that doesn't work right now, we should be able to make it work with a little more code.

Also, I think we can get rid of this AND, unless I'm missing an edge case.

VSELECT masks are guaranteed to be all-ones or all-zeros for vectors types on ARM.

Matt added a subscriber: Matt.Aug 5 2020, 11:55 PM

At VL=512, the v8i1 mask will be promoted to v8i64. In order to lower this to a scalable mask, we'd need to insert the v8i64 subvector into a nxv2i64. And then truncate that ZPR by performing a CMPNE against 0, to get the final nxv2i1 mask. Between the zero extend to promote the vXi1 mask, and the truncate to get back to a nxvXi1, there's a lot of extra instructions.

I had this concern when I was reviewing the code in question. @paulwalker-arm said he found the conversions were usually folded away in his prototype. Most i1 vectors will be produced by a compare that returns an nxv2i1 or something like that.

I guess if it is amortized away, it's not a big deal. But a CMPNE is 4 cycles and a SX is 4 cycles. So we have an 8 cycle no-op. That's not great.

Folded away, as in, DAGCombine gets rid of the extra instructions. If that doesn't work right now, we should be able to make it work with a little more code.

I see now. My misunderstanding was that the folding away only happened within the loop block. And that we'd still have to pay the cost pre and post loop.

Reiterating what (I think) Paul said from the SVE Sync-up call, it sounds like he has a plan for DAGCombine to clean these up during ISel, instead of during type legalization. If that's correct, then it should be all good.

Just to better put my position in words. This patch does highlight the expected problem with the fixed length code generation for SVE approach, namely extensions and truncations are more costly than they need to be for SVE. At the block level my expectation is that most should get folded away as more and more of the fixed length operations are lowered to SVE. Here though is where my expectation does not match reality because I believe we lower the ext/trunc operations too early meaning we don't have a DAGCombine phase that's able to remove them. This is why my original prototype made these operations legal and effectively lowered them during selection. That said it's presumably possible to recognise the target equivalent patterns?

This will not help when crossing block boundaries (perhaps it will for global isel?) so ultimately for the best code quality[1] we're going to want to consider i1 based vectors legal but that is not a straight forward transformation (perhaps a topic for the SVE sync call?) and while "expanding" operations is a much bigger performance issue I'd rather remained focused on that. Of course there's the argument that the more we move along the current path the harder it will be to change, but currently there's only a handful of functions that relate to this and I'm continually monitoring it.

[1] Well, other than using scalable vectors directly :)

cameron.mcinally updated this revision to Diff 287473.EditedAug 24 2020, 1:40 PM

Updating to exhibit the problem mentioned in D85546. One way to see the issue is:

llc -aarch64-sve-vector-bits-min=256 -asm-verbose=0 < llvm-project/llvm/test/CodeGen/AArch64/sve-fixed-length-fp-select.ll

The problem originates in Legalize. DAGTypeLegalizer::PromoteIntOp_SIGN_EXTEND(...) will produce fixed length DAGs like this:

    t18: v8i16 = any_extend t16
  t20: v8i16 = sign_extend_inreg t18, ValueType:ch:v8i1
t12: v8f16 = vselect t20, t10, t11

This ISD::SIGN_EXTEND_INREG operation expects both the result and operand type to be the same.

However, when we go to lower this to AArch64ISD::SIGN_EXTEND_INREG_MERGE_PASSTHRU, that operation expects a half vector type for the operand. Here's a special case -- for illustrative purposes only:

def : Pat<(sext_inreg (nxv8i16 ZPR:$Zs), nxv8i8),  (SXTB_ZPmZ_H (IMPLICIT_DEF), (PTRUE_H 31), ZPR:$Zs)>;

At the surface, it's not really a big problem, since nxv8i8 and nxv8i16 are really the same register class. I just can't find a good way to bitcast between the two types with the fixed length lowering utilites.

However, when we go to lower this to AArch64ISD::SIGN_EXTEND_INREG_MERGE_PASSTHRU, that operation expects a half vector type for the operand. Here's a special case -- for illustrative purposes only:

To be precise SIGN_EXTEND_INREG_MERGE_PASSTHRU expects its VT operand to have the same number of elements as its data operand. The element type of the VT operand describes the location of the sign bit that is used when performing the extension. This is identical to the requirements for SIGN_EXTEND_INREG.

I'm still not sure of the issue you are highlighting. When I apply the patch and run the test, the first thing I hit is a selection failure, which is fixed by D86394. I then hit an assert in AArch64TargetLowering::ReconstructShuffle, which looks like it needs updating to take wide and/or scalable vectors into account. I just stubbed the call out and the code generation succeeds, albeit with terrible code. Most of this looks to be due to the lowering of the i1 based load. When I replace that load with an integer load (matching the size of the floating-point operands) and use an icmp[1] to construct the select mask I get the expected code generation.

[1] We don't have lowering for floating-point compares yet. I have a work in progress patch but it's proving a little problematic as I'm trying to make use of existing expand code like we do for scalable vectors but it seems the two legalisation paths are not equivalent.

˜Bah, egg on my face. You're right that D86394 fixes the immediate issue. Sorry for the noise.

Updating Diff for recent upstream changes.

There are two maybe controversial changes, so will comment on them inline...

cameron.mcinally added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7337–7341

@t.p.northover, does this change look okay to you? I suspect that this assert is assuming we only have NEON 64b and 128b vectors.

Fixed width SVE now has larger vectors, and for these particular tests we want to extract a 1/4 width subvector from a single width vector.

9136

Just checking if anyone sees a problem with this change. We need to explicitly check that this is a 128b vector now, since SVE can see larger fixed width vectors.

Fix bad copy-and-paste in CHECK lines. No significant changes made.

Update tests -- don't use pointer args for NEON sized vectors.

efriedma added inline comments.Sep 2 2020, 2:57 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7328

Is this assert also a potential issue?

7337–7341

Why is 4 special here?

7377

Is AArch64ISD::EXT going to work with arbitrarily large vectors?

9136

This looks fine.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7337–7341

It's a splitting quirk. I'm not even sure how to succinctly describe it, but here goes...

Let's assume a fixed width edge-case of -aarch64-sve-vector-bits-min=256. The weird splitting kicks in with vectors of >=64 elements.

In particular, anytime we hit a v64i1 and the operands have elements of 32b, the problem comes up. A v64i1 isn't a legal type, so it's split to 2*v32i1. v32i1 isn't legal either, but we can promote it to v32i8 (a 256b type). This is where the first BUILD_VECTOR comes in. Call it v32i8=BUILD_VECTOR.

So now we have something like:

<v32i32> = vselect <v32i8> cond, <v32i32> op1, <v32i32> op2

Legalization continues, and v32i32 isn't legal, so we split again to v16i32, and again to v8i32 (now legal at 256b). The mask is also split during this, so we end up with a v8i8=BUILD_VECTOR for the mask operand.

<v8i32> = vselect <v8i8> cond, <v8i32> op1, <v8i32> op2

Everything is legal now. The ReconstructShuffle(...) function now lowers the v8i8=BUILD_VECTOR, and finds the v32i8=BUILD_VECTOR as the source. So that v32/v8 gives us the magic 4. It's wonky, but I'm fairly sure it's bound at 4.

Now that I'm writing this out, I'm seeing that ReconstructShuffle(...) probably needs to be guarded better in general. We shouldn't hit the problem with fixed lowering (yet?), but I suppose someone could write a pathological set of BUILD_VECTORs to trip it up.

It's unfortunate that we can't distinguish masks from integer vectors. If we could, we'd avoid this. But that would mean making the vXi1 types legal, which I'm assuming causes headaches with NEON masks.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7328

Not sure. The fixed-width lowering case should never get here. Splitting will only be taking a subvector out of a larger vector with the same element type.

7377

The fixed-width lowering case should never get here. We always have one source that's being split.

But, yeah, I agree that this is a little sketchy. Maybe we should have a separate routine to lower the splitting case? I don't know. Just thinking out loud.

Or maybe we should have a special 1 source case within ReconstructShuffle(...)? That could work.

If you aren't planning to extend ReconstructShuffle to work with arbitrary BUILD_VECTOR nodes, I'd recommend adding a bailout, and implementing a separate codepath for large vectors that only handles the cases you actually need.

Turns out that ReconstructShuffle(...) needs a lot of work to be updated. The EXTRACT_SUBVECTOR index assumes we're extracting a 1/2 width vector. I missed that.

I added a function that lowers specific BUILD_VECTORs to EXTRACT_SUBVECTORs, hoping that it would short-circuit the ReconstructShuffle(...) problem. That was a dead end, since small EXTRACT_SUBVECTORs aren't legal. And we'd still run into the ReconstructShuffle(...) problem with those small BUILD_VECTORs..

But when the EXTRACT_SUBVECTORs are large enough to be legal, the new function could be a win with some tuning. It replaces the scalar loads with a vector load. I've included that function in this Diff for review, but will break it out into a separate Diff if it seems interesting enough to pursue. Thoughts?

Remove the BUILD_VECTOR->EXTRACT_SUBVECTOR transform so that this is easier to review. Will post a separate Diff for that.

When we legalize vectors greater than the fixed VL, the splitting code is pretty poor. That will need more work in the future.

cameron.mcinally marked 3 inline comments as done.Sep 14 2020, 7:45 AM
cameron.mcinally added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7337–7341

This now bails out when a non-NEON fixed width BUILD_VECTOR is seen. It's not ideal, but could be updated later if desired.

efriedma added inline comments.Sep 14 2020, 11:37 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7337–7341

I think you need to guard the creation of the AArch64ISD::EXT more explicitly; it won't do anything right now, but it might once we start messing with isShuffleMaskLegal.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7337–7341

Did you have something particular in mind?

We could check Src.MaxElt - Src.MinElt <= NumSrcElts/2 (note the < is needed in case undefs are in play). That's true for all 3 if-conditions though, so would be better as an assert.

We could also check that Src.MinElt < NumSrcElts && Src.MaxElt >= NumSrcElts. That should be covered by the 2 preceding if-conditions, but it would be good to fortify for future code changes.

Am I missing anything else?

efriedma added inline comments.Sep 16 2020, 7:12 PM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
7337–7341

Maybe something like the following?

if (!SrcVT.s64BitVector()) {
  // Don't know how to lower AArch64ISD::EXT for SVE vectors.
  return SDValue();
}

Here's the most obvious choice -- glueing the new check to the AArch64ISD::EXT generation. That's good because the intention is obvious, and future code changes won't get in the way. But it could also be argued that the NEON-sized vector check should precede this entire if-else statement. I.e. the EXTRACT_SUBVECTORs have the same problem as AArch64ISD::EXT. None of this code is ready for SVE-sized fixed vectors.

Thoughts on this? I don't have a strong opinion.

efriedma accepted this revision.Sep 17 2020, 11:37 AM

LGTM

The other codepaths should do the right thing in theory, so I'm fine with leaving them, I think.

This revision is now accepted and ready to land.Sep 17 2020, 11:37 AM