This is a follow-up to address a review comment on D124869. When deciding whether to PRE a vsetvli, we can allow non-LMUL1 vsetvlis.
There is an option to scale the bitwidth returned to the vectorizer by TTI's getRegisterBitWidth. Using that, you can get LMUL>1 fixed-vector loops. https://godbolt.org/z/34asbPcv7 should work for scalar too.
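For example (an illustrative sketch, not taken from the patch; it assumes -riscv-v-vector-bits-min=128, so VLEN >= 128 is known), a <16 x i32> operation is 512 bits wide and gets selected at LMUL=4:

; 16 x 32 bits = 512 bits, i.e. four times the 128-bit minimum VLEN, so
; this add should come out as a vadd.vv under e32/m4.
define <16 x i32> @lmul4_add(<16 x i32> %a, <16 x i32> %b) {
  %c = add <16 x i32> %a, %b
  ret <16 x i32> %c
}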
I had added some LMUL fixed-length tests to test/CodeGen/RISCV/rvv/sink-splat-operands.ll. (Odd name, but it's where all of our non-LMUL variants were, so...)
We seem to end up with odd patterns around loads and stores in the loop, where we toggle back and forth between e8 and e32. This toggling means we can't currently PRE the vsetvli.
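For reference, the problematic shape is roughly this (a hand-reduced sketch, not one of the actual tests; the widening and narrowing forces the element width to change inside the loop):

define void @toggle_sketch(i8* %p, <8 x i32> %v) {
entry:
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %entry
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %0 = getelementptr inbounds i8, i8* %p, i64 %index
  %1 = bitcast i8* %0 to <8 x i8>*
  %wide.load = load <8 x i8>, <8 x i8>* %1, align 1   ; selected under e8
  %2 = zext <8 x i8> %wide.load to <8 x i32>          ; widen to e32
  %3 = add <8 x i32> %2, %v                           ; arithmetic at e32
  %4 = trunc <8 x i32> %3 to <8 x i8>                 ; narrow back down
  store <8 x i8> %4, <8 x i8>* %1, align 1            ; store at e8 again
  %index.next = add nuw i64 %index, 8
  %5 = icmp eq i64 %index.next, 1024
  br i1 %5, label %exit, label %vector.body

exit:                                             ; preds = %vector.body
  ret void
}

The resulting vsetvlis alternate between e8 and e32 within the loop body, so no single VTYPE dominates the loop and there is nothing for PRE to hoist.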
llvm/test/CodeGen/RISCV/rvv/sink-splat-operands.ll:4181
I guess this didn't get optimized because the amount was in a register?
llvm/test/CodeGen/RISCV/rvv/sink-splat-operands.ll:4181
Yep, large constant AVLs are probably a case we need to handle explicitly. Haven't quite fully thought through what we want there.
Seems like there are no fractional LMULs tested by this patch? Does this suggest we should add some more test coverage?
Well, I would, but I could not find an in-tree example of what a fractional LMUL looks like in IR. (Probably just because I don't know what the syntax looks like.) If you give me an example, I can take it from there.
It's certainly easier with scalable vectors, but we do codegen fractional LMULs for fixed vectors if the minimum VLEN is sufficiently large that we know the vector can be contained within a fraction of a whole register. For example, this (copied) test case uses mf2 with -riscv-v-vector-bits-min=256:
define void @sink_splat_mul_lmulmf2(i32* nocapture %a, i32 signext %x) {
entry:
  %broadcast.splatinsert = insertelement <4 x i32> poison, i32 %x, i64 0
  %broadcast.splat = shufflevector <4 x i32> %broadcast.splatinsert, <4 x i32> poison, <4 x i32> zeroinitializer
  br label %vector.body

vector.body:                                      ; preds = %vector.body, %entry
  %index = phi i64 [ 0, %entry ], [ %index.next, %vector.body ]
  %0 = getelementptr inbounds i32, i32* %a, i64 %index
  %1 = bitcast i32* %0 to <4 x i32>*
  %wide.load = load <4 x i32>, <4 x i32>* %1, align 8
  %2 = mul <4 x i32> %wide.load, %broadcast.splat
  %3 = bitcast i32* %0 to <4 x i32>*
  store <4 x i32> %2, <4 x i32>* %3, align 8
  %index.next = add nuw i64 %index, 4
  %4 = icmp eq i64 %index.next, 1024
  br i1 %4, label %for.cond.cleanup, label %vector.body

for.cond.cleanup:                                 ; preds = %vector.body
  ret void
}
Added coverage in 33b1be591.
For my context, why is it profitable to use fractional LMULs over LMUL=1? I'm aware of the extend/truncate cases, but for operations like VADD, it seems like mf2 and m1 are equivalent (assuming VL is the same), right?
The only case I can think of that might be profitable would be using a fractional lmul so that VLMax (and thus the x0 encoding) is equal to the AVL. That seems somewhat questionable on its own.
Using a mix of lmuls makes removing vsetvlis trickier. If we simply canonicalized fractional lmuls to lmul=1 (using knowledge about the vector length if needed for the vlmax case), it seems we'd potentially remove vsetvlis.
At least toggling back and forth between fractional and lmul=1 doesn't change VL for the subset of AVLs less than the fractional width. This does at least mean we can use the AVL preserving variant. (Though, I'm not sure we actually do this... a quick look seems to indicate we don't.)
In general, I'm struggling to understand why we'd want to use fractional lmuls. Any ideas?
You're correct as far as the hardware behavior goes.
The fixed vector to scalable vector mapping has been designed so that vectors of ELEN-sized (64- or 32-bit) elements with total width <= riscv-v-vector-bits-min will produce an LMUL=1 scalable vector. Wider vectors will use LMUL=2,4,8. Fractional LMUL is not supported for vectors with SEW==ELEN by the spec. Vectors with the same number of elements and a smaller SEW will use a proportionally smaller LMUL. This mapping is independent of what types are actually used in the basic block or function.
So in mixed element width code with all vectors having the same number of elements, all the vsetvlis should have the same SEW/LMUL ratio. That seems like the ideal property to have for vsetvli removal.
On an ELEN=64 target, with no i64/f64 elements and all vectors <= riscv-v-vector-bits-min, we will only have fractional LMULs.
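A concrete (made-up) example of that mapping, assuming an ELEN=64 target with -riscv-v-vector-bits-min=128:

define void @same_ratio(<2 x i64>* %p, <2 x i32>* %q) {
  ; <2 x i64> is 128 bits with SEW == ELEN, so it maps to an LMUL=1
  ; container and is selected under e64/m1.
  %a = load <2 x i64>, <2 x i64>* %p
  ; <2 x i32> has the same element count at half the width, so it maps to
  ; a proportionally smaller container and is selected under e32/mf2.
  %b = load <2 x i32>, <2 x i32>* %q
  store <2 x i64> %a, <2 x i64>* %p
  store <2 x i32> %b, <2 x i32>* %q
  ret void
}

Both operations have SEW/LMUL = 64, so for the same AVL they produce the same VL.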
Ok, I think I get your point here. In the world where all of the fixed-length vectors have the same number of elements (though possibly different element types), VL does not change when we switch element types. As such, we can use the VL-preserving form, as I noted above.
I do see some cases where, by changing from a fractional LMUL to LMUL1, we might be able to remove a vsetvli entirely, but all of those require speculation safety proofs (to avoid having to change VL).
Basically, we're making the (entirely reasonable for real hardware) guess that a VL-preserving vsetvli is cheaper to execute than a vsetvli which preserves VTYPE and changes VL. The default scheme biases towards creating cases where we only change VTYPE. The alternate canonicalization (towards a single LMUL) runs the risk of requiring a VL-changing vsetvli.
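As a sketch of that trade-off (the vsetvli forms in the comments are my expectation, not verified output; assumes -riscv-v-vector-bits-min=128):

define void @toggle_forms(<4 x i32>* %p, <4 x i16>* %q) {
  ; <4 x i32> maps to e32/m1 and <4 x i16> to e16/mf2, i.e. the same
  ; SEW/LMUL ratio, so the toggle between them can use the VL-preserving
  ; form: vsetvli x0, x0, e16, mf2, ta, mu
  %a = load <4 x i32>, <4 x i32>* %p
  %b = load <4 x i16>, <4 x i16>* %q
  ; Canonicalizing the e16 ops to m1 instead would change the SEW/LMUL
  ; ratio; the x0, x0 form would no longer be legal (VLMAX would change),
  ; and we would need a vsetvli that re-reads an AVL and rewrites VL.
  store <4 x i32> %a, <4 x i32>* %p
  store <4 x i16> %b, <4 x i16>* %q
  ret void
}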
Thanks for the context.
I'll give the speculation cases some more thought, but that's low priority at the moment. I don't have any concrete motivating examples.