This patch teaches the VSETVLI insertion pass to perform a very limited form of partial redundancy elimination. The motivating example comes from the fixed length vectorization of a simple loop such as:
for (unsigned i = 0; i < a_len; i++)
    a[i] += b;

Without this change, the core vector loop and preheader are as follows:
.LBB0_3:                                # %vector.ph
	andi	a1, a6, -8
	addi	a4, a0, 16
	mv	a5, a1
.LBB0_4:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	addi	a3, a4, -16
	vsetivli	zero, 4, e32, m1, ta, mu
	vle32.v	v8, (a3)
	vle32.v	v9, (a4)
	vadd.vx	v8, v8, a2
	vadd.vx	v9, v9, a2
	vse32.v	v8, (a3)
	vse32.v	v9, (a4)
	addi	a5, a5, -8
	addi	a4, a4, 32
	bnez	a5, .LBB0_4

The key thing to note here is that, I believe, the vsetivli only needs to execute once. Since there's no tail folding happening here, the values of the vector configuration registers are invariant through the loop.
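For illustration only, here's a minimal, self-contained C++ sketch of that hoisting idea, using made-up types (VConfig, Inst, Block) rather than the actual data structures in RISCVInsertVSETVLI.cpp; the real pass has to reason about the dataflow state, predecessors, and AVL operands, which this deliberately skips:

// Hypothetical, simplified model: if every configuration established inside a
// single-block loop is the same immediate-AVL state, it is loop-invariant, so
// one copy in the preheader suffices and the in-loop copies can be removed.
#include <algorithm>
#include <iostream>
#include <optional>
#include <vector>

struct VConfig {
  unsigned SEW, LMUL, AVLImm;
  bool TailAgnostic, MaskAgnostic;
  bool operator==(const VConfig &O) const {
    return SEW == O.SEW && LMUL == O.LMUL && AVLImm == O.AVLImm &&
           TailAgnostic == O.TailAgnostic && MaskAgnostic == O.MaskAgnostic;
  }
};

struct Inst {
  std::optional<VConfig> SetsConfig; // engaged for vsetivli-like instructions
};

struct Block {
  std::vector<Inst> Insts;
};

bool hoistInvariantConfig(Block &Preheader, Block &LoopBody) {
  std::optional<VConfig> Common;
  for (const Inst &I : LoopBody.Insts) {
    if (!I.SetsConfig)
      continue;
    if (Common && !(*Common == *I.SetsConfig))
      return false; // conflicting configurations: not loop-invariant
    Common = I.SetsConfig;
  }
  if (!Common)
    return false; // nothing to hoist

  // Place the single configuration in the preheader and drop the
  // now-redundant in-loop copies.
  Preheader.Insts.push_back(Inst{Common});
  LoopBody.Insts.erase(
      std::remove_if(LoopBody.Insts.begin(), LoopBody.Insts.end(),
                     [](const Inst &I) { return I.SetsConfig.has_value(); }),
      LoopBody.Insts.end());
  return true;
}

int main() {
  Block Preheader, Body;
  // Mirror the example above: one vsetivli (AVL=4, e32, m1, ta, mu) in the body.
  Body.Insts = {Inst{VConfig{32, 1, 4, true, false}}, Inst{}, Inst{}};
  std::cout << "hoisted: " << hoistInvariantConfig(Preheader, Body) << "\n";
}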
After this patch, we hoist the configuration into the preheader and perform it once.
.LBB0_3:                                # %vector.ph
	andi	a1, a6, -8
	vsetivli	zero, 4, e32, m1, ta, mu
	addi	a4, a0, 16
	mv	a5, a1
.LBB0_4:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
	addi	a3, a4, -16
	vle32.v	v8, (a3)
	vle32.v	v9, (a4)
	vadd.vx	v8, v8, a2
	vadd.vx	v9, v9, a2
	vse32.v	v8, (a3)
	vse32.v	v9, (a4)
	addi	a5, a5, -8
	addi	a4, a4, 32
	bnez	a5, .LBB0_4

Once this lands, I plan to extend it to non-immediate AVLs in a separate patch, but frankly, that's less important. For a scalable loop, we always have a vsetvli above, and in most cases full redundancy (via the existing dataflow) kicks in. There are enough cases to be worth handling via FRE eventually, but fixed length vectors are definitely the much higher-impact case.
Any reason to restrict this to uint8_t? Looks like it's assigned to an unsigned where it's used.