This patch teaches the VSETVLI insertion pass to perform a very limited form of partial redundancy elimination. The motivating example comes from the fixed-length vectorization of a simple loop such as:
for (unsigned i = 0; i < a_len; i++) a[i] += b;
Without this change, the core vector loop and preheader are as follows:
.LBB0_3:                                # %vector.ph
        andi    a1, a6, -8
        addi    a4, a0, 16
        mv      a5, a1
.LBB0_4:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        addi    a3, a4, -16
        vsetivli zero, 4, e32, m1, ta, mu
        vle32.v v8, (a3)
        vle32.v v9, (a4)
        vadd.vx v8, v8, a2
        vadd.vx v9, v9, a2
        vse32.v v8, (a3)
        vse32.v v9, (a4)
        addi    a5, a5, -8
        addi    a4, a4, 32
        bnez    a5, .LBB0_4
The key thing to note here is that, I believe, the vsetivli only needs to execute once. Since there's no tail folding happening here, the vector configuration registers are invariant through the loop.
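(For contrast, here's a rough, hand-written sketch of what a tail-folded loop body would look like; it is not output from this patch, and the register choices and instruction sequence are made up for illustration. With tail folding, the AVL is recomputed from the remaining element count each iteration, so vl can shrink on the final iteration and the configuration is no longer loop-invariant and can't be hoisted.)

.LBB2_1:                                # hypothetical tail-folded loop body
        vsetvli a3, a1, e32, m1, ta, mu # vl depends on remaining count in a1
        vle32.v v8, (a4)
        vadd.vx v8, v8, a2
        vse32.v v8, (a4)
        sub     a1, a1, a3              # remaining count shrinks each iteration
        slli    a5, a3, 2
        add     a4, a4, a5              # advance pointer by vl * 4 bytes
        bnez    a1, .LBB2_1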
After this patch, we hoist the configuration into the preheader and perform it once.
.LBB0_3:                                # %vector.ph
        andi    a1, a6, -8
        vsetivli zero, 4, e32, m1, ta, mu
        addi    a4, a0, 16
        mv      a5, a1
.LBB0_4:                                # %vector.body
                                        # =>This Inner Loop Header: Depth=1
        addi    a3, a4, -16
        vle32.v v8, (a3)
        vle32.v v9, (a4)
        vadd.vx v8, v8, a2
        vadd.vx v9, v9, a2
        vse32.v v8, (a3)
        vse32.v v9, (a4)
        addi    a5, a5, -8
        addi    a4, a4, 32
        bnez    a5, .LBB0_4
Once this lands, I plan on extending it to non-immediate AVLs in a separate patch, but frankly, that's less important. For a scalable loop, we always have a vsetvli above, and in most cases, full redundancy (via the existing dataflow) kicks in. There are enough cases to be worth handling via FRE eventually, but fixed-length vectors are definitely much higher impact.
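(To illustrate what the follow-up would need to handle, here's a rough, hand-written sketch of one way a non-immediate AVL can arise; it is not output from this patch, and the register choices and LMUL are made up, assuming a known VLEN of 128 so that e32/m8 covers exactly 32 elements. Once the fixed element count no longer fits vsetivli's 5-bit immediate, the AVL has to be materialized in a register, and hoisting additionally requires proving that register is loop-invariant.)

        li      a3, 32                  # AVL of 32 doesn't fit vsetivli's immediate
.LBB1_1:                                # hypothetical fixed-length vector loop
        vsetvli zero, a3, e32, m8, ta, mu
        vle32.v v8, (a0)
        vadd.vx v8, v8, a2
        vse32.v v8, (a0)
        addi    a0, a0, 128             # advance by 32 elements * 4 bytes
        addi    a4, a4, -1
        bnez    a4, .LBB1_1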
Any reason to restrict this to uint8_t? It looks like it's assigned to an unsigned where it's used.