This change is a bit of a hack, but I've run out of better ideas. Basically, I'm adding a fudge factor to the cost of insertelement and extractelement operations at very small VLs. This makes vectorization of partially vectorizable sub-trees appear less profitable to SLP, with the result that we vectorize significantly fewer small trees when SLP is enabled.
Note that because this penalty *isn't* being applied to loads and stores, we will still vectorize a VL=2 tree (or a VL=2 subtree with a wider root) if the sub-tree is fully vectorizable.
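As a rough standalone sketch of the shape of the change (the threshold, penalty value, and function names here are illustrative assumptions, not the actual patch; the real change lives in the RISC-V TTI cost hooks):

```cpp
#include <cstdint>

// Hypothetical constants: what counts as a "very small VL" and how big
// the fudge factor is are assumptions for illustration only.
constexpr unsigned SmallVLThreshold = 3;
constexpr uint64_t SmallVLPenalty = 1;

// Insert/extract element costs get bumped at small VLs, making
// partially vectorizable trees (which need scalar<->vector element
// traffic) look less profitable to SLP.
uint64_t insertExtractCost(uint64_t BaseCost, unsigned NumElements) {
  if (NumElements < SmallVLThreshold)
    return BaseCost + SmallVLPenalty;
  return BaseCost;
}

// Loads and stores are deliberately left unpenalized, so a fully
// vectorizable VL=2 tree (no element inserts/extracts) still wins.
uint64_t memOpCost(uint64_t BaseCost, unsigned /*NumElements*/) {
  return BaseCost;
}
```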
Here's the impact on namd from SPEC 2017:
Before:
```
vsetivli : 2181 total
  1: 135   2: 1411   3: 21   4: 579   8: 35
vsetvli  : 735 total
  zero, zero: 687   reg, zero: 48
```
After:
```
vsetivli : 1360 total
  1: 73   2: 737   3: 11   4: 505   8: 34
vsetvli  : 257 total
  zero, zero: 191   reg, zero: 66
```
This interface isn't used solely by SLP, but it's close. There's one use in CodeGenPrepare which will cause CGP to be slightly more aggressive about speculating unused lanes, and one use in LV which will bias us away from uniform stores at very small VLs. Within SLP, this is mostly used to compute build_vector and extract costs (see the sketch below). As the SLP test diff shows, this sometimes results in odd choices - e.g. why does increasing the scalarization cost result in *fewer* sub-vector extracts? - but on the whole it clearly decreases vectorization at small VLs.
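To make the mechanism concrete: SLP prices a build_vector roughly as the sum of per-element insert costs, so a per-element fudge factor grows with the number of scalar operands that must be inserted. A simplified model (assumed, not SLP's actual accounting):

```cpp
#include <cstdint>

// Simplified model: each scalar operand of a build_vector needs an
// insertelement, so bumping the per-insert cost penalizes every
// partially vectorized node in proportion to its width.
uint64_t buildVectorCost(unsigned NumScalarOperands,
                         uint64_t PerInsertCost) {
  return NumScalarOperands * PerInsertCost;
}

// Extracting results back to scalar users is priced the same way, so
// the penalty also discourages trees whose values escape the tree.
uint64_t extractCost(unsigned NumExtractedElts, uint64_t PerExtractCost) {
  return NumExtractedElts * PerExtractCost;
}
```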
For context, I've been trying to improve vector codegen at small VLs for the last few weeks (and @luke has been helping), but we don't seem to be making significant progress. My goal with this patch is basically to sidestep that work, be able to enable SLP by default, and *then* return to hammering on small VL codegen.
Is it needed for `CostKind == TTI::TCK_CodeSize`?
I was thinking the total cost was composed of a vmv instruction (BaseCost) and a vslide instruction (SlideCost).
IIUC, BaseCost includes the vector-scalar communication cost, and the addend 2 here accounts for that.
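In other words, my mental model of the composition is something like the following (a hedged simplification of the real getVectorInstrCost logic; the constants are illustrative, not the actual values):

```cpp
#include <cstdint>

uint64_t vectorInstrCost(unsigned Index) {
  // BaseCost: the vmv.x.s / vmv.s.x moving data between the scalar and
  // vector register files, i.e. the vector-scalar communication that
  // the addend 2 would be accounting for.
  uint64_t BaseCost = 2;
  // SlideCost: the vslidedown.vx / vslideup.vx needed to reach a
  // non-zero element index; element 0 needs no slide.
  uint64_t SlideCost = (Index == 0) ? 0 : 1;
  return BaseCost + SlideCost;
}
```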