Changes VPReplicateRecipe to extract the last lane from an unconditional,
uniform store instruction. collectLoopUniforms will also add stores to
the list of uniform instructions where Legal->isUniformMemOp is true.
setCostBasedWideningDecision now sets the widening decision for
all uniform memory ops to Scalarize, where previously GatherScatter
may have been chosen for scalable stores.
This fixes an assert ("Cannot yet scalarize uniform stores") in
setCostBasedWideningDecision when we have a loop containing a
uniform i1 store and a scalable VF, which we cannot create a scatter for.
Unfortunately this is incorrect now. The worklist currently relies on every instruction in it demanding only the first lane, and that property is then propagated to operands: if all users of an operand demand only the first lane, the operand itself only needs to compute the first lane. I added a clarifying comment in b4992dbb21ff9159285ae0aec73f3d760344b0e5.
Adding stores violates that, since a uniform store demands the last lane rather than the first. You should probably be able to work around that by adding them to Uniforms without adding them to the worklist. It might be worth calling out in a comment that entries in Uniforms may demand either the first or the last lane.
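
As a rough illustration of the propagation rule and the suggested workaround, here is a toy C++ model. It is not the LoopVectorize code: ToyInst, propagateFirstLaneOnly and collectToyUniforms are hypothetical names, instructions are reduced to integer ids, and the only thing modelled is "an operand joins the worklist when all of its users are already in it".

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for an instruction: an id plus the ids of its operands.
struct ToyInst {
  int Id;
  std::vector<int> Operands;
};

// Propagate "only the first lane is demanded" from the seeds to their operands.
// An operand is added only if every one of its users is already in the worklist.
std::set<int> propagateFirstLaneOnly(const std::vector<ToyInst> &Insts,
                                     const std::vector<int> &Seeds) {
  std::vector<int> Worklist(Seeds.begin(), Seeds.end());
  std::set<int> InWorklist(Seeds.begin(), Seeds.end());
  for (std::size_t Idx = 0; Idx != Worklist.size(); ++Idx) {
    int Cur = Worklist[Idx];
    for (const ToyInst &I : Insts) {
      if (I.Id != Cur)
        continue;
      for (int Op : I.Operands) {
        if (InWorklist.count(Op))
          continue;
        bool AllUsersInWorklist = true;
        for (const ToyInst &U : Insts)
          for (int UOp : U.Operands)
            if (UOp == Op && !InWorklist.count(U.Id))
              AllUsersInWorklist = false;
        if (AllUsersInWorklist) {
          InWorklist.insert(Op);
          Worklist.push_back(Op);
        }
      }
    }
  }
  return InWorklist;
}

// Sketch of the workaround: uniform stores go into the final Uniforms set,
// but are never used as worklist seeds, so nothing is propagated from them.
std::set<int> collectToyUniforms(const std::vector<ToyInst> &Insts,
                                 const std::vector<int> &FirstLaneSeeds,
                                 const std::vector<int> &UniformStores) {
  std::set<int> FirstLane = propagateFirstLaneOnly(Insts, FirstLaneSeeds);
  std::set<int> Uniforms = FirstLane;
  for (int S : UniformStores) {
    assert(!FirstLane.count(S) && "store must stay out of the worklist");
    Uniforms.insert(S); // this entry demands the last lane, not the first
  }
  return Uniforms;
}
```

With a store (id 2) whose operands are a stored value (id 1) and an address (id 0), seeding the worklist with the store marks the stored value as first-lane-only, which is the problem described above. Passing the store through UniformStores instead keeps it in the result without marking its operands.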