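vrgather.vv across multiple vector registers (i.e. LMUL > 1) requires all-to-all data movement: each destination register may need elements from every source register, so the cost should be expected to grow roughly quadratically in LMUL rather than linearly. This change includes two conceptual sets of fixes: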
- For permutes, we were previously modeling the cost as linear in LMUL.
- For reverse, we were previously modeling the cost as fixed, i.e. independent of LMUL.
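To make the gap concrete, here is a minimal sketch of the three scaling models, assuming an integer LMUL (1, 2, 4, or 8) and a cost unit of one single-register vrgather.vv. The function names are hypothetical and this is not the actual cost model code; the quadratic form is one reasonable reading of "all-to-all", not a measured number.

```cpp
#include <cassert>

// Hypothetical helpers for illustration only; none of these names exist in LLVM.
// Costs are in units of "one vrgather.vv at LMUL = 1".

// Old model for reverse: cost does not depend on LMUL at all.
inline unsigned fixedGatherCost(unsigned LMUL) {
  assert(LMUL >= 1 && "integer LMUL assumed");
  return 1;
}

// Old model for permutes: cost scales with the number of registers in the group.
inline unsigned linearGatherCost(unsigned LMUL) {
  assert(LMUL >= 1 && "integer LMUL assumed");
  return LMUL;
}

// All-to-all model: each of the LMUL destination registers may read from
// each of the LMUL source registers, giving LMUL * LMUL register-level moves.
inline unsigned quadraticGatherCost(unsigned LMUL) {
  assert(LMUL >= 1 && "integer LMUL assumed");
  return LMUL * LMUL;
}
```

Under these assumptions, at LMUL = 8 the fixed model reports 1, the linear model 8, and the all-to-all model 64, so the old reverse cost would be off by a much larger factor than the old permute cost.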
Noticed via code inspection while looking at something else.
It's worth asking whether we should be lowering reverse to something other than a vrgather at high LMULs. That shuffle is quite expensive.