This is a follow-up to D70607 where we made any extract element on SLM more costly than default. But that is pessimistic for extract from element 0 because that corresponds to x86 movd/movq instructions. These generally have >1 cycle latency, but they are probably implemented as single uop instructions.
Note that no vectorization tests are affected by this change. Also, no targets besides SLM are affected because those are falling through to the default cost of 1 anyway. But this will become visible/important if we add more specializations via cost tables.