While experimenting with making the insertion into element 0 of an fp vector free (given that the "vector merge high" instruction is available, which takes two input vectors and combines them into one), I ran into a regression. In the end, it seemed that the SLP vectorizer was doing one more vectorization (4 instead of 3) in a function, which ended up producing significantly more vector operands. This would probably be worth looking into at some point.
However, I also noticed many VL64 -> VREPG sequences where a single VLREPG would seemingly have sufficed. IIUC, SystemZ buildVector() creates a REPLICATE node if there is a loaded element, with the intent that this becomes a VLREP. It seems that this folding does not happen if the load has more than one user.
To remedy this, I experimented with a patch that handles these cases by making the other users of the load use element 0 of the REPLICATE instead of the load. This way, the load has only the REPLICATE node as a user, and we get a VLREP.
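To illustrate the idea outside of the SelectionDAG, here is a minimal, self-contained sketch of the use-redirection step. It is a toy model, not the patch itself: `Node`, `countUses`, `redirectOtherUsers`, and `canFoldToLoadAndReplicate` are hypothetical names made up for this example, and the model ignores details such as the load's chain result.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical toy node for illustration only; this is not LLVM's SDNode.
struct Node {
  std::string Op;               // e.g. "load", "replicate", "extract0", "fadd"
  std::vector<Node *> Operands; // values this node uses
};

// Count how many operand slots in the graph refer to N.
int countUses(const std::vector<Node *> &Graph, const Node *N) {
  int Uses = 0;
  for (const Node *User : Graph)
    Uses += static_cast<int>(
        std::count(User->Operands.begin(), User->Operands.end(), N));
  return Uses;
}

// Make every user of Load other than Rep use Extract0 instead, where
// Extract0 stands for "element 0 of Rep". Afterwards Rep is the only
// remaining user of Load.
void redirectOtherUsers(std::vector<Node *> &Graph, Node *Load, Node *Rep,
                        Node *Extract0) {
  assert(Rep->Op == "replicate" && Rep->Operands.size() == 1 &&
         Rep->Operands[0] == Load && "Rep must replicate the load");
  for (Node *User : Graph) {
    if (User == Rep)
      continue; // the replicate keeps using the load directly
    std::replace(User->Operands.begin(), User->Operands.end(), Load, Extract0);
  }
}

// Assumed folding condition in this toy model: the replicated value is a
// load whose only use is the replicate itself.
bool canFoldToLoadAndReplicate(const std::vector<Node *> &Graph,
                               const Node *Rep) {
  const Node *Src = Rep->Operands[0];
  return Src->Op == "load" && countUses(Graph, Src) == 1;
}
```

In this model, a load used by both a replicate and an fadd fails the one-use check until `redirectOtherUsers` has been called, after which the fadd reads element 0 of the replicate and the check succeeds.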
So far, I have only looked at my test case, which was floating point (calculix/Utilities_DV), and I am not sure of all the implications. It just seems better to get VLREP in more cases... For the files that get a different opcode count compared to trunk on SPEC, see
I think the regression I saw during my experiments mostly disappeared with this fix. I am not yet sure about the performance effects in general.
What exactly are you trying to check here? As far as I can see, UseVT is always equal to LdVT, right? (It's the type of UI's "Val", which is the value of N ...)
Why not simply check above that the result type of the load is a floating-point type, and then be done with it?