Performance results so far show minor changes both ways, with a slight overall regression that may be worth looking into. However, I got several nice improvements with my first version of this patch, which (somewhat by mistake) loaded the fp constants into a full VR128 instead of a VR32/VR64. That first patch gave a lot more spilling overall (the vector regs are never callee-saved, right?), but also some nice improvements:
(VR128 version)

Improvements:
- 0.938: f507.cactuBSSN_r
- 0.975: f544.nab_r
- 0.985: f511.povray_r

Regressions:
- 1.016: i523.xalancbmk_r
I am trying to look into why cactus is so much better. One thing I do know is that in that particular file there is, for some reason, a lot *less* spilling when constants are loaded into full vector regs instead of VR32/64 ones (this is the same file that was improving previously with fp-contract=off, with a huge function that is not that easy to analyze...). This could be something "random", or there may be an explanation / heuristic to be found. I am not sure why VR128 should work better... In that file, the VR128 version gave fewer spills than main or the VR32/VR64 version, which were about the same.
I am not quite sure whether the new SystemZ::INSERT_FP_LANE is necessary; perhaps it would be easier to just emit the machine nodes directly there instead.
With the handling in foldMemoryOperandImpl() in place, I don't see much more to improve, except to try to figure out more about the cactus difference.