I was experimenting with the LoopRotation option -rotation-max-header-size and found that when I increased it, a few more loops were vectorized.
This threshold is compared against the computed size of the loop header, which is obtained by calling getUserCost() (instead of getArithmeticInstrCost(), etc.). This seems to be a poor man's cost function, so I looked over the various costs that are actually used for SystemZ.
The only cost I came to question was that of the extension of an i1, which is free by default. I suspect this is wrong on SystemZ, as it is implemented with LOC for integers and with a branch sequence for floating point.
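To illustrate why the cost of a single i1 extension can matter here, below is a toy model in Python. It is not LLVM code: the opcode names, the unit costs and the threshold values are all made up for illustration, and the only assumption carried over from the pass is that the header size is a sum of per-instruction costs that must not exceed the threshold.

```python
# Toy model only, not LLVM code. Opcode names, costs and thresholds are hypothetical.
TCC_FREE = 0   # cost of an instruction that is considered free
TCC_BASIC = 1  # cost of an ordinary instruction

def header_size(opcodes, i1_ext_cost=TCC_FREE):
    """Sum a per-instruction cost over a loop header, in the spirit of getUserCost()."""
    total = 0
    for op in opcodes:
        if op == "zext_i1":
            total += i1_ext_cost  # the cost being questioned
        else:
            total += TCC_BASIC
    return total

def would_rotate(opcodes, max_header_size, i1_ext_cost=TCC_FREE):
    """Assumed rule: rotate only if the header size does not exceed the threshold."""
    return header_size(opcodes, i1_ext_cost) <= max_header_size

header = ["load", "icmp", "zext_i1", "add", "br"]
print(would_rotate(header, 4))             # True: the i1 extension is free
print(would_rotate(header, 4, TCC_BASIC))  # False: the extension now costs one unit
print(would_rotate(header, 5, TCC_BASIC))  # True again once the threshold is raised
```

In this model, charging for the extension pushes a borderline header over the limit, so changing that one cost interacts directly with the threshold setting.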
So far I have seen mixed results, mostly in favor, but also a regression or two. I also experimented with the SimplifyCFG option -phi-node-folding-threshold=3 (instead of the default 2), to handle some cases that seemed better off transformed by SimplifyCFG but no longer were once the i1 extension had a cost. This in turn also led to mixed results.
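The interaction with the folding threshold can be sketched the same way. This is again a toy model in Python and not the SimplifyCFG implementation: it assumes that each instruction to be speculated is charged against a budget equal to the threshold, and the instruction names and costs are hypothetical.

```python
# Toy model only, not the SimplifyCFG implementation. Names and costs are hypothetical.
def can_speculate(costs, threshold):
    """Assumed rule: charge each speculated instruction against a budget
    equal to the threshold, and give up as soon as one does not fit."""
    remaining = threshold
    for cost in costs:
        if cost > remaining:
            return False
        remaining -= cost
    return True

def block_costs(opcodes, i1_ext_cost):
    """Unit cost for every instruction except the i1 extension."""
    return [i1_ext_cost if op == "zext_i1" else 1 for op in opcodes]

block = ["add", "icmp", "zext_i1"]
print(can_speculate(block_costs(block, 0), 2))  # True: extension free, threshold 2
print(can_speculate(block_costs(block, 1), 2))  # False: extension costs 1, threshold 2
print(can_speculate(block_costs(block, 1), 3))  # True: extension costs 1, threshold 3
```

Under these assumptions, a block that was just within the default budget is no longer speculated once the extension has a cost, and raising the threshold by one restores the transformation.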
I think this might be worth investigating further. In particular, I wonder if it may be worth fine-tuning exactly what SimplifyCFG decides to speculate. Raising the limit to 3 seems beneficial in at least some cases.