I expect a small code-size win from this change and possibly a slight perf win, but I don't have a representative benchmarking setup to test that theory. I'm posting this patch to get feedback and to let others try it if they're interested. If you have access to SPEC or other standard benchmarks, I'd be most grateful to know whether it helps.
The irony is that AMD Jaguar apparently does not have macro-fusion, so the target I was hoping to help the most is excluded from consideration...
In the motivating case from PR35681, which is represented by the new test in this patch:
...there's a 37 -> 31 byte size win for the loop because we eliminate the big base address offsets.
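For a rough sense of why large base address offsets cost code size: an x86 memory operand encodes its displacement in 0, 1, or 4 bytes depending on the value. The helper below is a simplified sketch of that rule, not code from this patch. The name `disp_bytes` is hypothetical, and the model ignores the rbp/r13 base special case as well as RIP-relative and absolute addressing.

```c
#include <stdint.h>

/* Bytes used by the displacement field of an x86 memory operand.
 * Simplified model (an assumption for illustration): ignores the case
 * where an rbp/r13 base forces a disp8 even for a zero offset, and
 * ignores RIP-relative and absolute addressing forms. */
int disp_bytes(int32_t disp) {
    if (disp == 0)
        return 0;   /* no displacement field needed */
    if (disp >= -128 && disp <= 127)
        return 1;   /* fits in a sign-extended disp8 */
    return 4;       /* anything larger needs a full disp32 */
}
```

Under this model, a memory operand whose offset falls outside [-128, 127] carries 4 displacement bytes, and it drops to 0 or 1 when the offset is removed or shrunk. How the 6 bytes saved in the loop above split across individual instructions is not worked out here.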