I was tempted to try to change the MVETailPredication pass to a simpler "if loop has an activelanemask, convert it to a vctp" pass. Here I've just left it as is, but allowed vector reductions through into tail predicated loops.
Just a remark on this:
I was tempted to try to change the MVETailPredication pass to a simpler "if loop has an activelanemask, convert it to a vctp" pass.
After the rewrite of this pass to use activelanemask, that's essentially all there's left here, so I don't see the benefit of that (unless I'm missing something). But anyway, what Sam said, nice codegen changes.
After the rewrite of this pass to use activelanemask, that's essentially all there's left here
Yep, that's the idea, but taking it further. The pass currently starts off by finding set_loop_iterations hardware loop intrinsics, then searches through for masked loads/stores (but doesn't check for normal loads/stores), and can then get held up by other intrinsics like the reductions fixed here. Not much of that is really beneficial, and the whole pass could simply convert active lane masks to vctps. I think that is often better even if you don't produce a tail predicated loop at the end of it.
Although I'm not sure how much it would help in the grand scheme of things. If we are in the state where we have an active lane mask but for whatever reason don't produce a tail predicated loop, the codegen is going to be pretty bad either way.
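For readers unfamiliar with the two predicates being discussed, here is a rough scalar model of why an active lane mask can be rewritten as a vctp. It is only a sketch, not code from the pass: the names `activeLaneMask`, `vctp32`, `masksAgreeForLoop` and `kLanes` are made up for illustration, and it assumes 4 x i32 lanes and a loop whose induction variable starts at 0 and steps by the vector width.

```cpp
#include <array>
#include <cstdint>

// Assumption for this sketch: 4 x i32 lanes in a 128-bit vector.
constexpr unsigned kLanes = 4;
using Mask = std::array<bool, kLanes>;

// Scalar model of llvm.get.active.lane.mask(base, n):
// lane i is active when base + i < n (unsigned, computed without wrapping).
Mask activeLaneMask(uint32_t base, uint32_t n) {
  Mask m{};
  for (unsigned i = 0; i < kLanes; ++i)
    m[i] = static_cast<uint64_t>(base) + i < n;
  return m;
}

// Scalar model of a 32-bit vctp: the first `remaining` lanes are active
// (all lanes when remaining >= kLanes).
Mask vctp32(uint32_t remaining) {
  Mask m{};
  for (unsigned i = 0; i < kLanes; ++i)
    m[i] = i < remaining;
  return m;
}

// Hypothetical helper: walks a vectorized loop over n elements and checks
// that both predicates select the same lanes on every iteration, with the
// vctp operand tracked as "elements remaining" (n - base).
bool masksAgreeForLoop(uint32_t n) {
  uint32_t remaining = n;
  for (uint32_t base = 0; base < n; base += kLanes) {
    if (activeLaneMask(base, n) != vctp32(remaining))
      return false;
    remaining -= (remaining < kLanes ? remaining : kLanes);
  }
  return true;
}
```

Under those assumptions the conversion amounts to replacing the (base, n) pair with a single count of elements remaining, which is why it could in principle be done independently of whether a tail predicated loop is formed afterwards.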