The meat of this patch is a change to the lowering of llvm.get.active.lane.mask: first compute the number of remaining lanes which need to be active in the scalar domain, and then use a single vector comparison. This replaces the existing lowering, which uses a vector saturating add followed by a comparison. (For discussion, I'm ignoring the splats since they generally get folded into the using instructions.) This results in a significant codegen improvement for RISCV, and while I'm not an AArch64 expert, the result appears profitable there as well; confirmation from someone who knows the target would be appreciated.
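To make the difference concrete, here's a rough IR-level sketch of the two strategies for a fixed <4 x i32> case. The actual change is in SelectionDAG lowering, so the IR form (and the function names) are purely illustrative:

```llvm
declare <4 x i32> @llvm.uadd.sat.v4i32(<4 x i32>, <4 x i32>)

; Existing lowering: splat the base, saturating-add the step vector,
; then compare against a splat of the trip count.
define <4 x i1> @lanemask_old(i32 %base, i32 %n) {
  %b.ins = insertelement <4 x i32> poison, i32 %base, i64 0
  %b = shufflevector <4 x i32> %b.ins, <4 x i32> poison, <4 x i32> zeroinitializer
  %n.ins = insertelement <4 x i32> poison, i32 %n, i64 0
  %n.splat = shufflevector <4 x i32> %n.ins, <4 x i32> poison, <4 x i32> zeroinitializer
  ; the saturating add keeps the lane indices from wrapping past %n
  %idx = call <4 x i32> @llvm.uadd.sat.v4i32(<4 x i32> %b, <4 x i32> <i32 0, i32 1, i32 2, i32 3>)
  %mask = icmp ult <4 x i32> %idx, %n.splat
  ret <4 x i1> %mask
}

; Proposed lowering: one scalar sub, one splat, one vector compare.
define <4 x i1> @lanemask_new(i32 %base, i32 %n) {
  %rem = sub i32 %n, %base   ; remaining lanes; relies on %n uge %base
  %r.ins = insertelement <4 x i32> poison, i32 %rem, i64 0
  %r = shufflevector <4 x i32> %r.ins, <4 x i32> poison, <4 x i32> zeroinitializer
  %mask = icmp ult <4 x i32> <i32 0, i32 1, i32 2, i32 3>, %r
  ret <4 x i1> %mask
}
```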
To do this, I had to change the specified semantics of the intrinsic slightly. Specifically, I need the assumption that the trip count is unsigned greater than or equal to the base index, in order to avoid needing a saturating subtract in the scalar domain. (I tried using the saturating subtract, and the practical codegen results were poor.) As far as I can tell, the revised semantics are consistent with actual usage.
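For illustration, here's why the plain subtract needs that assumption (the concrete values are made up):

```llvm
; With i32 operands, %base = 10 and %n = 8 (precondition violated):
;   %rem = sub i32 8, 10             ; wraps to 0xFFFFFFFE
;   icmp ult <0,1,2,3>, splat(%rem)  ; all-true: every lane active
; The correct mask here is all-false. A usub.sat would clamp %rem to 0
; instead, but that's exactly the extra scalar work this patch avoids.
```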
As an aside, I am wondering if we need this intrinsic at all. The lowering chosen here could be emitted by the vectorizer directly, and the AArch64 whilelo pattern match for the EVL form would seem straightforward. Maybe we'd have trouble folding the SUB back in, but has anyone played with this?
A potential issue I see with this change is that it doesn't play well with unrolling in the vectorizer. For example, if each iteration of the vector loop handles 8 elements at a time with a vector width of 4, you get two calls to llvm.get.active.lane.mask, I think, and the second call's base operand can exceed the trip count on the final iteration, which would violate the revised uge assumption. A sketch of what I believe gets emitted follows.
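Concretely, for a loop with trip count %n, I believe the unrolled-by-2 vector body at VF=4 would look something like this (a sketch; the value names are invented):

```llvm
; Two masks per iteration, the second base offset by the vector width.
%mask0 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %i, i32 %n)
%i.4   = add i32 %i, 4
%mask1 = call <4 x i1> @llvm.get.active.lane.mask.v4i1.i32(i32 %i.4, i32 %n)
; On the final iteration %i.4 can be > %n, in which case %mask1 must be
; all-false; under the revised semantics that second call is out of contract.
```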