This is an archive of the discontinued LLVM Phabricator instance.

Redefine get.active.lane.mask to allow a more scalar lowering
AbandonedPublic

Authored by reames on Jul 11 2022, 10:12 AM.

Details

Summary

The meat of this patch is a change to the lowering of llvm.get.active.lane.mask: first compute the number of remaining lanes which need to be active in the scalar domain, then use a single vector comparison. This replaces the existing lowering, which uses a vector saturating add followed by a comparison. (For discussion, I'm ignoring the splats, since they generally get folded into using instructions.) This results in a significant codegen improvement for RISCV, and while I'm not an expert in AArch64, the result appears profitable there as well; confirmation desired.

To do this, I had to change the specified semantics of the intrinsic slightly. Specifically, I need the assumption that the trip count is unsigned greater than or equal to the base index, to avoid needing a saturating subtract in the scalar domain. (I tried using a saturating subtract, and the practical codegen results were poor.) As far as I can tell, the revised semantics are consistent with actual usage.
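To make the two lowerings concrete, here is a small Python model of the semantics (a hypothetical illustration, not the actual SDAG code; the vector width and the 32-bit element width are arbitrary choices for the sketch):

```python
VF = 4                    # hypothetical vector width
BITS = 32                 # hypothetical element width
LIMIT = (1 << BITS) - 1   # max unsigned value

def sat_uadd(a, b):
    """Unsigned saturating add, as used by the old vector-domain lowering."""
    return min(a + b, LIMIT)

def old_lowering(base, tc):
    # icmp ult (uadd.sat (splat base), step_vector), (splat tc)
    return [sat_uadd(base, i) < tc for i in range(VF)]

def new_lowering(base, tc):
    # Relies on the revised precondition tc >= base (unsigned),
    # so the plain scalar subtract cannot wrap.
    remaining = (tc - base) & LIMIT
    # icmp ult step_vector, (splat remaining)
    return [i < remaining for i in range(VF)]

# Under the revised precondition the two lowerings agree:
for base, tc in [(0, 10), (8, 10), (10, 10)]:
    assert old_lowering(base, tc) == new_lowering(base, tc)
```

The point of the rewrite is that the saturating add and the splatted trip count both move out of the vector domain, leaving one scalar subtract plus one vector compare.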

As an aside, I am wondering if we need this intrinsic at all. The lowering chosen here could be used by the vectorizer directly, and the AArch64 whilelo pattern match for the EVL form would seem straightforward. Maybe we'd have trouble folding the SUB back in, but has anyone played with this?

Diff Detail

Event Timeline

reames created this revision.Jul 11 2022, 10:12 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 11 2022, 10:12 AM
reames requested review of this revision.Jul 11 2022, 10:12 AM

As an aside, I am wondering if we need this intrinsic at all. The lowering chosen here could be used by the vectorizer directly, and the AArch64 whilelo pattern match for the EVL form would seem straightforward. Maybe we'd have trouble folding the SUB back in, but has anyone played with this?

The intrinsic was originally designed to be pattern-matched by MVE loop optimizations (llvm/lib/Target/ARM/MVETailPredication.cpp). I think there were issues with making the pattern-matching work reliably without the intrinsic.

llvm/docs/LangRef.rst
19962

A potential issue I see with this change is that it doesn't play well with unrolling in the vectorizer. For example, if each iteration of a loop handles 8 elements at a time with vector width 4, you get two calls to llvm.get.active.lane.mask, I think.
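For reference, a Python sketch of the unrolled case using the intrinsic's documented lane-by-lane semantics (the helper name, VF, and the concrete base/trip-count values are hypothetical choices for illustration):

```python
VF = 4  # hypothetical vector width

def lane_mask(base, tc):
    """Reference model of llvm.get.active.lane.mask: lane i is active
    iff base + i < tc, compared as unsigned integers."""
    return [base + i < tc for i in range(VF)]

# One interleaved (UF=2) iteration covers 8 elements, so the
# vectorizer emits two calls with bases VF apart:
base, tc = 8, 10
mask0 = lane_mask(base, tc)       # elements 8..11
mask1 = lane_mask(base + VF, tc)  # elements 12..15; base 12 > tc 10

# With tail folding, the second copy's base can validly exceed the
# trip count, conflicting with a "tc >= base" precondition.
```

Under the documented semantics the second mask is simply all-false; the tension is whether a lowering that assumes `tc >= base` can still produce that result.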

As an aside, I am wondering if we need this intrinsic at all. The lowering chosen here could be used by the vectorizer directly, and the AArch64 whilelo pattern match for the EVL form would seem straightforward. Maybe we'd have trouble folding the SUB back in, but has anyone played with this?

The intrinsic was originally designed to be pattern-matched by MVE loop optimizations (llvm/lib/Target/ARM/MVETailPredication.cpp). I think there were issues with making the pattern-matching work reliably without the intrinsic.

Yes, exactly, just wanted to confirm this. Pattern matching without the intrinsic was our first approach. That worked well for simple examples, but became a mess and very fragile pretty quickly. The intrinsic was the best way to communicate this from middle end to backend.

I haven't looked into details yet of the proposed change in semantics, but before I do that, my question is whether more efficient lowering is important at all (and thus if we want to put the effort in). Because the way I look at it is this: as soon as get.active.lane.mask is lowered by SelectionDAG something is "wrong". That is, it isn't picked up by a backend pass and lowered to a target intrinsic/instruction, or the intrinsic shouldn't have been emitted in the first place. The SelectionDAG lowering has always been a safety net just in case it wasn't lowered or missed. At least that was the idea when we added this. Not sure if things have changed or if there's another use-case.

I haven't looked into details yet of the proposed change in semantics, but before I do that, my question is whether more efficient lowering is important at all (and thus if we want to put the effort in). Because the way I look at it is this: as soon as get.active.lane.mask is lowered by SelectionDAG something is "wrong". That is, it isn't picked up by a backend pass and lowered to a target intrinsic/instruction, or the intrinsic shouldn't have been emitted in the first place. The SelectionDAG lowering has always been a safety net just in case it wasn't lowered or missed. At least that was the idea when we added this. Not sure if things have changed or if there's another use-case.

I recently switched RISCV over to using get.active.lane.mask as the IR representation, and letting the generic SDAG do the lowering. See https://reviews.llvm.org/D129221. This was a preparation step for matching get.active.lane.mask during ISEL lowering to convert some masks into VL predication.

I could reverse course here and make this change in the vectorizer, and then do pattern matching on the new compare in ISEL. Doing the pattern match for the simple pattern (icmp ult step_vector, N) wouldn't be particularly hard. Do you think that's a better overall approach?

llvm/docs/LangRef.rst
19962

Can you expand on why this would be a problem? I would expect to find the base incremented by the VF for the second unrolled copy, but that still has to be less than TC.

Hm, I'm assuming a loop with a scalar epilogue. Is there maybe a problem here for tail folding? I hadn't considered that, let me think a bit and get back to you.

efriedma added inline comments.Jul 12 2022, 10:49 AM
llvm/docs/LangRef.rst
19962

I was thinking of something along the lines of tail folding, yes. Or scenarios with mixed element types.

This is just from the general perspective of what constructs are useful for vectorization, though. I'm not deeply familiar with the exact loop forms LoopVectorizer generates at the moment.

reames abandoned this revision.Jul 12 2022, 5:19 PM

As pointed out by @efriedma, this change is unsound. The proposed lowering works for the first unrolled copy of a vectorized loop, but in a tail-folded loop the second unrolled copy can validly have a starting IV value greater than the trip count. The expected result in that case is a zero vector; this change lost that.
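To spell out the failure mode, here is a small Python model (hypothetical; 32-bit arithmetic, VF=4, and the concrete values are chosen only to illustrate the wrap):

```python
VF = 4
BITS = 32
LIMIT = (1 << BITS) - 1

def correct_mask(base, tc):
    # Reference semantics: lane i active iff base + i < tc (unsigned).
    return [(base + i) & LIMIT < tc for i in range(VF)]

def proposed_lowering(base, tc):
    # The abandoned lowering: a plain (non-saturating) scalar subtract
    # to count remaining lanes, then one vector compare.
    remaining = (tc - base) & LIMIT
    return [i < remaining for i in range(VF)]

# Second unrolled copy of a tail-folded loop: base already past tc.
base, tc = 12, 10
assert correct_mask(base, tc) == [False] * VF   # expected: zero vector
# The subtract wraps to a huge value, so every lane comes back active:
assert proposed_lowering(base, tc) == [True] * VF
```

Once `base > tc`, the subtract wraps around and the compare against the step vector succeeds for every lane, which is exactly the opposite of the required zero vector.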

I'm going to explore other options for achieving this lowering for the first unrolled copy while remaining correct for all unrolled copies, but that's probably going to require knowledge from the vectorizer. Regardless, doing it here is wrong, since we don't have enough information to tell which unrolled copy we're dealing with.