This is an archive of the discontinued LLVM Phabricator instance.

[RISCV] Improve insert_vector_elt for fixed mask registers.
Needs Revision · Public

Authored by jacquesguan on Feb 7 2022, 1:39 AM.

Details

Summary

Currently the backend promotes a mask vector to an i8 vector and inserts the element into that. Instead, we can bitcast to a wider-element vector, extract the containing element into a GPR, use scalar instructions to set or clear the target bit, and insert the element back into the wider-element vector.
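In scalar terms, the proposed lowering amounts to the following C sketch. The function and names here are illustrative only, not the actual SelectionDAG code in the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: view a fixed mask vector (e.g. v16i1) as a vector of i8
 * wide elements, pull the byte containing the target bit into a
 * "GPR", set or clear that bit with scalar ops, and write it back. */
static void insert_mask_bit(uint8_t *mask_bytes, unsigned idx, unsigned elt) {
    unsigned byte = idx / 8;      /* which wide element holds the bit */
    unsigned bit  = idx % 8;      /* bit position within that element */
    uint8_t w = mask_bytes[byte]; /* extract wide element to scalar   */
    w = (uint8_t)((w & ~(1u << bit)) | ((elt & 1u) << bit));
    mask_bytes[byte] = w;         /* insert the wide element back     */
}
```

This naive form clears the old bit with a mask before or-ing in the new one; the review discussion below arrives at a shorter xor-based variant.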

Diff Detail

Event Timeline

jacquesguan created this revision. Feb 7 2022, 1:39 AM
jacquesguan requested review of this revision. Feb 7 2022, 1:39 AM

Improve code.

Improve code.

craig.topper added inline comments. Feb 7 2022, 5:42 PM
llvm/test/CodeGen/RISCV/rvv/fixed-vectors-insert-i1.ll
109–117

The upper xlen-1 bits of a0 have an unknown value; the value of %elt is only in the lowest bit. You need to mask off the other bits. A mask would be unnecessary only if the i1 were passed with the zeroext attribute.

Address comment.

llvm/test/CodeGen/RISCV/rvv/fixed-vectors-insert-i1.ll
109–117

Done, thanks.

I may have an alternate suggestion for the bit manipulation

%old_bit = srl %wideelt, %bitindex  // shift old_bit down to bit 0
%b = xor %old_bit, %new_bit         // xor the old and new values together
%c = andi %b, 1                     // clear all bits except bit 0; this drops the extra bits from the shift and any extra bits that came with %new_bit
%shl = shl %c, %bitindex            // shift the xored value back to the correct position
%xor = xor %wideelt, %shl           // at bitindex this is old_bit^(old_bit^new_bit), which cancels to new_bit; every other bit xors with 0

This might be a shorter sequence, but I'm not sure.
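In C, the suggested xor sequence corresponds to something like this (the function name is made up for illustration):

```c
#include <assert.h>
#include <stdint.h>

/* Set bit `idx` of `w` to the low bit of `b` without a separate
 * clear-then-set: xor the old and new bit values, keep only bit 0,
 * shift back, and xor into place. At position idx the result is
 * old ^ (old ^ new) == new; every other position xors with 0. */
static uint64_t xor_insert_bit(uint64_t w, unsigned idx, uint64_t b) {
    uint64_t old = w >> idx; /* srl: old bit now at position 0        */
    uint64_t x   = old ^ b;  /* xor old and new values                */
    uint64_t c   = x & 1;    /* andi: drop shift and new-value junk   */
    return w ^ (c << idx);   /* shl + xor back into the wide element  */
}
```

Note the andi also handles an unmasked incoming bool, addressing the earlier inline comment about the upper bits of a0.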

refactor the bit manipulation

I may have an alternate suggestion for the bit manipulation

%old_bit = srl %wideelt, %bitindex  // shift old_bit down to bit 0
%b = xor %old_bit, %new_bit         // xor the old and new values together
%c = andi %b, 1                     // clear all bits except bit 0; this drops the extra bits from the shift and any extra bits that came with %new_bit
%shl = shl %c, %bitindex            // shift the xored value back to the correct position
%xor = xor %wideelt, %shl           // at bitindex this is old_bit^(old_bit^new_bit), which cancels to new_bit; every other bit xors with 0

This might be a shorter sequence, but I'm not sure.

Great idea, thanks.

Herald added a project: Restricted Project. Mar 14 2022, 12:39 AM

I'm sorry it took me so long to get round to this! I think it LGTM now other than questions/suggestions I've left. @craig.topper are you happy with it?

llvm/lib/Target/RISCV/RISCVISelLowering.cpp
4411

It's a bit of a shame we have to indent twice. We could do if (VecVT.isFixedLengthVector() && VecVT.getVectorNumElements() >= 8), which might be clearer for readers, even if we have to recompute NumElts later. Just a style suggestion though.

4413

I think isPowerOf2_32(NumElts) would be most correct here. I realise there's an assertion in one case below, but I'm worried that someone experimenting with differently-sized vectors might not hit it and get incorrect code.

Something worth thinking about is that this may not be a good sequence for a core with decoupled scalar and vector units. This introduces a copy from vector to scalar which prevents the scalar core from running ahead of the vector unit in such a design.

reames requested changes to this revision. Sep 14 2022, 11:29 AM
reames added a subscriber: reames.

I don't think we should land this patch. It involves moving a value from vector to scalar and then back again. The vector to scalar domain crossing is likely to be expensive on at least some real hardware.

We can probably do better by forming the bitmask on the vector side entirely. Given an index in a scalar register, we can produce a single bit mask by comparing a vid vector against that index. With that, we can construct a vmerge with a scalar operand to set or clear the desired bit in the mask.

This revision now requires changes to proceed. Sep 14 2022, 11:29 AM

I don't think we should land this patch. It involves moving a value from vector to scalar and then back again. The vector to scalar domain crossing is likely to be expensive on at least some real hardware.

We can probably do better by forming the bitmask on the vector side entirely. Given an index in a scalar register, we can produce a single bit mask by comparing a vid vector against that index. With that, we can construct a vmerge with a scalar operand to set or clear the desired bit in the mask.

I don't quite understand your last sentence. I don't think vmerge can be used on mask types, so how would we merge only the desired bit?

I don't think we should land this patch. It involves moving a value from vector to scalar and then back again. The vector to scalar domain crossing is likely to be expensive on at least some real hardware.

We can probably do better by forming the bitmask on the vector side entirely. Given an index in a scalar register, we can produce a single bit mask by comparing a vid vector against that index. With that, we can construct a vmerge with a scalar operand to set or clear the desired bit in the mask.

I don't quite understand your last sentence. I don't think vmerge can be used on mask types, so how would we merge only the desired bit?

You're right on the vmerge point. I made this same mistake in another review, caught it, and then didn't realize I'd already posted this comment. Sorry for the confusion.

However, you can replace the vmerge in the above with vmandn and a vmor.

So, code sequence looks something like:

v2 = incoming mask
x1 = incoming index
x2 = incoming bool (elt)
v1 = vid
v0 = vseteq v1, <idx>
v2 = vmandn v2, v0 // original mask with lane cleared
v3 = vmv.v.x x2
v3 = vseteq v3, 1
v3 = vmand v3, v0
v2 = vmor v2, v3

OR

x1 = incoming index
x2 = incoming bool (elt)
v2 = incoming mask
v1 = vid
v0 = vseteq v1, <idx>
v2 = vmandn v2, v0 // original mask with lane cleared
bnez x2, skip
v0 = vxor v0, v0 // elt is 0, so there is nothing to set
skip:
v2 = vmor v2, v0

I'm not claiming this is optimal codegen; there may be better. This is just what occurs to me with a bit of thought.