This is an archive of the discontinued LLVM Phabricator instance.

Fixed store operation for a vector of i1.
Abandoned · Public

Authored by delena on Dec 4 2016, 10:59 AM.

Details

Summary

This is not target specific. The store of <4 x i1> was implemented incorrectly; it should write into 4 bytes.

Diff Detail

Repository
rL LLVM

Event Timeline

delena updated this revision to Diff 80210. Dec 4 2016, 10:59 AM
delena retitled this revision to "Fixed store operation for a vector of i1".
delena updated this object.
delena added reviewers: spatel, RKSimon, igorb.
delena set the repository for this revision to rL LLVM.
delena added a subscriber: llvm-commits.
craig.topper added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp:3298

Should this just be

unsigned Stride = (MemSclVT.getSizeInBits() + 7) / 8;

So that it always rounds up to the next byte for any size that isn't divisible by 8?
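
To make the failure mode concrete: for i1, getSizeInBits() is 1, so the existing division truncates to a stride of 0, while the round-up yields 1. A minimal self-contained sketch of the arithmetic (variable names are illustrative):

#include <cassert>

int main() {
  unsigned Bits = 1;                  // MemSclVT.getSizeInBits() for i1
  unsigned Buggy = Bits / 8;          // == 0: all lanes store to one address
  unsigned Rounded = (Bits + 7) / 8;  // == 1: each lane gets its own byte
  assert(Buggy == 0 && Rounded == 1);
  return 0;
}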

delena added inline comments. Dec 4 2016, 11:28 AM
lib/CodeGen/SelectionDAG/TargetLowering.cpp:3298

Maybe. But it will never be called with i9 or i541; the type size is a power of 2.

majnemer added inline comments.
lib/CodeGen/SelectionDAG/TargetLowering.cpp:3298

How about unsigned Stride = MemSclVT.getStoreSize()? That would make it very obvious.
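
For reference, EVT::getStoreSize() rounds the type's bit width up to whole bytes, so it computes the same stride as the manual expression above. A trivial stand-alone model of that rounding (not LLVM's actual source):

unsigned storeSizeInBytes(unsigned Bits) {
  // matches (getSizeInBits() + 7) / 8: bits rounded up to whole bytes
  return (Bits + 7) / 8;
}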

> This is not target specific. The store of <4 x i1> was implemented incorrectly; it should write into 4 bytes.

Is that what we decided we wanted? I thought we decided that we wanted i1 vectors to be bit-packed?

> Is that what we decided we wanted? I thought we decided that we wanted i1 vectors to be bit-packed?

I remember. First of all, the current implementation scalarizes the store and writes all the bits to the same address, because Stride = 0.
We have the same bug, btw, in the scalarized load; I have not yet found a way to reproduce it.
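
A simplified stand-alone model of the scalarized store loop shows why Stride = 0 loses data (this is an illustration with hypothetical names, not the actual SelectionDAG code):

#include <cstdint>
#include <cstring>

// Each lane is written at BasePtr + i * Stride; with Stride == 0 every
// lane overwrites the previous one and only the last value survives.
void scalarizedStore(uint8_t *BasePtr, const uint8_t *Lanes,
                     unsigned NumElts, unsigned Stride) {
  for (unsigned i = 0; i != NumElts; ++i)
    std::memcpy(BasePtr + i * Stride, &Lanes[i], 1);
}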

The second issue is the "masked store" of a vector of i1. Theoretically, if we do not support a masked operation, it should be scalarized.
If we store a vector of bits, we would have to do a load-modify-store. Is that operation even allowed in terms of multithreading?
Masked and unmasked stores should be compatible.
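
To spell out the multithreading concern: emulating a masked store of bit-packed lanes requires a read-modify-write of the containing byte, and that sequence is not atomic. A hypothetical byte-level emulation (not LLVM code):

#include <cstdint>

void maskedStorePackedBits(uint8_t *P, uint8_t Val, uint8_t Mask) {
  uint8_t Old = *P;                            // load
  uint8_t New = (Old & ~Mask) | (Val & Mask);  // keep unselected lanes
  *P = New;                                    // store: a concurrent write to
                                               // *P between the load and this
                                               // store is silently lost
}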

As for AVX-512 mask registers, spill/fill should be done at bit granularity, using KMOV.
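
For the AVX-512 side, bit-granular spill/fill of a mask register can be written with the KMOV intrinsics; a sketch assuming an AVX-512F target, where _store_mask16 and _load_mask16 lower to KMOVW:

#include <immintrin.h>

void spillMask(__mmask16 *Slot, __mmask16 M) { _store_mask16(Slot, M); }
__mmask16 fillMask(__mmask16 *Slot) { return _load_mask16(Slot); }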

RKSimon edited edge metadata. Dec 5 2016, 5:23 AM

Weird coincidence: I created PR31265 today, as I've been having issues trying to make use of the PMOVMSKB/MOVMSKPD/MOVMSKPS instructions to handle horizontal reductions of <X x i1> vector comparison results.

All "store", "load", "masked.store" "masked.load" should be compatible between them. At the beginning I thought that the operation should have the same behavior across all X86 targets. But now I see that the code is not target specific and the decision should be common for all targets. I also see that there are a lot of bugs there and the operations are not really tested.
In my opinion, the better and easier approach would be storing i1 vectors in bytes. I'll also fix AVX-512 and add tests.
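
A stand-alone sketch of the proposed byte-per-element layout (an illustration of the proposal, not committed behavior): a <4 x i1> store writes 4 bytes, one 0/1 byte per lane, so masked and unmasked stores can write whole bytes and stay compatible without read-modify-write.

#include <cstdint>

void storeV4I1AsBytes(uint8_t *P, const bool Val[4]) {
  for (unsigned i = 0; i != 4; ++i)
    P[i] = Val[i] ? 1 : 0;  // one byte per i1 lane
}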

If you're going to put padding into <4 x i1> vectors, you're going to have to change the IR to match; getStoreSizeInBits() is currently 4 for <4 x i1>.

If we wanted bit-packed vectors, we could just ban masked.store for vectors whose elements aren't byte-aligned.

Minor correction, I meant to say that VectorType::getBitWidth()/DataLayout::getTypeSizeInBits() for <4 x i1> is 4.

> If you're going to put padding into <4 x i1> vectors, you're going to have to change the IR to match; getStoreSizeInBits() is currently 4 for <4 x i1>.

I agree; I'll need to take care of many aspects. I'll probably need to fix the lowering of "bitcast <4 x i1> to i4".
getStoreSizeInBits() will then return 32, while getSizeInBits() will return 4. But we have the same problem in the bit-packed variant.

> If we wanted bit-packed vectors, we could just ban masked.store for vectors whose elements aren't byte-aligned.

We can ban masked.store for i1 vectors, but I'm afraid that scalarization of the regular non-masked store will then become impossible or will require a load-modify-store sequence.
That's why I'd prefer to write bytes instead of bits.

You don't need load-modify-store for an unmasked store; we're allowed to clobber the padding bits, so you can do something like this:

define void @store_vec_i1(<2 x i1>* %p, <2 x i1> %val) {
  ; the following is equivalent to `store <2 x i1> %val, <2 x i1>* %p`
  %a0 = extractelement <2 x i1> %val, i32 0
  %r0 = zext i1 %a0 to i8
  %a1 = extractelement <2 x i1> %val, i32 1
  %b1 = zext i1 %a1 to i8
  %c1 = shl i8 %b1, 1
  %r1 = or i8 %r0, %c1
  %p1 = bitcast <2 x i1>* %p to i8*
  store i8 %r1, i8* %p1
  ret void
}
bjope added a subscriber: bjope. Dec 5 2016, 2:26 PM
delena abandoned this revision. Dec 6 2016, 1:15 AM

> You don't need load-modify-store for an unmasked store; we're allowed to clobber the padding bits, so you can do something like this:

I said that we need load-modify-store for the masked version (or to ban it altogether; I don't really like that idea).

I'm abandoning this review, since we need to make a high-level decision about the form of the store.
I'll send an RFC to the dev list later.