It is not target specific. Store of <4 x i1 > was implemented incorrectly, it should write into 4 bytes.
Diff Detail
- Repository
- rL LLVM
Event Timeline
| lib/CodeGen/SelectionDAG/TargetLowering.cpp | ||
|---|---|---|
| 3298 | Should this just be unsigned Stride = (MemSclVT.getSizeInBits() + 7) / 8; So that it always rounds up to the next byte for any size that isn't divisible by 8? | |
| lib/CodeGen/SelectionDAG/TargetLowering.cpp | ||
|---|---|---|
| 3298 | May be. But it will never be called with i9 or i541. It is power of 2. | |
| lib/CodeGen/SelectionDAG/TargetLowering.cpp | ||
|---|---|---|
| 3298 | How about unsigned Stride = MemSvlVT.getStoreSize() ? That would make it very obvious. | |
It is not target specific. Store of <4 x i1 > was implemented incorrectly, it should write into 4 bytes.
Is that what we decided we wanted? I thought that we decided that we wanted i1 vectors to be bit packed?
I remember. First of all, the current implementation scalarizes store and just writes all bits to the same place, because Stride=0.
The same bug, btw, we have in scalarized load. I did not find how to reproduce it yet.
The second issue is the "masked store" of a vector of i1. Theoretically, if we do not support a masked operation, it should be scalarized.
 If we store a vector of bits, we should do load-modify-store. Is this operation allowed in terms of multithreading?
Masked and unmasked store should be compatible.
As far as AVX-512 mask registers, spill-fill should be done in bits, using KMOV.
Weird coincidence - I created PR31265 today as I've been having issues with trying to make use of the PMOVMSKB/MOVMSKPD/MOVMSKPS instructions to handle horizontal reductions of <X x i1> vector comparison results.
All "store", "load", "masked.store" "masked.load" should be compatible between them. At the beginning I thought that the operation should have the same behavior across all X86 targets. But now I see that the code is not target specific and the decision should be common for all targets. I also see that there are a lot of bugs there and the operations are not really tested.
In my opinion, the better and easier approach would be  storing i1 vectors in bytes. I'll also fix AVX-512 and add tests.
If you're going to put padding into <4 x i1> vectors, you're going to have to change the IR to match; getStoreSizeInBits() is currently 4 for <4 x i1>.
If we wanted bit-packed vectors, we could just ban masked.store for vectors whose elements aren't byte-aligned.
Minor correction, I meant to say that VectorType::getBitWidth()/DataLayout::getTypeSizeInBits() for <4 x i1> is 4.
I agree, I'll need to take care for many aspects. I'll, probably, need to fix lowering of "bitcast <4 x i1> to i4".
The getStoreSizeInBits() will return 32. getSizeInBits() will return 4. But we have the same problem in the bit-packed variant.
If we wanted bit-packed vectors, we could just ban masked.store for vectors whose elements aren't byte-aligned.
We can ban masked.store for i1 vectors, but I'm afraid that scalarization for the regular non-masked "store" will become impossible or will require load-modify-store sequence.
That's why I'd prefer to write bytes instead of bits.
You don't need load-modify-store for an unmasked store; we're allowed to clobber the padding bits, so you can do something like this:
define void @store_vec_i1(<2 x i1>* %p, <2 x i1> %val) {
  ; following is equivalent to `store <2 x i1> %val, <2 x i1>* %p`
  %a0 = extractelement <2 x i1> val, i32 0
  %r0 = zext i1 %v1 to i8
  %a1 = extractelement <2 x i1> val, i32 1
  %b1 = zext i1 %v1 to i8
  %c1 = shl i8 %b1, 1
  %r1 = or i1 %r0, %c1
  %p1 = bitcast <2 x i1>* %p to i8*
  store i8 %r1, i8* %p1
}You don't need load-modify-store for an unmasked store; we're allowed to clobber the padding bits, so you can do something like this:
I said that we need load-modify-store for the masked version. (or to ban it at all, I don't really like this idea)
I'm abandoning this review, since we need to make high-level decision about the form of the store.
I'll send RFC later to the dev list.
Should this just be
unsigned Stride = (MemSclVT.getSizeInBits() + 7) / 8;
So that it always rounds up to the next byte for any size that isn't divisible by 8?