This is not target specific. The store of <4 x i1> was implemented incorrectly; it should write into 4 bytes.
Event Timeline
lib/CodeGen/SelectionDAG/TargetLowering.cpp:3298
Should this just be `unsigned Stride = (MemSclVT.getSizeInBits() + 7) / 8;` so that it always rounds up to the next byte for any size that isn't divisible by 8?
lib/CodeGen/SelectionDAG/TargetLowering.cpp:3298
Maybe. But it will never be called with i9 or i541; the size is always a power of 2.
lib/CodeGen/SelectionDAG/TargetLowering.cpp:3298
How about `unsigned Stride = MemSclVT.getStoreSize();`? That would make it very obvious.
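(Editorial aside: a minimal standalone sketch, not LLVM code, showing that the two proposals compute the same thing, since getStoreSize() rounds the bit width up to whole bytes. The helper name is hypothetical.)

```cpp
#include <cassert>

// Hypothetical helper, not LLVM API: the round-up-to-whole-bytes
// computation that both `(SizeInBits + 7) / 8` and getStoreSize() perform.
unsigned roundUpToBytes(unsigned SizeInBits) {
  return (SizeInBits + 7) / 8;
}

int main() {
  assert(roundUpToBytes(1) == 1);  // i1  -> 1 byte, so Stride is never 0
  assert(roundUpToBytes(8) == 1);  // i8  -> 1 byte
  assert(roundUpToBytes(9) == 2);  // i9  -> rounds up to 2 bytes
  assert(roundUpToBytes(32) == 4); // i32 -> 4 bytes
}
```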
> This is not target specific. The store of <4 x i1> was implemented incorrectly; it should write into 4 bytes.

Is that what we decided we wanted? I thought we had decided that we wanted i1 vectors to be bit-packed?
I remember. First of all, the current implementation scalarizes the store and just writes all the bits to the same place, because Stride = 0.
We have the same bug, btw, in the scalarized load; I have not found a way to reproduce it yet.
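(Editorial aside: a hedged sketch of the Stride = 0 failure mode described above; the names are illustrative and this is not the actual TargetLowering code.)

```cpp
#include <cstdio>

// Illustrative only: a scalarized vector store writes element i at
// BasePtr + Stride * i. With element type i1, a truncating
// SizeInBits / 8 gives Stride = 0, so all four lanes land on the
// same byte and only the last write survives.
int main() {
  unsigned SizeInBits = 1;               // element type i1
  unsigned Stride = SizeInBits / 8;      // buggy: truncates to 0
  unsigned char Mem[4] = {0, 0, 0, 0};
  unsigned char Lanes[4] = {1, 0, 1, 1}; // a <4 x i1> value
  for (unsigned i = 0; i != 4; ++i)
    Mem[Stride * i] = Lanes[i];          // every store hits Mem[0]
  printf("%u %u %u %u\n", Mem[0], Mem[1], Mem[2], Mem[3]); // 1 0 0 0
}
```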
The second issue is the "masked store" of a vector of i1. In theory, if we do not support a masked operation, it should be scalarized.
If we store a vector of bits, we have to do a load-modify-store. Is this operation even allowed in the presence of multithreading?
Masked and unmasked stores should be compatible with each other.
As for the AVX-512 mask registers, spill and fill should be done in bits, using KMOV.
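(Editorial aside: a hypothetical C++ sketch of the load-modify-store the bit-packed masked variant would need; because masked-off lanes share a byte with stored lanes, the merge is a non-atomic read-modify-write that can race with another thread that owns those bits.)

```cpp
#include <cstdint>

// Hypothetical illustration, not LLVM code: masked store of <8 x i1>
// into a single bit-packed byte. Lanes where MaskBits is 0 must keep
// their old value, so the whole byte is loaded, merged, and stored
// back. The read-modify-write is not atomic: a concurrent writer of
// the masked-off bits can have its update lost.
void maskedStoreBits(uint8_t *P, uint8_t ValBits, uint8_t MaskBits) {
  uint8_t Old = *P;                              // load
  *P = (Old & ~MaskBits) | (ValBits & MaskBits); // modify + store
}
```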
Weird coincidence: I created PR31265 today, as I've been having issues trying to make use of the PMOVMSKB/MOVMSKPD/MOVMSKPS instructions to handle horizontal reductions of <X x i1> vector comparison results.
All "store", "load", "masked.store" "masked.load" should be compatible between them. At the beginning I thought that the operation should have the same behavior across all X86 targets. But now I see that the code is not target specific and the decision should be common for all targets. I also see that there are a lot of bugs there and the operations are not really tested.
In my opinion, the better and easier approach would be storing i1 vectors in bytes. I'll also fix AVX-512 and add tests.
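(Editorial aside: a purely illustrative sketch of the byte-per-element layout proposed here; with one byte per i1 lane, a masked store touches only its own lanes' bytes and needs no load-modify-store.)

```cpp
#include <cstdint>

// Illustrative only: <4 x i1> stored as 4 bytes, one lane per byte.
void storeVec4i1AsBytes(uint8_t *P, const bool Lanes[4], const bool Mask[4]) {
  for (unsigned i = 0; i != 4; ++i)
    if (Mask[i])
      P[i] = Lanes[i] ? 1 : 0; // masked-off lanes' bytes are untouched
}
```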
If you're going to put padding into <4 x i1> vectors, you're going to have to change the IR to match; getStoreSizeInBits() is currently 4 for <4 x i1>.
If we wanted bit-packed vectors, we could just ban masked.store for vectors whose elements aren't byte-aligned.
Minor correction: I meant to say that VectorType::getBitWidth()/DataLayout::getTypeSizeInBits() for <4 x i1> is 4.
I agree; I'll need to take care of many aspects. I'll probably also need to fix the lowering of "bitcast <4 x i1> to i4".
With padding, getStoreSizeInBits() will return 32 while getSizeInBits() will return 4. But we have the same problem in the bit-packed variant.
> If we wanted bit-packed vectors, we could just ban masked.store for vectors whose elements aren't byte-aligned.
We can ban masked.store for i1 vectors, but I'm afraid that scalarization of the regular non-masked "store" will become impossible, or will require a load-modify-store sequence.
That's why I'd prefer to write bytes instead of bits.
You don't need load-modify-store for an unmasked store; we're allowed to clobber the padding bits, so you can do something like this:
```llvm
define void @store_vec_i1(<2 x i1>* %p, <2 x i1> %val) {
  ; the following is equivalent to `store <2 x i1> %val, <2 x i1>* %p`
  %a0 = extractelement <2 x i1> %val, i32 0
  %r0 = zext i1 %a0 to i8
  %a1 = extractelement <2 x i1> %val, i32 1
  %b1 = zext i1 %a1 to i8
  %c1 = shl i8 %b1, 1
  %r1 = or i8 %r0, %c1
  %p1 = bitcast <2 x i1>* %p to i8*
  store i8 %r1, i8* %p1
  ret void
}
```
> You don't need load-modify-store for an unmasked store; we're allowed to clobber the padding bits, so you can do something like this:
I said that we need load-modify-store for the masked version (or to ban it altogether, and I don't really like that idea).
I'm abandoning this review, since we need to make a high-level decision about the form of the store.
I'll send an RFC to the dev list later.