X86Interleave should not try to handle cases where the wide-load size is less than the factor * num_elements.
lib/Target/X86/X86InterleavedAccess.cpp:106

It is not just a question of profitability. If `lowerIntoOptimizedSequence` were called for a load instruction that is too small, it would make an incorrect transformation, because the `decompose` function would generate `Factor` loads, each of size `ShuffleVecSize`.

I would also recommend a slightly different fix. Rather than checking the expected shuffle size for both the load and the store, I would check the "expected wide vector size": for loads, that means checking the type of the load; for stores, it means checking the type of the shuffle.
lib/Target/X86/X86InterleavedAccess.cpp:106

You could make the ShuffleVecTy check more generic by testing this, which is similar to what you had before adding interleaved store support:

  if (WideInstSize != DL.getTypeSizeInBits(ShuffleVecType) * Factor)
    return false;

I get your point about the profitability. I think your comment in the test adequately captures that.
lib/Target/X86/X86InterleavedAccess.cpp:106

I cannot write it the way you suggested; it would be incorrect. We are checking the ShuffleVecType, so it can't be multiplied by Factor. For a store, it would check against 1024 * 4.
lib/Target/X86/X86InterleavedAccess.cpp:106

But we can make it semi-generic by hardcoding the number of elements.
lib/Target/X86/X86InterleavedAccess.cpp:106

But the code in question is guarded by `isa<LoadInst>(Inst)`. At any rate, what you have is fine if you change 256 to `ShuffleElemSize * SupportedNumElem`.
lib/Target/X86/X86InterleavedAccess.cpp:106

> if (WideInstSize != DL.getTypeSizeInBits(ShuffleVecType) * Factor)
>   return false;
>
> But the code in question is guarded by isa<LoadInst>(Inst).

When it's a store, it will compare against 1024 * 4.

> At any rate, what you have is fine if you change 256 to ShuffleElemSize * SupportedNumElem.

I like the way it is now, since all the checks are confined in one place, which keeps it clean. But I don't mind changing it if you prefer the other way.
lib/Target/X86/X86InterleavedAccess.cpp:106

Regarding 256, I take it back. You are right; we don't need the element size check.