This is an archive of the discontinued LLVM Phabricator instance.

[VectorCombine] allow vector loads with mismatched insert type
ClosedPublic

Authored by spatel on Aug 18 2020, 11:03 AM.

Details

Reviewers

RKSimon
xbolva00
lebedev.ri
craig.topper
nikic

Commits

rG8fb055932c08: [VectorCombine] allow vector loads with mismatched insert type

Summary

This is an enhancement to D81766 to allow loading the minimum target vector type into an IR vector with a different number of elements.

In one of the motivating tests from PR16739, SLP creates <2 x float> load ops mixed with <4 x float> insert ops, so we want to handle that pattern in addition to potential oversized vectors created by the vectorizers.

I'm not sure if we should try to model the cost of the identity shuffle as an insert/extract subvector since we are shuffling with undef?
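
The type relationship described in this summary can be sketched in standalone C++. This is only an illustration, not code from the patch: `planVectorLoad`, `LoadPlan`, and `Resize` are hypothetical names, and the 128-bit width used in the note below is an assumption chosen to match the `<4 x float>` and `<4 x i32>` loads in the X86 tests of this revision.

```cpp
#include <cassert>
#include <cstdint>

// How the loaded vector relates to the insertelement result type.
enum class Resize { None, Shrink, Grow };

// Hypothetical plan record. Valid is false when a whole number of scalars
// does not fit in the minimum vector register.
struct LoadPlan {
  bool Valid;
  unsigned LoadNumElts; // lanes in the minimum-width vector load
  Resize Kind;          // identity shuffle needed to reach the insert type
};

LoadPlan planVectorLoad(unsigned MinVectorBits, uint64_t ScalarBits,
                        unsigned InsertNumElts) {
  // Mirrors the divisibility check in the patch: bail out on zero sizes or
  // when the scalar size does not evenly divide the vector width.
  if (ScalarBits == 0 || MinVectorBits == 0 || MinVectorBits % ScalarBits != 0)
    return {false, 0, Resize::None};

  unsigned LoadNumElts = static_cast<unsigned>(MinVectorBits / ScalarBits);
  Resize Kind = Resize::None;
  if (InsertNumElts < LoadNumElts)
    Kind = Resize::Shrink; // e.g. <4 x float> load feeding a <2 x float> insert
  else if (InsertNumElts > LoadNumElts)
    Kind = Resize::Grow;   // e.g. <4 x i32> load feeding an <8 x i32> insert
  return {true, LoadNumElts, Kind};
}
```

With a 128-bit minimum vector and 32-bit scalars, an insert into a 2-element vector would shrink, an insert into an 8-element vector would grow, and an insert into a 4-element vector would need no shuffle.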

Event Timeline

spatel created this revision. Aug 18 2020, 11:03 AM

Herald added a project: Restricted Project. Aug 18 2020, 11:03 AM

Herald added subscribers: hiraditya, mcrosier.

spatel requested review of this revision. Aug 18 2020, 11:03 AM

Ping.

I'm not sure if we should try to model the cost of the identity shuffle as an insert/extract subvector since we are shuffling with undef?

This came up on PR43605: we have an insert subvector cost enum, but we can't actually create shufflevectors with an insertion pattern (it requires 2 or more shuffles working together), so it's tricky to cost. I'm wondering if we should replace the insert cost enum with something else (concat/lengthen/whatever). We have a similar problem for extract subvector shuffles: we don't have anything that takes elements from one or both sources but is more than just a basic sequential mask. It comes down to what use cases we have here with the VectorCombiner as well as the vectorizers. In short, the shuffle cost enums don't really match what the IR shufflevector can do.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
150	Are we in danger of creating out-of-bounds shuffle mask indices if the dst vector type is more than 2x the original size (v2f32 -> v16f32 etc.)? I think they canonicalize to undef, but I'm not sure (and I have no access to the source tree atm).

Patch updated:
Fixed shuffle mask creation to not go out-of-bounds for greater than 2x subvector size difference.

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
150	Nice catch - yes, that would crash on creation. Adjusted one of the tests to verify that.
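
The out-of-bounds concern can be shown with a standalone sketch. It is not the committed code: it assumes plain `std::vector<int>` in place of `SmallVector`, and the names `makeResizeMask`, `isInBounds`, and `kUndefLane` are hypothetical. It relies on two facts about shufflevector: a mask index must select a lane from one of the two source operands, so it must be below twice the source element count, and LLVM represents an undefined mask lane as `-1`. A plain `std::iota` over the destination width therefore overflows once the destination is more than twice as wide as the loaded vector.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Sentinel for "don't care" lanes; LLVM shuffle masks use -1 for this.
constexpr int kUndefLane = -1;

// Hypothetical helper: build a mask that copies the first
// min(SrcNumElts, DstNumElts) lanes and leaves the remaining lanes undefined.
// The element counts are computed once, outside the loop.
std::vector<int> makeResizeMask(unsigned SrcNumElts, unsigned DstNumElts) {
  std::vector<int> Mask(DstNumElts, kUndefLane);
  unsigned NumCopied = std::min(SrcNumElts, DstNumElts);
  for (unsigned I = 0; I != NumCopied; ++I)
    Mask[I] = static_cast<int>(I);
  return Mask;
}

// True if every defined index selects a lane of one of the two
// SrcNumElts-wide shuffle operands.
bool isInBounds(const std::vector<int> &Mask, unsigned SrcNumElts) {
  return std::all_of(Mask.begin(), Mask.end(), [&](int Idx) {
    return Idx == kUndefLane ||
           (Idx >= 0 && static_cast<unsigned>(Idx) < 2 * SrcNumElts);
  });
}
```

For the v2f32 -> v16f32 case raised in the review, `makeResizeMask(2, 16)` yields `{0, 1}` followed by fourteen undefined lanes, whereas a sequential 0..15 mask would contain indices that no two-element operand pair can satisfy.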

spatel mentioned this in rG9cea682faaa0: [VectorCombine] adjust test for better coverage; NFC. Aug 26 2020, 1:52 PM

LGTM with one minor

llvm/lib/Transforms/Vectorize/VectorCombine.cpp
150	(style) Don't call Ty->getNumElements in every loop iteration. Maybe pull it out above the loop, as we call it in the SmallVector constructor as well?

This revision is now accepted and ready to land. Sep 1 2020, 4:25 AM

Closed by commit rG8fb055932c08: [VectorCombine] allow vector loads with mismatched insert type (authored by spatel). Sep 2 2020, 5:11 AM

This revision was automatically updated to reflect the committed changes.

spatel added a commit: rG8fb055932c08: [VectorCombine] allow vector loads with mismatched insert type.

MaskRay mentioned this in D87538: [VectorCombine] Don't vectorize scalar load under asan/hwasan/memtag/tsan. Sep 11 2020, 4:33 PM

I've filed an issue for a performance regression caused by this patch:

https://bugs.llvm.org/show_bug.cgi?id=47558

In D86160#2279646, @kazu wrote:

I've filed an issue for a performance regression caused by this patch:

https://bugs.llvm.org/show_bug.cgi?id=47558

This should get that example back to where it was:
rG48a23bccf373

Revision Contents

Path	Size
llvm/lib/Transforms/Vectorize/VectorCombine.cpp	17 lines
llvm/test/Transforms/VectorCombine/X86/load.ll	28 lines

Diff 286357

llvm/lib/Transforms/Vectorize/VectorCombine.cpp

[...]
#include "llvm/IR/Function.h"		#include "llvm/IR/Function.h"
#include "llvm/IR/IRBuilder.h"		#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/PatternMatch.h"		#include "llvm/IR/PatternMatch.h"
#include "llvm/InitializePasses.h"		#include "llvm/InitializePasses.h"
#include "llvm/Pass.h"		#include "llvm/Pass.h"
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"
#include "llvm/Transforms/Utils/Local.h"		#include "llvm/Transforms/Utils/Local.h"
#include "llvm/Transforms/Vectorize.h"		#include "llvm/Transforms/Vectorize.h"
		#include <numeric>

using namespace llvm;		using namespace llvm;
using namespace llvm::PatternMatch;		using namespace llvm::PatternMatch;

#define DEBUG_TYPE "vector-combine"		#define DEBUG_TYPE "vector-combine"
STATISTIC(NumVecLoad, "Number of vector loads formed");		STATISTIC(NumVecLoad, "Number of vector loads formed");
STATISTIC(NumVecCmp, "Number of vector compares formed");		STATISTIC(NumVecCmp, "Number of vector compares formed");
STATISTIC(NumVecBO, "Number of vector binops formed");		STATISTIC(NumVecBO, "Number of vector binops formed");

[...]

bool VectorCombine::vectorizeLoadInsert(Instruction &I) {
// Match insert of scalar load.		// Match insert of scalar load.
Value *Scalar;		Value *Scalar;
if (!match(&I, m_InsertElt(m_Undef(), m_Value(Scalar), m_ZeroInt())))		if (!match(&I, m_InsertElt(m_Undef(), m_Value(Scalar), m_ZeroInt())))
return false;		return false;
auto *Load = dyn_cast<LoadInst>(Scalar);		auto *Load = dyn_cast<LoadInst>(Scalar);
Type *ScalarTy = Scalar->getType();		Type *ScalarTy = Scalar->getType();
if (!Load || !Load->isSimple())		if (!Load || !Load->isSimple())
return false;		return false;
		auto *Ty = dyn_cast<FixedVectorType>(I.getType());
		if (!Ty)
		return false;

// TODO: Extend this to match GEP with constant offsets.		// TODO: Extend this to match GEP with constant offsets.
Value *PtrOp = Load->getPointerOperand()->stripPointerCasts();		Value *PtrOp = Load->getPointerOperand()->stripPointerCasts();
assert(isa<PointerType>(PtrOp->getType()) && "Expected a pointer type");		assert(isa<PointerType>(PtrOp->getType()) && "Expected a pointer type");

unsigned VectorSize = TTI.getMinVectorRegisterBitWidth();		unsigned VectorSize = TTI.getMinVectorRegisterBitWidth();
uint64_t ScalarSize = ScalarTy->getPrimitiveSizeInBits();		uint64_t ScalarSize = ScalarTy->getPrimitiveSizeInBits();
if (!ScalarSize || !VectorSize || VectorSize % ScalarSize != 0)		if (!ScalarSize || !VectorSize || VectorSize % ScalarSize != 0)
return false;		return false;

// Check safety of replacing the scalar load with a larger vector load.		// Check safety of replacing the scalar load with a larger vector load.
unsigned VecNumElts = VectorSize / ScalarSize;		unsigned VecNumElts = VectorSize / ScalarSize;
auto *VectorTy = VectorType::get(ScalarTy, VecNumElts, false);		auto *VectorTy = VectorType::get(ScalarTy, VecNumElts, false);
// TODO: Allow insert/extract subvector if the type does not match.
if (VectorTy != I.getType())
return false;
Align Alignment = Load->getAlign();		Align Alignment = Load->getAlign();
const DataLayout &DL = I.getModule()->getDataLayout();		const DataLayout &DL = I.getModule()->getDataLayout();
if (!isSafeToLoadUnconditionally(PtrOp, VectorTy, Alignment, DL, Load, &DT))		if (!isSafeToLoadUnconditionally(PtrOp, VectorTy, Alignment, DL, Load, &DT))
return false;		return false;

unsigned AS = Load->getPointerAddressSpace();		unsigned AS = Load->getPointerAddressSpace();

// Original pattern: insertelt undef, load [free casts of] ScalarPtr, 0		// Original pattern: insertelt undef, load [free casts of] ScalarPtr, 0
int OldCost = TTI.getMemoryOpCost(Instruction::Load, ScalarTy, Alignment, AS);		int OldCost = TTI.getMemoryOpCost(Instruction::Load, ScalarTy, Alignment, AS);
APInt DemandedElts = APInt::getOneBitSet(VecNumElts, 0);		APInt DemandedElts = APInt::getOneBitSet(VecNumElts, 0);
OldCost += TTI.getScalarizationOverhead(VectorTy, DemandedElts, true, false);		OldCost += TTI.getScalarizationOverhead(VectorTy, DemandedElts, true, false);

// New pattern: load VecPtr		// New pattern: load VecPtr
int NewCost = TTI.getMemoryOpCost(Instruction::Load, VectorTy, Alignment, AS);		int NewCost = TTI.getMemoryOpCost(Instruction::Load, VectorTy, Alignment, AS);

// We can aggressively convert to the vector form because the backend can		// We can aggressively convert to the vector form because the backend can
// invert this transform if it does not result in a performance win.		// invert this transform if it does not result in a performance win.
if (OldCost < NewCost)		if (OldCost < NewCost)
return false;		return false;

// It is safe and potentially profitable to load a vector directly:		// It is safe and potentially profitable to load a vector directly:
// inselt undef, load Scalar, 0 --> load VecPtr		// inselt undef, load Scalar, 0 --> load VecPtr
IRBuilder<> Builder(Load);		IRBuilder<> Builder(Load);
Value *CastedPtr = Builder.CreateBitCast(PtrOp, VectorTy->getPointerTo(AS));		Value *CastedPtr = Builder.CreateBitCast(PtrOp, VectorTy->getPointerTo(AS));
LoadInst *VecLd = Builder.CreateAlignedLoad(VectorTy, CastedPtr, Alignment);		Value *VecLd = Builder.CreateAlignedLoad(VectorTy, CastedPtr, Alignment);

		// If the insert type does not match the target's minimum vector type,
		// use an identity shuffle to shrink/grow the vector.
		if (Ty != VectorTy) {
		SmallVector<int, 16> Mask(Ty->getNumElements());
		std::iota(Mask.begin(), Mask.end(), 0);
		VecLd = Builder.CreateShuffleVector(VecLd, UndefValue::get(VectorTy), Mask);
		}
replaceValue(I, *VecLd);		replaceValue(I, *VecLd);
++NumVecLoad;		++NumVecLoad;
return true;		return true;
}		}

/// Determine which, if any, of the inputs should be replaced by a shuffle		/// Determine which, if any, of the inputs should be replaced by a shuffle
/// followed by extract from a different index.		/// followed by extract from a different index.
ExtractElementInst *VectorCombine::getShuffleExtract(		ExtractElementInst *VectorCombine::getShuffleExtract(
[...]

llvm/test/Transforms/VectorCombine/X86/load.ll

	[...]
	; CHECK-NEXT: [[R:%.*]] = insertelement <4 x float> undef, float [[S]], i32 0			; CHECK-NEXT: [[R:%.*]] = insertelement <4 x float> undef, float [[S]], i32 0
	; CHECK-NEXT: ret <4 x float> [[R]]			; CHECK-NEXT: ret <4 x float> [[R]]
	;			;
	%s = load float, float* %p, align 4			%s = load float, float* %p, align 4
	%r = insertelement <4 x float> undef, float %s, i32 0			%r = insertelement <4 x float> undef, float %s, i32 0
	ret <4 x float> %r			ret <4 x float> %r
	}			}

	; TODO: Should load v4i32.

	define <8 x i32> @load_i32_insert_v8i32(i32* align 16 dereferenceable(16) %p) {			define <8 x i32> @load_i32_insert_v8i32(i32* align 16 dereferenceable(16) %p) {
	; CHECK-LABEL: @load_i32_insert_v8i32(			; CHECK-LABEL: @load_i32_insert_v8i32(
	; CHECK-NEXT: [[S:%.*]] = load i32, i32* [[P:%.*]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = bitcast i32* [[P:%.*]] to <4 x i32>*
	; CHECK-NEXT: [[R:%.*]] = insertelement <8 x i32> undef, i32 [[S]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = load <4 x i32>, <4 x i32>* [[TMP1]], align 4
				; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x i32> [[TMP2]], <4 x i32> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
	; CHECK-NEXT: ret <8 x i32> [[R]]			; CHECK-NEXT: ret <8 x i32> [[R]]
	;			;
	%s = load i32, i32* %p, align 4			%s = load i32, i32* %p, align 4
	%r = insertelement <8 x i32> undef, i32 %s, i32 0			%r = insertelement <8 x i32> undef, i32 %s, i32 0
	ret <8 x i32> %r			ret <8 x i32> %r
	}			}

	; TODO: Should load v4i32.

	define <8 x i32> @casted_load_i32_insert_v8i32(<4 x i32>* align 4 dereferenceable(16) %p) {			define <8 x i32> @casted_load_i32_insert_v8i32(<4 x i32>* align 4 dereferenceable(16) %p) {
	; CHECK-LABEL: @casted_load_i32_insert_v8i32(			; CHECK-LABEL: @casted_load_i32_insert_v8i32(
	; CHECK-NEXT: [[B:%.*]] = bitcast <4 x i32>* [[P:%.*]] to i32*			; CHECK-NEXT: [[TMP1:%.*]] = load <4 x i32>, <4 x i32>* [[P:%.*]], align 4
	; CHECK-NEXT: [[S:%.*]] = load i32, i32* [[B]], align 4			; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x i32> [[TMP1]], <4 x i32> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
	; CHECK-NEXT: [[R:%.*]] = insertelement <8 x i32> undef, i32 [[S]], i32 0
	; CHECK-NEXT: ret <8 x i32> [[R]]			; CHECK-NEXT: ret <8 x i32> [[R]]
	;			;
	%b = bitcast <4 x i32>* %p to i32*			%b = bitcast <4 x i32>* %p to i32*
	%s = load i32, i32* %b, align 4			%s = load i32, i32* %b, align 4
	%r = insertelement <8 x i32> undef, i32 %s, i32 0			%r = insertelement <8 x i32> undef, i32 %s, i32 0
	ret <8 x i32> %r			ret <8 x i32> %r
	}			}

	; TODO: Should load v4f32.

	define <8 x float> @load_f32_insert_v8f32(float* align 16 dereferenceable(16) %p) {			define <8 x float> @load_f32_insert_v8f32(float* align 16 dereferenceable(16) %p) {
	; CHECK-LABEL: @load_f32_insert_v8f32(			; CHECK-LABEL: @load_f32_insert_v8f32(
	; CHECK-NEXT: [[S:%.*]] = load float, float* [[P:%.*]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[P:%.*]] to <4 x float>*
	; CHECK-NEXT: [[R:%.*]] = insertelement <8 x float> undef, float [[S]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 4
				; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x float> [[TMP2]], <4 x float> undef, <8 x i32> <i32 0, i32 1, i32 2, i32 3, i32 4, i32 5, i32 6, i32 7>
	; CHECK-NEXT: ret <8 x float> [[R]]			; CHECK-NEXT: ret <8 x float> [[R]]
	;			;
	%s = load float, float* %p, align 4			%s = load float, float* %p, align 4
	%r = insertelement <8 x float> undef, float %s, i32 0			%r = insertelement <8 x float> undef, float %s, i32 0
	ret <8 x float> %r			ret <8 x float> %r
	}			}

	; TODO: Should load v4f32.

	define <2 x float> @load_f32_insert_v2f32(float* align 16 dereferenceable(16) %p) {			define <2 x float> @load_f32_insert_v2f32(float* align 16 dereferenceable(16) %p) {
	; CHECK-LABEL: @load_f32_insert_v2f32(			; CHECK-LABEL: @load_f32_insert_v2f32(
	; CHECK-NEXT: [[S:%.*]] = load float, float* [[P:%.*]], align 4			; CHECK-NEXT: [[TMP1:%.*]] = bitcast float* [[P:%.*]] to <4 x float>*
	; CHECK-NEXT: [[R:%.*]] = insertelement <2 x float> undef, float [[S]], i32 0			; CHECK-NEXT: [[TMP2:%.*]] = load <4 x float>, <4 x float>* [[TMP1]], align 4
				; CHECK-NEXT: [[R:%.*]] = shufflevector <4 x float> [[TMP2]], <4 x float> undef, <2 x i32> <i32 0, i32 1>
	; CHECK-NEXT: ret <2 x float> [[R]]			; CHECK-NEXT: ret <2 x float> [[R]]
	;			;
	%s = load float, float* %p, align 4			%s = load float, float* %p, align 4
	%r = insertelement <2 x float> undef, float %s, i32 0			%r = insertelement <2 x float> undef, float %s, i32 0
	ret <2 x float> %r			ret <2 x float> %r
	}			}