This is an archive of the discontinued LLVM Phabricator instance.

[SLP] try to create vector loads from bitcasted scalar pointers
Abandoned · Public

Authored by spatel on Jul 3 2019, 10:14 AM.

Details

Summary

This doesn't help the motivating cases in:
https://bugs.llvm.org/show_bug.cgi?id=16739
...yet, but I'd like to get feedback on the general approach.

The general idea is that if we have a legal vector pointer type, but we are bitcasting that pointer to only load a subset of a vector, then load the whole vector (if that is safe) and extract the subset of the vector.

This will allow SLP and/or instcombine to fold subsequent scalar ops together more easily because they will see extractelement ops from a single vector rather than incomplete parts of that vector.

Currently, this transform will make no overall difference to these most basic patterns because the backend (DAGCombiner) will narrow the loads back down to scalars via narrowExtractedVectorLoad().
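The idea can be sketched with a minimal IR example (illustrative only; the element types and the dereferenceable size here are assumptions, not taken from the patch):

```llvm
; Before: a scalar load through a bitcast of a vector pointer.
define float @scalar_via_bitcast(<4 x float>* dereferenceable(16) %p) {
  %bc = bitcast <4 x float>* %p to float*
  %s = load float, float* %bc, align 16
  ret float %s
}

; After: since all 16 bytes are known dereferenceable, load the whole
; vector and extract the requested element.
define float @whole_vector_extract(<4 x float>* dereferenceable(16) %p) {
  %v = load <4 x float>, <4 x float>* %p, align 16
  %s = extractelement <4 x float> %v, i32 0
  ret float %s
}
```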

Diff Detail

Event Timeline

spatel created this revision. Jul 3 2019, 10:14 AM
Herald added a project: Restricted Project. · View Herald Transcript · Jul 3 2019, 10:14 AM
ABataev added inline comments. Jul 3 2019, 10:37 AM
llvm/test/Transforms/SLPVectorizer/X86/load-bitcast-vec.ll
7–9

Seems to me, it must be a masked load rather than just a load. Plus, what about the cost? This does not look cost-optimal.

spatel marked an inline comment as done. Jul 3 2019, 10:47 AM
spatel added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/load-bitcast-vec.ll
7–9

If the load is guaranteed dereferenceable, does that not allow a speculative load of the entire vector?

I'm open to suggestions about the cost calc. It's not clear to me if there's an existing TTI API for this or if we need to create a new one.

I am probably missing something, but isn't this profitable only if we have more than one of these scalar loads extracting from the same vector load? Perhaps we could use these scalar loads as seeds and do a short top-down SLP?
Also, isn't it better to run this after vectorizeStoreChains()?

I am probably missing something, but isn't this profitable only if we have more than one of these scalar loads extracting from the same vector load?

It's almost certainly more profitable with >1 load, but that's not a requirement for profitability from what I can tell. Here's a more realistic example for the single load case:

define <2 x i64> @load_splat(<4 x float>* dereferenceable(16) %p, <2 x i64> %y) {
  %bc = bitcast <4 x float>* %p to i64*
  %ld = load i64, i64* %bc, align 16
  %ins = insertelement <2 x i64> undef, i64 %ld, i32 0
  %splat = shufflevector <2 x i64> %ins, <2 x i64> undef, <2 x i32> zeroinitializer
  %add  = add <2 x i64> %splat, %y
  ret <2 x i64> %add
}

Current codegen for an x86 SSE target:

movq	(%rdi), %xmm1           # xmm1 = mem[0],zero
pshufd	$68, %xmm1, %xmm1       # xmm1 = xmm1[0,1,0,1]
paddq	%xmm1, %xmm0

With this patch, the IR is reduced to:

$ ./opt load-bitcast-vec.ll -S -mtriple=x86_64-- -slp-vectorizer -instcombine
define <2 x i64> @larger_scalar(<4 x float>* dereferenceable(16) %p, <2 x i64> %y) {
  %1 = bitcast <4 x float>* %p to <2 x i64>*
  %2 = load <2 x i64>, <2 x i64>* %1, align 16
  %splat = shufflevector <2 x i64> %2, <2 x i64> undef, <2 x i32> zeroinitializer
  %add = add <2 x i64> %splat, %y
  ret <2 x i64> %add
}

And codegen improves by folding the load:

pshufd	$68, (%rdi), %xmm1      # xmm1 = mem[0,1,0,1]
paddq	%xmm1, %xmm0

Perhaps we could use these scalar loads as seeds and do a short top-down SLP?
Also, isn't it better to run this after vectorizeStoreChains()?

I don't have enough familiarity with SLP to know how to best fit these pieces together, but those seem like reasonable ideas. I was only trying to assess viability with this initial proposal - and not break anything. :)

Thanks for sharing the example.

Isn't this something that should be pattern-matched in instcombine/codegen and not in SLP?
What I mean is that if we have multiple of these loads, then this transformation should obviously be performed by the vectorizer. But if we only have one scalar load, then it looks a bit odd to do it in SLP.

Isn't this something that should be pattern-matched in instcombine/codegen and not in SLP?

Great question. I've been wondering how to solve PR16739 for about 5 years now! Here are my current answers:

  1. It's not possible for instcombine to create vector loads from scalar loads because we have no legality/cost model there. In fact, we have a request to do an opposing transform in instcombine in PR42424:

https://bugs.llvm.org/show_bug.cgi?id=42424 (but as I said there, I don't think instcombine is allowed to do that transform either)

  2. It is possible to handle the most basic case in DAGCombiner, but it requires propagating the dereferenceable attribute/metadata through the transition from IR to the SDAG. There's partial precedent for that, but I'm not sure if it's enough to handle the general case. The disadvantage of waiting that long is that we may miss IR-based (instcombine) transforms and then have to recreate those in SDAG.

What I mean is that if we have multiple of these loads, then this transformation should obviously be performed by the vectorizer. But if we only have one scalar load, then it looks a bit odd to do it in SLP.

Yes, I'd like to extend this to handle the >1 load case, but I was starting with the minimal pattern and hoping to build on that. I'm imagining that if we have >1 load from a base pointer, then we'll group those together and create a sequence of extracts from a single vector load. If I need to show that as part of the initial patch, I'll extend this patch now.
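A hypothetical shape of that >1 load case (function and value names here are invented for illustration):

```llvm
; Two scalar loads from the same base pointer...
define i64 @two_scalar_loads(<2 x i64>* dereferenceable(16) %p) {
  %bc = bitcast <2 x i64>* %p to i64*
  %g1 = getelementptr i64, i64* %bc, i32 1
  %lo = load i64, i64* %bc, align 16
  %hi = load i64, i64* %g1, align 8
  %r = add i64 %lo, %hi
  ret i64 %r
}

; ...grouped into one vector load feeding a sequence of extracts.
define i64 @one_vector_load(<2 x i64>* dereferenceable(16) %p) {
  %v = load <2 x i64>, <2 x i64>* %p, align 16
  %lo = extractelement <2 x i64> %v, i32 0
  %hi = extractelement <2 x i64> %v, i32 1
  %r = add i64 %lo, %hi
  ret i64 %r
}
```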

Ping.

Unanswered questions:

  1. Is there a better cost query than checking whether the target has a vector register (TTI->getRegisterBitWidth(true)) that exceeds the load size?
  2. Do we require that multiple scalar loads be subsumed by the vector load?

I personally think this seems to be going in the right direction, though it isn't obvious without some more complicated tests that would show the further transforms this could allow.

llvm/test/Transforms/SLPVectorizer/X86/load-bitcast-vec.ll
7–9

I agree that there is no reason this should be a masked load.
Do we have opposite folds for this in dagcombine?

spatel marked an inline comment as done. Jul 25 2019, 4:35 AM
spatel added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/load-bitcast-vec.ll
7–9

Yes - see narrowExtractedVectorLoad() in DAGCombiner.

lebedev.ri marked an inline comment as done. Jul 25 2019, 5:45 AM
lebedev.ri added inline comments.
llvm/test/Transforms/SLPVectorizer/X86/load-bitcast-vec.ll
7–9

Then as far as I'm concerned this is a zero-cost change.

ABataev added inline comments. Jul 25 2019, 6:35 AM
llvm/test/Transforms/SLPVectorizer/X86/load-bitcast-vec.ll
7–9

getCastInstrCost + getMemoryOpCost for scalar instructions.
getMemoryOpCost + getExtractWithExtendCost for vector instructions. No?

lebedev.ri added inline comments. Jul 25 2019, 6:45 AM
llvm/test/Transforms/SLPVectorizer/X86/load-bitcast-vec.ll
7–9

Then as far as I'm concerned this is a zero-cost change.

... in the sense that if further passes don't make more use of this load,
it is guaranteed to be demoted back into a simple scalar load.

spatel updated this revision to Diff 211803. Jul 25 2019, 12:55 PM

Patch updated:
Use the TTI cost model to compare the costs of the original and new load sequences.

spatel marked an inline comment as done. Aug 1 2019, 8:18 AM

Ping.

ABataev added inline comments. Aug 8 2019, 7:12 AM
llvm/include/llvm/Transforms/Vectorize/SLPVectorizer.h
86

Comment?

llvm/lib/Transforms/Vectorize/SLPVectorizer.cpp
5075

Not sure this is the right place to call this function. I would suggest doing it at the end of vectorization. Plus, this should update the value of the Changed variable.

5250

Load->isSimple() instead of Load->isVolatile() || Load->isAtomic()

5268–5272

Why? Long vectors could be split into several smaller vector loads successfully.

5280–5306

Could you reuse the original logic for building the vectorization tree, cost calculation, etc.? This looks like bikeshedding and does not allow extending it to other ops.

spatel updated this revision to Diff 214195. Aug 8 2019, 11:12 AM
spatel marked 4 inline comments as done.

Patch updated:

  1. Added documentation comment for vectorizeLoads().
  2. Use isSimple() to filter out 'volatile' and other loads that we don't want to alter.
  3. Moved vectorization of loads to end of processing per block (not sure if that answers the request for "end of vectorization" though).
  4. Removed check for load larger than vector register size (that was an attempt to not create something harmful, but now we are using the cost model).

It's not clear to me how to re-use the existing tree model/cost code, so I'm still looking into that. Suggestions appreciated.

@spatel Has this been superseded by D81766?

spatel abandoned this revision. Aug 8 2020, 9:26 AM

@spatel Has this been superseded by D81766?

Yes, we are keying off of a different pattern now, but with the same motivation (though neither implementation would help PR16739 yet... follow-ups expected).