This is an archive of the discontinued LLVM Phabricator instance.


Differential D135229

[AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to use tbl instructions
ClosedPublic

Authored by nilanjana_basu on Oct 4 2022, 5:12 PM.


Details

Reviewers

fhahn
t.p.northover
paquette

Commits

rG02d09ffc1b09: [AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to…

Summary

[AArch64] D133495 lowers trunc instructions to 'tbl' for (8|16) x i32 -> (8|16) x i8 conversions. This patch extends that lowering to (8|16) x i64 -> (8|16) x i8 conversions.

A runtime microbenchmark covering all of these cases has been added in D136274.
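
To make the summary concrete, here is a minimal software model of the idea. It is a sketch, not code from the patch. It assumes the architectural TBL behaviour that an out-of-range index yields a zero byte, it lays the table out in little-endian byte order, and the names `emulateTbl` and `truncV8I64ToV8I8` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Software model of one TBL lookup: every index selects one byte from the
// table; an index past the end of the table yields 0.
std::array<uint8_t, 16> emulateTbl(const std::vector<uint8_t> &table,
                                   const std::array<uint8_t, 16> &indices) {
  assert(table.size() <= 64 && "TBL takes at most four 16-byte registers");
  std::array<uint8_t, 16> out{};
  for (std::size_t i = 0; i < indices.size(); ++i)
    out[i] = indices[i] < table.size() ? table[indices[i]] : 0;
  return out;
}

// Hypothetical helper: truncate eight 64-bit lanes to eight bytes with a
// single four-register lookup.
std::array<uint8_t, 16> truncV8I64ToV8I8(const std::array<uint64_t, 8> &src) {
  // Serialize the lanes in little-endian order: 8 lanes * 8 bytes = 64 bytes,
  // i.e. exactly four 128-bit table registers.
  std::vector<uint8_t> table;
  for (uint64_t lane : src)
    for (int byte = 0; byte < 8; ++byte)
      table.push_back(static_cast<uint8_t>(lane >> (8 * byte)));

  std::array<uint8_t, 16> indices;
  indices.fill(255); // out of range, so the unused result bytes become 0
  for (unsigned lane = 0; lane < 8; ++lane)
    indices[lane] = static_cast<uint8_t>(lane * 8); // low byte of each lane
  return emulateTbl(table, indices);
}
```

With this model, a lane holding `0x1122334455667788` contributes the byte `0x88`, and result bytes 8 to 15 are zero because their indices are 255.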

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

nilanjana_basu created this revision. Oct 4 2022, 5:12 PM

Herald added a project: Restricted Project. Oct 4 2022, 5:12 PM

Herald added a subscriber: hiraditya.

nilanjana_basu requested review of this revision. Oct 4 2022, 5:12 PM

Herald added a project: Restricted Project. Oct 4 2022, 5:12 PM

Herald added a subscriber: llvm-commits.

nilanjana_basu set the repository for this revision to rG LLVM Github Monorepo. Oct 4 2022, 5:47 PM

nilanjana_basu retitled this revision from Extending trunc lowering of 'trunc <8 x i64> %x to <8 x i8>' to use tbl.4 instruction to [AArch64] Extending trunc lowering of 'trunc <8 x i64> %x to <8 x i8>' to use tbl.4 instruction. Oct 4 2022, 6:01 PM

Herald added a subscriber: kristof.beyls. Oct 4 2022, 6:01 PM

Ran git-clang-format & made minor change to reduce LoC.

nilanjana_basu added reviewers: fhahn, t.p.northover. Oct 4 2022, 6:05 PM

Removed an unused variable warning

Harbormaster completed remote builds in B190384: Diff 465246. Oct 4 2022, 7:56 PM

Extended the trunc lowering for other types like 16xi64, 16xi16, 8xi16

Harbormaster completed remote builds in B193763: Diff 469926. Oct 22 2022, 1:19 PM

The automated build tests failed for the previous patch because it was based on a previous commit for a unit test that isn't submitted yet. This patch fixes it by squashing the previous commit, removing the dependency & showing the final update.

Harbormaster completed remote builds in B193990: Diff 470224. Oct 24 2022, 1:00 PM

Ran clang-format since it was failing in the build report at https://buildkite.com/llvm-project/diff-checks/builds/133184

Harbormaster completed remote builds in B194302: Diff 470661. Oct 25 2022, 6:32 PM

nilanjana_basu added a reviewer: paquette. Oct 28 2022, 4:41 PM

nilanjana_basu edited the summary of this revision. Oct 28 2022, 4:57 PM

nilanjana_basu mentioned this in D137221: [MicroBenchmarks] Add benchmarks to check runtime of truncate or zero-extend vector operations in AArch64. Nov 1 2022, 6:31 PM

t.p.northover added inline comments. Nov 2 2022, 8:08 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

13405–13406

I think these are guaranteed to succeed by checks in the caller (and essential here), so cast<...> is probably better. Applies to some of the later dyn_casts too.

13458–13460

There's a lot of duplication in this switch, but it is pretty easy to eyeball for correctness because of that once you get what it's trying to do. So I'm torn, a loop like this would probably be shorter overall:

int ShuffleCount = 128/SrcElemSize;
SmallVector<int> ShuffleLanes;
for (int i = 0; i < ShuffleCount; ++i)
  ShuffleLanes.push_back(i);

SmallVector<Value *> Results;
while (ShuffleLanes.back() < NumElements) {
  Parts.push_back(Builder.CreateBitCast(Builder.CreateShuffleVector(TI->getOperand(0), ShuffleLanes), VecTy));
  for (int i = 0; i < ShuffleCount; ++i)
    ShuffleLanes[i] += ShuffleCount;
  if (Parts.size() == 4) {
    // Call tbl4, push result into Results, clear Parts.
  }
}

// Choose correct tbl (3 now a valid option) and call for rest of Parts, push to Results

// Shuffle-merge all of Results.

and allow the code to apply to a wider range of truncates. What are your views on the implementation?
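
The grouping step of the loop sketched in this comment can be modelled without LLVM. The following is a simplified illustration under stated assumptions: `planTblCalls` is a hypothetical name, the source vector is assumed to be a whole number of 128-bit registers, and the function only reports how many table registers each TBL call would consume rather than emitting any IR.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: returns, in order, the number of table registers used
// by each TBL call when a vector of numElements lanes of srcElemBits bits is
// split into 128-bit parts and the parts are consumed four at a time.
std::vector<int> planTblCalls(int srcElemBits, int numElements) {
  assert(srcElemBits > 0 && 128 % srcElemBits == 0 &&
         "element size must divide 128");
  const int lanesPerPart = 128 / srcElemBits;
  assert(numElements % lanesPerPart == 0 &&
         "partial 128-bit registers are not modelled");

  std::vector<int> calls;
  int pendingParts = 0;
  for (int firstLane = 0; firstLane < numElements; firstLane += lanesPerPart) {
    // One more 128-bit part is ready; flush a full tbl4 as soon as possible.
    if (++pendingParts == 4) {
      calls.push_back(4);
      pendingParts = 0;
    }
  }
  // Whatever is left maps to tbl1, tbl2 or tbl3.
  if (pendingParts != 0)
    calls.push_back(pendingParts);
  return calls;
}
```

Under these assumptions `planTblCalls(64, 16)` gives two four-register lookups, `planTblCalls(32, 12)` gives one three-register lookup, and `planTblCalls(16, 8)` gives a single one-register lookup. The sketch does not enforce any limit on the number or mix of lookups; the patch itself asserts on that separately.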

nilanjana_basu mentioned this in rT3b44b6bdd3e8: [MicroBenchmarks] Add benchmarks to check runtime of truncate or zero-extend…. Nov 2 2022, 2:05 PM

nilanjana_basu added a parent revision: D137293: [AArch64] Extra unit tests for trunc lowering of vectors. Nov 2 2022, 3:01 PM

Addressed comments by t.p.northover - refactored code to remove redundancy

nilanjana_basu marked an inline comment as done. Nov 3 2022, 7:39 PM

nilanjana_basu added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13458–13460	I refactored the code as you suggested, so it can now apply to a few extra cases like 12xi32 or 4xi32. However, I haven't modified the old set of allowable cases since I don't know how relevant these few are. In my understanding, we get better performance when tbl2-tbl4 get triggered, as the number of generated instructions decreases. So I need your opinion on whether we should allow 8xi16 conversions, since they generate a single tbl1 instruction.

Ran clang-format

Harbormaster completed remote builds in B196052: Diff 473110. Nov 3 2022, 8:44 PM

nilanjana_basu removed a parent revision: D137293: [AArch64] Extra unit tests for trunc lowering of vectors. Nov 4 2022, 5:22 PM

nilanjana_basu mentioned this in D138059: [MicroBenchmarks,AArch64] Added correctness test & other performance tests for truncate or zero-extend vector operations. Nov 15 2022, 1:13 PM

Added comments

Harbormaster completed remote builds in B198876: Diff 477022. Nov 22 2022, 12:18 AM

fhahn added inline comments. Nov 22 2022, 3:57 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13405	Is this guaranteed to be a fixed vector type? Could you add a variant of a test with truncates of scalable vectors (`<vscale x 16 x i8>` or something like that)?
13419	It would be great if you could add a brief comment here explaining what kind of masks/shuffles are prepared here.
13444	"store" seems ambiguous here, as we won't emit a store instruction, right?
13489	SmallVector?
13494	SmallVector?
llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
676	Similar to D136722, it is likely not profitable to do this when converting to/from the next power-of-2.

fhahn added inline comments. Nov 22 2022, 10:44 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13405	I think it should be fine, I added a test in 4783345426da

Updated comments as mentioned in the reviews. Rebased on tests for this change prior to applying this patch.

Harbormaster completed remote builds in B199088: Diff 477347. Nov 22 2022, 5:00 PM

nilanjana_basu added a parent revision: D137293: [AArch64] Extra unit tests for trunc lowering of vectors. Nov 22 2022, 5:02 PM

Rebasing on commit of test cases prior to application of this patch

Removed case for 'trunc <(8|16)xi16> %x to <(8|16)xi8>' since it was adding more instructions to loop header, while not improving loop instruction count

Updated a comment

nilanjana_basu retitled this revision from [AArch64] Extending lowering of 'trunc <(8|16) x (i16|i64)> %x to <(8|16) x i8>' to use tbl instructions to [AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to use tbl instructions. Nov 23 2022, 2:38 AM

nilanjana_basu edited the summary of this revision.

Removed the (8|16)xi16 to (8|16)xi8 conversion because it showed no benefit in instruction count and additionally added more instructions to the loop header. Updated comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13405	Since the source & destination types were checked for FixedVector once in the calling function optimizeExtendOrTruncateConversion(), I didn't check it here again.
13405	Since both the zext & trunc tests for scalable vectors go through the optimizeExtendOrTruncateConversion() function, the zext test in 4783345426da should suffice for the trunc too. Let me know if you think it needs to be replicated for trunc vector too.
13444	I replaced the "store" with "save" to indicate that it is being stored in the compiler's internal vector data structure. Added a comment at the place of combining these results.
llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
676	Since we only support a handful of vector type truncates in this implementation, only Yxi16->Yxi8 was a supported next power-of-2 conversion. Removed it.
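
The step that combines the saved TBL results can be illustrated in isolation. The sketch below mirrors the mask logic visible in Diff 477022 but is not the patch code: `buildMergeMask` is a hypothetical name, and it assumes each TBL result is a 16-byte vector of which only the first `elemsPerTbl` bytes are meaningful.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: lane indices for the final shuffle that merges one or
// two 16-byte TBL results. Lanes 0..15 address the first result and lanes
// 16..31 address the second, as in a two-operand shuffle.
std::vector<int> buildMergeMask(int elemsPerTbl, int numResults) {
  assert((numResults == 1 || numResults == 2) &&
         "at most two TBL results are combined");
  assert(elemsPerTbl > 0 && elemsPerTbl <= 16);

  std::vector<int> mask(elemsPerTbl * numResults);
  if (numResults == 2 && elemsPerTbl < 16) {
    // Take the valid prefix of each result and skip the unused tail lanes.
    std::iota(mask.begin(), mask.begin() + elemsPerTbl, 0);
    std::iota(mask.begin() + elemsPerTbl, mask.end(), 16);
  } else {
    std::iota(mask.begin(), mask.end(), 0);
  }
  return mask;
}
```

For a 16 x i64 source, where each four-register lookup yields 8 valid bytes, the mask selects lanes 0 to 7 of the first result followed by lanes 16 to 23, which are the first 8 lanes of the second result.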

Harbormaster completed remote builds in B199151: Diff 477432. Nov 23 2022, 3:32 AM

Rebasing on parent patch for tests

Harbormaster completed remote builds in B199602: Diff 478030. Nov 25 2022, 3:08 PM

nilanjana_basu edited the summary of this revision. Nov 25 2022, 3:54 PM

Trying to fix rebasing error

Harbormaster completed remote builds in B199605: Diff 478033. Nov 25 2022, 4:35 PM

nilanjana_basu removed a parent revision: D137293: [AArch64] Extra unit tests for trunc lowering of vectors. Nov 28 2022, 4:17 AM

nilanjana_basu mentioned this in rT08de51078b0a: [MicroBenchmarks,AArch64] Added correctness test & other performance tests for…. Dec 1 2022, 10:09 PM

LGTM with the inline suggestions. Please wait a day or so with committing in case there are additional comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13414	could you add an assert to make sure the division happens without remainder?
13419	IIUC the only case that can happen here is that `Parts == 4`, right? Might be good to update the check.
13426	Could use `Builder.getInt8(....)`?
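
The index vector these comments refer to can be reproduced outside LLVM. The sketch below follows the mask formula shown in Diff 477022 and adds the divisibility check requested above; `buildTruncMask` is a hypothetical name and the destination element is assumed to be one byte.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: the 16 byte indices handed to TBL when truncating
// numElements lanes of srcElemBits bits down to one byte each.
std::vector<uint8_t> buildTruncMask(unsigned srcElemBits, unsigned numElements,
                                    bool isLittleEndian) {
  assert(srcElemBits % 8 == 0 && "division must happen without remainder");
  assert(numElements <= 16 && "a TBL result has only 16 byte lanes");
  const unsigned truncFactor = srcElemBits / 8;
  assert(numElements * truncFactor <= 256 && "index must fit in one byte");

  std::vector<uint8_t> mask;
  for (unsigned i = 0; i < 16; ++i) {
    if (i < numElements) {
      // The least significant byte is first in a little-endian lane and last
      // in a big-endian lane.
      const unsigned idx = isLittleEndian
                               ? i * truncFactor
                               : i * truncFactor + (truncFactor - 1);
      mask.push_back(static_cast<uint8_t>(idx));
    } else {
      mask.push_back(255); // unused lane
    }
  }
  return mask;
}
```

For 16 lanes of i64 this yields 0, 8, 16, ..., 120 in little-endian mode and 7, 15, 23, ..., 127 in big-endian mode, which agrees with the constant-pool bytes checked in trunc-to-tbl.ll.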

This revision is now accepted and ready to land. Dec 14 2022, 1:47 PM

Closed by commit rG02d09ffc1b09: [AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to… (authored by nilanjana_basu). Dec 15 2022, 7:21 AM

This revision was automatically updated to reflect the committed changes.

nilanjana_basu added a commit: rG02d09ffc1b09: [AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to….

nilanjana_basu mentioned this in rG795868285db9: [AArch64] Minor changes and sanity checks in relation to https://reviews.llvm.. Dec 15 2022, 12:09 PM

Addressed the final comments in a separate commit.

Revision Contents

Path	Size
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp	140 lines
llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll	66 lines
llvm/test/CodeGen/AArch64/trunc-to-tbl.ll	423 lines

Diff 477022

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp


Show First 20 Lines • Show All 13,394 Lines • ▼ Show 20 Lines	static void createTblShuffleForZExt(ZExtInst *ZExt, bool IsLittleEndian) {
Value *Result = Builder.CreateShuffleVector(Op, FirstEltZero, Mask);		Value *Result = Builder.CreateShuffleVector(Op, FirstEltZero, Mask);
Result = Builder.CreateBitCast(Result, DstTy);		Result = Builder.CreateBitCast(Result, DstTy);
ZExt->replaceAllUsesWith(Result);		ZExt->replaceAllUsesWith(Result);
ZExt->eraseFromParent();		ZExt->eraseFromParent();
}		}

static void createTblForTrunc(TruncInst *TI, bool IsLittleEndian) {		static void createTblForTrunc(TruncInst *TI, bool IsLittleEndian) {
IRBuilder<> Builder(TI);		IRBuilder<> Builder(TI);
SmallVector<Value *> Parts;		SmallVector<Value *> Parts, Parts2;
		int NumElements = cast<FixedVectorType>(TI->getType())->getNumElements();
		auto *SrcTy = cast<FixedVectorType>(TI->getOperand(0)->getType());
		auto *DstTy = cast<FixedVectorType>(TI->getType());
		assert(SrcTy->getElementType()->isIntegerTy() &&
		"Non-integer type source vector element is not supported");
		assert(DstTy->getElementType()->isIntegerTy(8) &&
		"Unsupported destination vector element type");
		unsigned SrcElemTySz =
		cast<IntegerType>(SrcTy->getElementType())->getBitWidth();
		unsigned TruncFactor =
		SrcElemTySz / cast<IntegerType>(DstTy->getElementType())->getBitWidth();
		assert((SrcElemTySz == 16 \|\| SrcElemTySz == 32 \|\| SrcElemTySz == 64) &&
		"Unsupported source vector element type size");
Type *VecTy = FixedVectorType::get(Builder.getInt8Ty(), 16);		Type *VecTy = FixedVectorType::get(Builder.getInt8Ty(), 16);
Parts.push_back(Builder.CreateBitCast(
Builder.CreateShuffleVector(TI->getOperand(0), {0, 1, 2, 3}), VecTy));
Parts.push_back(Builder.CreateBitCast(
Builder.CreateShuffleVector(TI->getOperand(0), {4, 5, 6, 7}), VecTy));

Intrinsic::ID TblID = Intrinsic::aarch64_neon_tbl2;		SmallVector<Constant *, 16> MaskConst;
unsigned NumElements = cast<FixedVectorType>(TI->getType())->getNumElements();		for (int Itr = 0; Itr < 16; Itr++) {
if (NumElements == 16) {		if (Itr < NumElements)
Parts.push_back(Builder.CreateBitCast(		MaskConst.push_back(ConstantInt::get(
Builder.CreateShuffleVector(TI->getOperand(0), {8, 9, 10, 11}), VecTy));		Builder.getInt8Ty(), IsLittleEndian
		? Itr * TruncFactor
		: Itr * TruncFactor + (TruncFactor - 1)));
		else
		MaskConst.push_back(ConstantInt::get(Builder.getInt8Ty(), 255));
		}

		int MaxTblSz = 128 * 4;
		int MaxSrcSz = SrcElemTySz * NumElements;
		int ElemsPerTbl =
		(MaxTblSz > MaxSrcSz) ? NumElements : (MaxTblSz / SrcElemTySz);
		assert(ElemsPerTbl <= 16 &&
		"Maximum elements selected using TBL instruction cannot exceed 16!");

		int ShuffleCount = 128 / SrcElemTySz;
		SmallVector<int> ShuffleLanes;
		for (int i = 0; i < ShuffleCount; ++i)
		ShuffleLanes.push_back(i);

		// Create TBL's table of bytes in 1,2,3 or 4 FP/SIMD registers using shuffles
		// over the source vector. If TBL's maximum 4 FP/SIMD registers are saturated,
		// call TBL & store the result in a vector for combining later.
		SmallVector<Value *> Results;
		while (ShuffleLanes.back() < NumElements) {
Parts.push_back(Builder.CreateBitCast(		Parts.push_back(Builder.CreateBitCast(
Builder.CreateShuffleVector(TI->getOperand(0), {12, 13, 14, 15}),		Builder.CreateShuffleVector(TI->getOperand(0), ShuffleLanes), VecTy));
VecTy));
TblID = Intrinsic::aarch64_neon_tbl4;		if (Parts.size() >= 4) {
		auto *F = Intrinsic::getDeclaration(TI->getModule(),
		Intrinsic::aarch64_neon_tbl4, VecTy);
		Parts.push_back(ConstantVector::get(MaskConst));
		Results.push_back(Builder.CreateCall(F, Parts));
		Parts.clear();
}		}
SmallVector<Constant *, 16> MaskConst;
for (unsigned Idx = 0; Idx < NumElements * 4; Idx += 4)
MaskConst.push_back(
ConstantInt::get(Builder.getInt8Ty(), IsLittleEndian ? Idx : Idx + 3));

for (unsigned Idx = NumElements * 4; Idx < 64; Idx += 4)		for (int i = 0; i < ShuffleCount; ++i)
MaskConst.push_back(ConstantInt::get(Builder.getInt8Ty(), 255));		ShuffleLanes[i] += ShuffleCount;
		}

		assert((Parts.empty() \|\| Results.empty()) &&
		"Lowering trunc for vectors requiring different TBL instructions is "
		"not supported!");
		if (!Parts.empty()) {
		Intrinsic::ID TblID;
		switch (Parts.size()) {
		case 1:
		TblID = Intrinsic::aarch64_neon_tbl1;
		break;
		case 2:
		TblID = Intrinsic::aarch64_neon_tbl2;
		break;
		case 3:
		TblID = Intrinsic::aarch64_neon_tbl3;
		break;
		}
		auto *F = Intrinsic::getDeclaration(TI->getModule(), TblID, VecTy);
Parts.push_back(ConstantVector::get(MaskConst));		Parts.push_back(ConstantVector::get(MaskConst));
auto *F =		Results.push_back(Builder.CreateCall(F, Parts));
Intrinsic::getDeclaration(TI->getModule(), TblID, Parts[0]->getType());		}
Value *Res = Builder.CreateCall(F, Parts);
		// For ease of combining results from TBL's, we support at most two TBLs
if (NumElements == 8)		assert(Results.size() <= 2 && "Trunc lowering does not support generation of "
Res = Builder.CreateShuffleVector(Res, {0, 1, 2, 3, 4, 5, 6, 7});		"more than 2 tbl instructions!");
TI->replaceAllUsesWith(Res);		Value *FinalResult = Results[0];
		if (Results.size() == 1) {
		if (ElemsPerTbl < 16) {
		std::vector<int> FinalMask(ElemsPerTbl);
		std::iota(FinalMask.begin(), FinalMask.end(), 0);
		FinalResult = Builder.CreateShuffleVector(Results[0], FinalMask);
		}
		} else {
		std::vector<int> FinalMask(ElemsPerTbl * Results.size());
		if (ElemsPerTbl < 16) {
		std::iota(FinalMask.begin(), FinalMask.begin() + ElemsPerTbl, 0);
		std::iota(FinalMask.begin() + ElemsPerTbl, FinalMask.end(), 16);
		} else {
		std::iota(FinalMask.begin(), FinalMask.end(), 0);
		}
		FinalResult =
		Builder.CreateShuffleVector(Results[0], Results[1], FinalMask);
		}

		TI->replaceAllUsesWith(FinalResult);
TI->eraseFromParent();		TI->eraseFromParent();
}		}

bool AArch64TargetLowering::optimizeExtendOrTruncateConversion(Instruction *I,		bool AArch64TargetLowering::optimizeExtendOrTruncateConversion(Instruction *I,
Loop *L) const {		Loop *L) const {
// Try to optimize conversions using tbl. This requires materializing constant		// Try to optimize conversions using tbl. This requires materializing constant
// index vectors, which can increase code size and add loads. Skip the		// index vectors, which can increase code size and add loads. Skip the
// transform unless the conversion is in a loop block guaranteed to execute		// transform unless the conversion is in a loop block guaranteed to execute
▲ Show 20 Lines • Show All 46 Lines • ▼ Show 20 Lines	auto *WideConv = Builder.CreateFPToUI(FPToUI->getOperand(0),
VectorType::getInteger(SrcTy));		VectorType::getInteger(SrcTy));
auto *TruncI = Builder.CreateTrunc(WideConv, DstTy);		auto *TruncI = Builder.CreateTrunc(WideConv, DstTy);
I->replaceAllUsesWith(TruncI);		I->replaceAllUsesWith(TruncI);
I->eraseFromParent();		I->eraseFromParent();
createTblForTrunc(cast<TruncInst>(TruncI), Subtarget->isLittleEndian());		createTblForTrunc(cast<TruncInst>(TruncI), Subtarget->isLittleEndian());
return true;		return true;
}		}

// Convert 'trunc <(8\|16) x i32> %x to <(8\|16) x i8>' to a single tbl.4		// Convert 'trunc <(8\|16) x (i16\|i32\|i64)> %x to <(8\|16) x i8>' using tbl
// instruction selecting the lowest 8 bits per lane of the input interpreted		// instructions instruction selecting the lowest 8 bits per lane of the input
// as 2 or 4 <4 x i32> vectors.		// interpreted as 1, 2 or 4 <4 x i32> vectors.
auto *TI = dyn_cast<TruncInst>(I);		auto *TI = dyn_cast<TruncInst>(I);
if (TI && (SrcTy->getNumElements() == 8 \|\| SrcTy->getNumElements() == 16) &&		if (TI && DstTy->getElementType()->isIntegerTy(8) &&
SrcTy->getElementType()->isIntegerTy(32) &&		((SrcTy->getElementType()->isIntegerTy(16) \|\|
DstTy->getElementType()->isIntegerTy(8)) {		SrcTy->getElementType()->isIntegerTy(32) \|\|
		SrcTy->getElementType()->isIntegerTy(64)) &&
		(SrcTy->getNumElements() == 16 \|\| SrcTy->getNumElements() == 8))) {
createTblForTrunc(TI, Subtarget->isLittleEndian());		createTblForTrunc(TI, Subtarget->isLittleEndian());
return true;		return true;
}		}

return false;		return false;
}		}

bool AArch64TargetLowering::hasPairedLoad(EVT LoadedType,		bool AArch64TargetLowering::hasPairedLoad(EVT LoadedType,

llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll

	Show First 20 Lines • Show All 609 Lines • ▼ Show 20 Lines

	exit:			exit:
	ret void			ret void
	}			}

	define void @sink_v8z16_0(i32 %p, i32 %d, i64 %n, <16 x i8> %a) {			define void @sink_v8z16_0(i32 %p, i32 %d, i64 %n, <16 x i8> %a) {
	; CHECK-LABEL: sink_v8z16_0:			; CHECK-LABEL: sink_v8z16_0:
	; CHECK: // %bb.0: // %entry			; CHECK: // %bb.0: // %entry
				; CHECK-NEXT: adrp x9, .LCPI8_0
	; CHECK-NEXT: dup v0.8b, v0.b[0]			; CHECK-NEXT: dup v0.8b, v0.b[0]
	; CHECK-NEXT: mov x8, xzr			; CHECK-NEXT: mov x8, xzr
				; CHECK-NEXT: ldr q1, [x9, :lo12:.LCPI8_0]
	; CHECK-NEXT: .LBB8_1: // %loop			; CHECK-NEXT: .LBB8_1: // %loop
	; CHECK-NEXT: // =>This Inner Loop Header: Depth=1			; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: ldr d1, [x0]			; CHECK-NEXT: ldr d2, [x0]
	; CHECK-NEXT: add x8, x8, #8			; CHECK-NEXT: add x8, x8, #8
	; CHECK-NEXT: subs x2, x2, #8			; CHECK-NEXT: subs x2, x2, #8
	; CHECK-NEXT: umull v1.8h, v1.8b, v0.8b			; CHECK-NEXT: umull v2.8h, v2.8b, v0.8b
	; CHECK-NEXT: cmlt v1.8h, v1.8h, #0			; CHECK-NEXT: cmlt v2.8h, v2.8h, #0
	; CHECK-NEXT: xtn v1.8b, v1.8h			; CHECK-NEXT: tbl v2.16b, { v2.16b }, v1.16b
	; CHECK-NEXT: str d1, [x0], #32			; CHECK-NEXT: str d2, [x0], #32
	; CHECK-NEXT: b.ne .LBB8_1			; CHECK-NEXT: b.ne .LBB8_1
	; CHECK-NEXT: // %bb.2: // %exit			; CHECK-NEXT: // %bb.2: // %exit
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	entry:			entry:
	%ext = zext <16 x i8> %a to <16 x i16>			%ext = zext <16 x i8> %a to <16 x i16>
	%broadcast.splat = shufflevector <16 x i16> %ext, <16 x i16> poison, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>			%broadcast.splat = shufflevector <16 x i16> %ext, <16 x i16> poison, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>
	br label %loop			br label %loop

	loop:			loop:
	%index = phi i64 [ 0, %entry ], [ %index.next, %loop ]			%index = phi i64 [ 0, %entry ], [ %index.next, %loop ]
	%g = getelementptr inbounds i32, i32 *%p, i64 %index			%g = getelementptr inbounds i32, i32 *%p, i64 %index
	Show All 12 Lines

	exit:			exit:
	ret void			ret void
	}			}

	define void @sink_v16s16_8(i32 %p, i32 %d, i64 %n, <16 x i8> %a) {			define void @sink_v16s16_8(i32 %p, i32 %d, i64 %n, <16 x i8> %a) {
	; CHECK-LABEL: sink_v16s16_8:			; CHECK-LABEL: sink_v16s16_8:
	; CHECK: // %bb.0: // %entry			; CHECK: // %bb.0: // %entry
				; CHECK-NEXT: adrp x9, .LCPI9_0
	; CHECK-NEXT: dup v1.8b, v0.b[10]			; CHECK-NEXT: dup v1.8b, v0.b[10]
	; CHECK-NEXT: mov x8, xzr			; CHECK-NEXT: mov x8, xzr
	; CHECK-NEXT: dup v0.16b, v0.b[10]			; CHECK-NEXT: dup v0.16b, v0.b[10]
				; CHECK-NEXT: ldr q2, [x9, :lo12:.LCPI9_0]
	; CHECK-NEXT: .LBB9_1: // %loop			; CHECK-NEXT: .LBB9_1: // %loop
	; CHECK-NEXT: // =>This Inner Loop Header: Depth=1			; CHECK-NEXT: // =>This Inner Loop Header: Depth=1
	; CHECK-NEXT: ldr q2, [x0]			; CHECK-NEXT: ldr q3, [x0]
	; CHECK-NEXT: add x8, x8, #8			; CHECK-NEXT: add x8, x8, #8
	; CHECK-NEXT: subs x2, x2, #8			; CHECK-NEXT: subs x2, x2, #8
	; CHECK-NEXT: smull2 v3.8h, v2.16b, v0.16b			; CHECK-NEXT: smull2 v4.8h, v3.16b, v0.16b
	; CHECK-NEXT: smull v2.8h, v2.8b, v1.8b			; CHECK-NEXT: smull v3.8h, v3.8b, v1.8b
	; CHECK-NEXT: cmlt v3.8h, v3.8h, #0			; CHECK-NEXT: cmlt v5.8h, v4.8h, #0
	; CHECK-NEXT: cmlt v2.8h, v2.8h, #0			; CHECK-NEXT: cmlt v4.8h, v3.8h, #0
	; CHECK-NEXT: uzp1 v2.16b, v2.16b, v3.16b			; CHECK-NEXT: tbl v3.16b, { v4.16b, v5.16b }, v2.16b
	; CHECK-NEXT: str q2, [x0], #32			; CHECK-NEXT: str q3, [x0], #32
	; CHECK-NEXT: b.ne .LBB9_1			; CHECK-NEXT: b.ne .LBB9_1
	; CHECK-NEXT: // %bb.2: // %exit			; CHECK-NEXT: // %bb.2: // %exit
	; CHECK-NEXT: ret			; CHECK-NEXT: ret
	entry:			entry:
	%ext = sext <16 x i8> %a to <16 x i16>			%ext = sext <16 x i8> %a to <16 x i16>
	%broadcast.splat = shufflevector <16 x i16> %ext, <16 x i16> poison, <16 x i32> <i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>			%broadcast.splat = shufflevector <16 x i16> %ext, <16 x i16> poison, <16 x i32> <i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10, i32 10>
	br label %loop			br label %loop

	loop:			loop:
	%index = phi i64 [ 0, %entry ], [ %index.next, %loop ]			%index = phi i64 [ 0, %entry ], [ %index.next, %loop ]
	%g = getelementptr inbounds i32, i32 *%p, i64 %index			%g = getelementptr inbounds i32, i32 *%p, i64 %index
	Show All 19 Lines

llvm/test/CodeGen/AArch64/trunc-to-tbl.ll

Show First 20 Lines • Show All 229 Lines • ▼ Show 20 Lines	loop:
%iv.next = add i64 %iv, 1		%iv.next = add i64 %iv, 1
%ec = icmp eq i64 %iv.next, 1000		%ec = icmp eq i64 %iv.next, 1000
br i1 %ec, label %loop, label %exit		br i1 %ec, label %loop, label %exit

exit:		exit:
ret void		ret void
}		}

		; CHECK-LABEL: lCPI3_0:
		; CHECK-NEXT: .byte 0 ; 0x0
		; CHECK-NEXT: .byte 8 ; 0x8
		; CHECK-NEXT: .byte 16 ; 0x10
		; CHECK-NEXT: .byte 24 ; 0x18
		; CHECK-NEXT: .byte 32 ; 0x20
		; CHECK-NEXT: .byte 40 ; 0x28
		; CHECK-NEXT: .byte 48 ; 0x30
		; CHECK-NEXT: .byte 56 ; 0x38
		; CHECK-NEXT: .byte 64 ; 0x40
		; CHECK-NEXT: .byte 72 ; 0x48
		; CHECK-NEXT: .byte 80 ; 0x50
		; CHECK-NEXT: .byte 88 ; 0x58
		; CHECK-NEXT: .byte 96 ; 0x60
		; CHECK-NEXT: .byte 104 ; 0x68
		; CHECK-NEXT: .byte 112 ; 0x70
		; CHECK-NEXT: .byte 120 ; 0x78

		; CHECK-BE-LABEL: .LCPI3_0:
		; CHECK-BE-NEXT: .byte 7 // 0x7
		; CHECK-BE-NEXT: .byte 15 // 0xf
		; CHECK-BE-NEXT: .byte 23 // 0x17
		; CHECK-BE-NEXT: .byte 31 // 0x1f
		; CHECK-BE-NEXT: .byte 39 // 0x27
		; CHECK-BE-NEXT: .byte 47 // 0x2f
		; CHECK-BE-NEXT: .byte 55 // 0x37
		; CHECK-BE-NEXT: .byte 63 // 0x3f
		; CHECK-BE-NEXT: .byte 71 // 0x47
		; CHECK-BE-NEXT: .byte 79 // 0x4f
		; CHECK-BE-NEXT: .byte 87 // 0x57
		; CHECK-BE-NEXT: .byte 95 // 0x5f
		; CHECK-BE-NEXT: .byte 103 // 0x67
		; CHECK-BE-NEXT: .byte 111 // 0x6f
		; CHECK-BE-NEXT: .byte 119 // 0x77
		; CHECK-BE-NEXT: .byte 127 // 0x7f
define void @trunc_v16i64_to_v16i8_in_loop(ptr %A, ptr %dst) {		define void @trunc_v16i64_to_v16i8_in_loop(ptr %A, ptr %dst) {
; CHECK-LABEL: trunc_v16i64_to_v16i8_in_loop:		; CHECK-LABEL: trunc_v16i64_to_v16i8_in_loop:
; CHECK: ; %bb.0: ; %entry		; CHECK: ; %bb.0: ; %entry
		; CHECK-NEXT: Lloh4:
		; CHECK-NEXT: adrp x9, lCPI3_0@PAGE
; CHECK-NEXT: mov x8, xzr		; CHECK-NEXT: mov x8, xzr
		; CHECK-NEXT: Lloh5:
		; CHECK-NEXT: ldr q0, [x9, lCPI3_0@PAGEOFF]
; CHECK-NEXT: LBB3_1: ; %loop		; CHECK-NEXT: LBB3_1: ; %loop
; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1		; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
; CHECK-NEXT: add x9, x0, x8, lsl #7		; CHECK-NEXT: add x9, x0, x8, lsl #7
; CHECK-NEXT: ldp q3, q2, [x9, #96]		; CHECK-NEXT: ldp q1, q2, [x9]
; CHECK-NEXT: ldp q1, q0, [x9, #32]		; CHECK-NEXT: ldp q3, q4, [x9, #32]
; CHECK-NEXT: uzp1.4s v2, v3, v2		; CHECK-NEXT: ldp q16, q17, [x9, #64]
; CHECK-NEXT: ldp q5, q4, [x9, #64]		; CHECK-NEXT: tbl.16b v1, { v1, v2, v3, v4 }, v0
; CHECK-NEXT: uzp1.4s v0, v1, v0		; CHECK-NEXT: ldp q18, q19, [x9, #96]
; CHECK-NEXT: ldp q3, q6, [x9]		; CHECK-NEXT: tbl.16b v2, { v16, v17, v18, v19 }, v0
; CHECK-NEXT: uzp1.4s v4, v5, v4		; CHECK-NEXT: mov.d v1[1], v2[0]
; CHECK-NEXT: uzp1.8h v2, v4, v2		; CHECK-NEXT: str q1, [x1, x8, lsl #4]
; CHECK-NEXT: uzp1.4s v1, v3, v6
; CHECK-NEXT: uzp1.8h v0, v1, v0
; CHECK-NEXT: uzp1.16b v0, v0, v2
; CHECK-NEXT: str q0, [x1, x8, lsl #4]
; CHECK-NEXT: add x8, x8, #1		; CHECK-NEXT: add x8, x8, #1
; CHECK-NEXT: cmp x8, #1000		; CHECK-NEXT: cmp x8, #1000
; CHECK-NEXT: b.eq LBB3_1		; CHECK-NEXT: b.eq LBB3_1
; CHECK-NEXT: ; %bb.2: ; %exit		; CHECK-NEXT: ; %bb.2: ; %exit
; CHECK-NEXT: ret		; CHECK-NEXT: ret

; CHECK-BE-LABEL: trunc_v16i64_to_v16i8_in_loop:		; CHECK-BE-LABEL: trunc_v16i64_to_v16i8_in_loop:
; CHECK-BE: // %bb.0: // %entry		; CHECK-BE: // %bb.0: // %entry
		; CHECK-BE-NEXT: adrp x8, .LCPI3_0
		; CHECK-BE-NEXT: add x8, x8, :lo12:.LCPI3_0
		; CHECK-BE-NEXT: ld1 { v0.16b }, [x8]
; CHECK-BE-NEXT: mov x8, xzr		; CHECK-BE-NEXT: mov x8, xzr
; CHECK-BE-NEXT: .LBB3_1: // %loop		; CHECK-BE-NEXT: .LBB3_1: // %loop
; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1		; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-BE-NEXT: add x9, x0, x8, lsl #7		; CHECK-BE-NEXT: add x9, x0, x8, lsl #7
; CHECK-BE-NEXT: add x10, x9, #48		; CHECK-BE-NEXT: add x10, x9, #16
; CHECK-BE-NEXT: add x11, x9, #32		; CHECK-BE-NEXT: add x11, x9, #32
; CHECK-BE-NEXT: ld1 { v5.2d }, [x9]		; CHECK-BE-NEXT: ld1 { v1.16b }, [x9]
; CHECK-BE-NEXT: ld1 { v0.2d }, [x10]		; CHECK-BE-NEXT: ld1 { v2.16b }, [x10]
		; CHECK-BE-NEXT: add x10, x9, #48
		; CHECK-BE-NEXT: ld1 { v3.16b }, [x11]
		; CHECK-BE-NEXT: add x11, x9, #64
		; CHECK-BE-NEXT: ld1 { v4.16b }, [x10]
; CHECK-BE-NEXT: add x10, x9, #80		; CHECK-BE-NEXT: add x10, x9, #80
; CHECK-BE-NEXT: ld1 { v1.2d }, [x11]		; CHECK-BE-NEXT: ld1 { v16.16b }, [x11]
; CHECK-BE-NEXT: add x11, x9, #112		; CHECK-BE-NEXT: add x11, x9, #96
; CHECK-BE-NEXT: ld1 { v2.2d }, [x10]		; CHECK-BE-NEXT: add x9, x9, #112
; CHECK-BE-NEXT: add x10, x9, #96		; CHECK-BE-NEXT: ld1 { v17.16b }, [x10]
; CHECK-BE-NEXT: ld1 { v3.2d }, [x11]		; CHECK-BE-NEXT: tbl v1.16b, { v1.16b, v2.16b, v3.16b, v4.16b }, v0.16b
; CHECK-BE-NEXT: uzp1 v0.4s, v1.4s, v0.4s		; CHECK-BE-NEXT: ld1 { v18.16b }, [x11]
; CHECK-BE-NEXT: ld1 { v4.2d }, [x10]		; CHECK-BE-NEXT: ld1 { v19.16b }, [x9]
; CHECK-BE-NEXT: add x10, x9, #64
; CHECK-BE-NEXT: add x9, x9, #16
; CHECK-BE-NEXT: ld1 { v6.2d }, [x10]
; CHECK-BE-NEXT: ld1 { v7.2d }, [x9]
; CHECK-BE-NEXT: add x9, x1, x8, lsl #4		; CHECK-BE-NEXT: add x9, x1, x8, lsl #4
; CHECK-BE-NEXT: uzp1 v3.4s, v4.4s, v3.4s
; CHECK-BE-NEXT: add x8, x8, #1		; CHECK-BE-NEXT: add x8, x8, #1
; CHECK-BE-NEXT: cmp x8, #1000		; CHECK-BE-NEXT: cmp x8, #1000
; CHECK-BE-NEXT: uzp1 v2.4s, v6.4s, v2.4s		; CHECK-BE-NEXT: tbl v2.16b, { v16.16b, v17.16b, v18.16b, v19.16b }, v0.16b
; CHECK-BE-NEXT: uzp1 v1.4s, v5.4s, v7.4s		; CHECK-BE-NEXT: mov v1.d[1], v2.d[0]
; CHECK-BE-NEXT: uzp1 v2.8h, v2.8h, v3.8h		; CHECK-BE-NEXT: st1 { v1.16b }, [x9]
; CHECK-BE-NEXT: uzp1 v0.8h, v1.8h, v0.8h
; CHECK-BE-NEXT: uzp1 v0.16b, v0.16b, v2.16b
; CHECK-BE-NEXT: st1 { v0.16b }, [x9]
; CHECK-BE-NEXT: b.eq .LBB3_1		; CHECK-BE-NEXT: b.eq .LBB3_1
; CHECK-BE-NEXT: // %bb.2: // %exit		; CHECK-BE-NEXT: // %bb.2: // %exit
; CHECK-BE-NEXT: ret		; CHECK-BE-NEXT: ret

entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
%gep.A = getelementptr inbounds <16 x i64>, ptr %A, i64 %iv		%gep.A = getelementptr inbounds <16 x i64>, ptr %A, i64 %iv
%l.A = load <16 x i64>, ptr %gep.A		%l.A = load <16 x i64>, ptr %gep.A
%trunc = trunc <16 x i64> %l.A to <16 x i8>		%trunc = trunc <16 x i64> %l.A to <16 x i8>
%gep.dst = getelementptr inbounds <16 x i8>, ptr %dst, i64 %iv		%gep.dst = getelementptr inbounds <16 x i8>, ptr %dst, i64 %iv
store <16 x i8> %trunc, ptr %gep.dst		store <16 x i8> %trunc, ptr %gep.dst
%iv.next = add i64 %iv, 1		%iv.next = add i64 %iv, 1
%ec = icmp eq i64 %iv.next, 1000		%ec = icmp eq i64 %iv.next, 1000
br i1 %ec, label %loop, label %exit		br i1 %ec, label %loop, label %exit

exit:		exit:
ret void		ret void
}		}
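
For readers following the new `<16 x i64>` sequence in the right-hand column: it uses two four-register `tbl` lookups and then merges the low 64 bits of each result with `mov.d v1[1], v2[0]`. The sketch below models that data flow for the little-endian case in Python. It is an illustration only, not the lowering code from `AArch64ISelLowering.cpp`. The names `tbl`, `MASK` and `trunc_v16i64_to_v16i8` are hypothetical, and the upper eight mask bytes are an assumption: they do not affect the result, because only the low eight bytes of each lookup are kept.

```python
def tbl(table, mask):
    """Model of a TBL lookup: an index past the end of the table yields 0."""
    return [table[i] if i < len(table) else 0 for i in mask]


# Assumed mask: byte 0 of each of the eight i64 elements in a 64-byte table.
# The trailing 255 entries are placeholders for lanes that are discarded.
MASK = [0, 8, 16, 24, 32, 40, 48, 56] + [255] * 8


def trunc_v16i64_to_v16i8(values):
    """Truncate sixteen 64-bit integers to bytes using two modelled lookups."""
    assert len(values) == 16
    # Little-endian memory image: 16 elements * 8 bytes = 128 bytes,
    # i.e. eight 16-byte vector registers.
    data = b"".join((v & (2**64 - 1)).to_bytes(8, "little") for v in values)
    lo = tbl(data[:64], MASK)  # models: tbl.16b v1, { v1, v2, v3, v4 }, v0
    hi = tbl(data[64:], MASK)  # models: tbl.16b v2, { v16, v17, v18, v19 }, v0
    return lo[:8] + hi[:8]     # models: mov.d v1[1], v2[0]


sample = [0xABCDEF0000 + 0x100 * i + i for i in range(16)]
truncated = trunc_v16i64_to_v16i8(sample)
```

Each lookup replaces a chain of `uzp1` narrowing steps from the left-hand column, which is why the loop body gets shorter once the mask is loaded outside the loop.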

		; CHECK-LABEL: lCPI4_0:
		; CHECK-NEXT: .byte 0 ; 0x0
		; CHECK-NEXT: .byte 8 ; 0x8
		; CHECK-NEXT: .byte 16 ; 0x10
		; CHECK-NEXT: .byte 24 ; 0x18
		; CHECK-NEXT: .byte 32 ; 0x20
		; CHECK-NEXT: .byte 40 ; 0x28
		; CHECK-NEXT: .byte 48 ; 0x30
		; CHECK-NEXT: .byte 56 ; 0x38
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff

		; CHECK-BE-LABEL: .LCPI4_0:
		; CHECK-BE-NEXT: .byte 7 // 0x7
		; CHECK-BE-NEXT: .byte 15 // 0xf
		; CHECK-BE-NEXT: .byte 23 // 0x17
		; CHECK-BE-NEXT: .byte 31 // 0x1f
		; CHECK-BE-NEXT: .byte 39 // 0x27
		; CHECK-BE-NEXT: .byte 47 // 0x2f
		; CHECK-BE-NEXT: .byte 55 // 0x37
		; CHECK-BE-NEXT: .byte 63 // 0x3f
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
define void @trunc_v8i64_to_v8i8_in_loop(ptr %A, ptr %dst) {		define void @trunc_v8i64_to_v8i8_in_loop(ptr %A, ptr %dst) {
; CHECK-LABEL: trunc_v8i64_to_v8i8_in_loop:		; CHECK-LABEL: trunc_v8i64_to_v8i8_in_loop:
; CHECK: ; %bb.0: ; %entry		; CHECK: ; %bb.0: ; %entry
		; CHECK-NEXT: Lloh6:
		; CHECK-NEXT: adrp x9, lCPI4_0@PAGE
; CHECK-NEXT: mov x8, xzr		; CHECK-NEXT: mov x8, xzr
		; CHECK-NEXT: Lloh7:
		; CHECK-NEXT: ldr q0, [x9, lCPI4_0@PAGEOFF]
; CHECK-NEXT: LBB4_1: ; %loop		; CHECK-NEXT: LBB4_1: ; %loop
; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1		; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
; CHECK-NEXT: add x9, x0, x8, lsl #6		; CHECK-NEXT: add x9, x0, x8, lsl #6
; CHECK-NEXT: ldp q1, q0, [x9, #32]		; CHECK-NEXT: ldp q1, q2, [x9]
; CHECK-NEXT: ldp q3, q2, [x9]		; CHECK-NEXT: ldp q3, q4, [x9, #32]
; CHECK-NEXT: uzp1.4s v0, v1, v0		; CHECK-NEXT: tbl.16b v1, { v1, v2, v3, v4 }, v0
; CHECK-NEXT: uzp1.4s v1, v3, v2		; CHECK-NEXT: str d1, [x1, x8, lsl #3]
; CHECK-NEXT: uzp1.8h v0, v1, v0
; CHECK-NEXT: xtn.8b v0, v0
; CHECK-NEXT: str d0, [x1, x8, lsl #3]
; CHECK-NEXT: add x8, x8, #1		; CHECK-NEXT: add x8, x8, #1
; CHECK-NEXT: cmp x8, #1000		; CHECK-NEXT: cmp x8, #1000
; CHECK-NEXT: b.eq LBB4_1		; CHECK-NEXT: b.eq LBB4_1
; CHECK-NEXT: ; %bb.2: ; %exit		; CHECK-NEXT: ; %bb.2: ; %exit
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; CHECK-NEXT: .loh AdrpLdr Lloh6, Lloh7

; CHECK-BE-LABEL: trunc_v8i64_to_v8i8_in_loop:		; CHECK-BE-LABEL: trunc_v8i64_to_v8i8_in_loop:
; CHECK-BE: // %bb.0: // %entry		; CHECK-BE: // %bb.0: // %entry
		; CHECK-BE-NEXT: adrp x8, .LCPI4_0
		; CHECK-BE-NEXT: add x8, x8, :lo12:.LCPI4_0
		; CHECK-BE-NEXT: ld1 { v0.16b }, [x8]
; CHECK-BE-NEXT: mov x8, xzr		; CHECK-BE-NEXT: mov x8, xzr
; CHECK-BE-NEXT: .LBB4_1: // %loop		; CHECK-BE-NEXT: .LBB4_1: // %loop
; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1		; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-BE-NEXT: add x9, x0, x8, lsl #6		; CHECK-BE-NEXT: add x9, x0, x8, lsl #6
; CHECK-BE-NEXT: add x10, x9, #48		; CHECK-BE-NEXT: add x10, x9, #16
; CHECK-BE-NEXT: ld1 { v1.2d }, [x9]		; CHECK-BE-NEXT: add x11, x9, #32
; CHECK-BE-NEXT: ld1 { v0.2d }, [x10]		; CHECK-BE-NEXT: ld1 { v1.16b }, [x9]
; CHECK-BE-NEXT: add x10, x9, #32		; CHECK-BE-NEXT: add x9, x9, #48
; CHECK-BE-NEXT: add x9, x9, #16		; CHECK-BE-NEXT: ld1 { v2.16b }, [x10]
; CHECK-BE-NEXT: ld1 { v2.2d }, [x10]		; CHECK-BE-NEXT: ld1 { v3.16b }, [x11]
; CHECK-BE-NEXT: ld1 { v3.2d }, [x9]		; CHECK-BE-NEXT: ld1 { v4.16b }, [x9]
; CHECK-BE-NEXT: add x9, x1, x8, lsl #3		; CHECK-BE-NEXT: add x9, x1, x8, lsl #3
; CHECK-BE-NEXT: add x8, x8, #1		; CHECK-BE-NEXT: add x8, x8, #1
; CHECK-BE-NEXT: cmp x8, #1000		; CHECK-BE-NEXT: cmp x8, #1000
; CHECK-BE-NEXT: uzp1 v0.4s, v2.4s, v0.4s		; CHECK-BE-NEXT: tbl v1.16b, { v1.16b, v2.16b, v3.16b, v4.16b }, v0.16b
; CHECK-BE-NEXT: uzp1 v1.4s, v1.4s, v3.4s		; CHECK-BE-NEXT: st1 { v1.8b }, [x9]
; CHECK-BE-NEXT: uzp1 v0.8h, v1.8h, v0.8h
; CHECK-BE-NEXT: xtn v0.8b, v0.8h
; CHECK-BE-NEXT: st1 { v0.8b }, [x9]
; CHECK-BE-NEXT: b.eq .LBB4_1		; CHECK-BE-NEXT: b.eq .LBB4_1
; CHECK-BE-NEXT: // %bb.2: // %exit		; CHECK-BE-NEXT: // %bb.2: // %exit
; CHECK-BE-NEXT: ret		; CHECK-BE-NEXT: ret

entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
%gep.A = getelementptr inbounds <8 x i64>, ptr %A, i64 %iv		%gep.A = getelementptr inbounds <8 x i64>, ptr %A, i64 %iv
%l.A = load <8 x i64>, ptr %gep.A		%l.A = load <8 x i64>, ptr %gep.A
[... 180 lines not shown ...]
store <11 x i8> %trunc, ptr %gep.dst		store <11 x i8> %trunc, ptr %gep.dst
%iv.next = add i64 %iv, 1		%iv.next = add i64 %iv, 1
%ec = icmp eq i64 %iv.next, 1000		%ec = icmp eq i64 %iv.next, 1000
br i1 %ec, label %loop, label %exit		br i1 %ec, label %loop, label %exit

exit:		exit:
ret void		ret void
}		}

		; CHECK-LABEL: lCPI7_0:
		; CHECK-NEXT: .byte 0 ; 0x0
		; CHECK-NEXT: .byte 2 ; 0x2
		; CHECK-NEXT: .byte 4 ; 0x4
		; CHECK-NEXT: .byte 6 ; 0x6
		; CHECK-NEXT: .byte 8 ; 0x8
		; CHECK-NEXT: .byte 10 ; 0xa
		; CHECK-NEXT: .byte 12 ; 0xc
		; CHECK-NEXT: .byte 14 ; 0xe
		; CHECK-NEXT: .byte 16 ; 0x10
		; CHECK-NEXT: .byte 18 ; 0x12
		; CHECK-NEXT: .byte 20 ; 0x14
		; CHECK-NEXT: .byte 22 ; 0x16
		; CHECK-NEXT: .byte 24 ; 0x18
		; CHECK-NEXT: .byte 26 ; 0x1a
		; CHECK-NEXT: .byte 28 ; 0x1c
		; CHECK-NEXT: .byte 30 ; 0x1e

		; CHECK-BE-LABEL: .LCPI7_0:
		; CHECK-BE-NEXT: .byte 1 // 0x1
		; CHECK-BE-NEXT: .byte 3 // 0x3
		; CHECK-BE-NEXT: .byte 5 // 0x5
		; CHECK-BE-NEXT: .byte 7 // 0x7
		; CHECK-BE-NEXT: .byte 9 // 0x9
		; CHECK-BE-NEXT: .byte 11 // 0xb
		; CHECK-BE-NEXT: .byte 13 // 0xd
		; CHECK-BE-NEXT: .byte 15 // 0xf
		; CHECK-BE-NEXT: .byte 17 // 0x11
		; CHECK-BE-NEXT: .byte 19 // 0x13
		; CHECK-BE-NEXT: .byte 21 // 0x15
		; CHECK-BE-NEXT: .byte 23 // 0x17
		; CHECK-BE-NEXT: .byte 25 // 0x19
		; CHECK-BE-NEXT: .byte 27 // 0x1b
		; CHECK-BE-NEXT: .byte 29 // 0x1d
		; CHECK-BE-NEXT: .byte 31 // 0x1f


		define void @trunc_v16i16_to_v16i8_in_loop(ptr %A, ptr %dst) {
		; CHECK-LABEL: trunc_v16i16_to_v16i8_in_loop:
		; CHECK: ; %bb.0: ; %entry
		; CHECK-NEXT: Lloh8:
		; CHECK-NEXT: adrp x9, lCPI7_0@PAGE
		; CHECK-NEXT: mov x8, xzr
		; CHECK-NEXT: Lloh9:
		; CHECK-NEXT: ldr q0, [x9, lCPI7_0@PAGEOFF]
		; CHECK-NEXT: LBB7_1: ; %loop
		; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: add x9, x0, x8, lsl #5
		; CHECK-NEXT: ldp q1, q2, [x9]
		; CHECK-NEXT: tbl.16b v1, { v1, v2 }, v0
		; CHECK-NEXT: str q1, [x1, x8, lsl #4]
		; CHECK-NEXT: add x8, x8, #1
		; CHECK-NEXT: cmp x8, #1000
		; CHECK-NEXT: b.eq LBB7_1
		; CHECK-NEXT: ; %bb.2: ; %exit
		; CHECK-NEXT: ret

		; CHECK-BE-LABEL: trunc_v16i16_to_v16i8_in_loop:
		; CHECK-BE: // %bb.0: // %entry
		; CHECK-BE-NEXT: adrp x8, .LCPI7_0
		; CHECK-BE-NEXT: add x8, x8, :lo12:.LCPI7_0
		; CHECK-BE-NEXT: ld1 { v0.16b }, [x8]
		; CHECK-BE-NEXT: mov x8, xzr
		; CHECK-BE-NEXT: .LBB7_1: // %loop
		; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1
		; CHECK-BE-NEXT: add x9, x0, x8, lsl #5
		; CHECK-BE-NEXT: add x10, x9, #16
		; CHECK-BE-NEXT: ld1 { v1.16b }, [x9]
		; CHECK-BE-NEXT: add x9, x1, x8, lsl #4
		; CHECK-BE-NEXT: add x8, x8, #1
		; CHECK-BE-NEXT: ld1 { v2.16b }, [x10]
		; CHECK-BE-NEXT: cmp x8, #1000
		; CHECK-BE-NEXT: tbl v1.16b, { v1.16b, v2.16b }, v0.16b
		; CHECK-BE-NEXT: st1 { v1.16b }, [x9]
		; CHECK-BE-NEXT: b.eq .LBB7_1
		; CHECK-BE-NEXT: // %bb.2: // %exit
		; CHECK-BE-NEXT: ret

		entry:
		br label %loop

		loop:
		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
		%gep.A = getelementptr inbounds <16 x i16>, ptr %A, i64 %iv
		%l.A = load <16 x i16>, ptr %gep.A
		%trunc = trunc <16 x i16> %l.A to <16 x i8>
		%gep.dst = getelementptr inbounds <16 x i8>, ptr %dst, i64 %iv
		store <16 x i8> %trunc, ptr %gep.dst
		%iv.next = add i64 %iv, 1
		%ec = icmp eq i64 %iv.next, 1000
		br i1 %ec, label %loop, label %exit

		exit:
		ret void
		}

		; CHECK-LABEL: lCPI8_0:
		; CHECK-NEXT: .byte 0 ; 0x0
		; CHECK-NEXT: .byte 2 ; 0x2
		; CHECK-NEXT: .byte 4 ; 0x4
		; CHECK-NEXT: .byte 6 ; 0x6
		; CHECK-NEXT: .byte 8 ; 0x8
		; CHECK-NEXT: .byte 10 ; 0xa
		; CHECK-NEXT: .byte 12 ; 0xc
		; CHECK-NEXT: .byte 14 ; 0xe
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff

		; CHECK-BE-LABEL: .LCPI8_0:
		; CHECK-BE-NEXT: .byte 1 // 0x1
		; CHECK-BE-NEXT: .byte 3 // 0x3
		; CHECK-BE-NEXT: .byte 5 // 0x5
		; CHECK-BE-NEXT: .byte 7 // 0x7
		; CHECK-BE-NEXT: .byte 9 // 0x9
		; CHECK-BE-NEXT: .byte 11 // 0xb
		; CHECK-BE-NEXT: .byte 13 // 0xd
		; CHECK-BE-NEXT: .byte 15 // 0xf
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff

		define void @trunc_v8i16_to_v8i8_in_loop(ptr %A, ptr %dst) {
		; CHECK-LABEL: trunc_v8i16_to_v8i8_in_loop:
		; CHECK: ; %bb.0: ; %entry
		; CHECK-NEXT: Lloh10:
		; CHECK-NEXT: adrp x9, lCPI8_0@PAGE
		; CHECK-NEXT: mov x8, xzr
		; CHECK-NEXT: Lloh11:
		; CHECK-NEXT: ldr q0, [x9, lCPI8_0@PAGEOFF]
		; CHECK-NEXT: LBB8_1: ; %loop
		; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
		; CHECK-NEXT: ldr q1, [x0, x8, lsl #4]
		; CHECK-NEXT: tbl.16b v1, { v1 }, v0
		; CHECK-NEXT: str d1, [x1, x8, lsl #3]
		; CHECK-NEXT: add x8, x8, #1
		; CHECK-NEXT: cmp x8, #1000
		; CHECK-NEXT: b.eq LBB8_1
		; CHECK-NEXT: ; %bb.2: ; %exit
		; CHECK-NEXT: ret

		; CHECK-BE-LABEL: trunc_v8i16_to_v8i8_in_loop:
		; CHECK-BE: // %bb.0: // %entry
		; CHECK-BE-NEXT: adrp x8, .LCPI8_0
		; CHECK-BE-NEXT: add x8, x8, :lo12:.LCPI8_0
		; CHECK-BE-NEXT: ld1 { v0.16b }, [x8]
		; CHECK-BE-NEXT: mov x8, xzr
		; CHECK-BE-NEXT: .LBB8_1: // %loop
		; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1
		; CHECK-BE-NEXT: add x9, x0, x8, lsl #4
		; CHECK-BE-NEXT: ld1 { v1.16b }, [x9]
		; CHECK-BE-NEXT: add x9, x1, x8, lsl #3
		; CHECK-BE-NEXT: add x8, x8, #1
		; CHECK-BE-NEXT: cmp x8, #1000
		; CHECK-BE-NEXT: tbl v1.16b, { v1.16b }, v0.16b
		; CHECK-BE-NEXT: st1 { v1.8b }, [x9]
		; CHECK-BE-NEXT: b.eq .LBB8_1
		; CHECK-BE-NEXT: // %bb.2: // %exit
		; CHECK-BE-NEXT: ret

		entry:
		br label %loop

		loop:
		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
		%gep.A = getelementptr inbounds <8 x i16>, ptr %A, i64 %iv
		%l.A = load <8 x i16>, ptr %gep.A
		%trunc = trunc <8 x i16> %l.A to <8 x i8>
		%gep.dst = getelementptr inbounds <8 x i8>, ptr %dst, i64 %iv
		store <8 x i8> %trunc, ptr %gep.dst
		%iv.next = add i64 %iv, 1
		%ec = icmp eq i64 %iv.next, 1000
		br i1 %ec, label %loop, label %exit

		exit:
		ret void
		}
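
The constant-pool masks checked above (`lCPI4_0`, `lCPI7_0`, `lCPI8_0` and their `.LCPI*_0` big-endian counterparts) follow a simple pattern: each entry is the byte offset of the least-significant byte of one source element, and unused lanes hold 255. A minimal Python sketch of that pattern follows. It assumes the source vector is viewed as a flat byte array in memory order, and the helper names `trunc_tbl_mask` and `tbl` are hypothetical rather than taken from the patch.

```python
def trunc_tbl_mask(num_elts, src_bytes, big_endian=False, mask_len=16):
    """Byte indices that pick the low byte of each source element.

    Lanes beyond num_elts are filled with 255, which is out of range for
    every table size used here and therefore selects a zero byte.
    """
    low_byte = src_bytes - 1 if big_endian else 0
    mask = [i * src_bytes + low_byte for i in range(num_elts)]
    return mask + [255] * (mask_len - len(mask))


def tbl(table, mask):
    """Model of a TBL lookup: an index past the end of the table yields 0."""
    return [table[i] if i < len(table) else 0 for i in mask]


# Eight little-endian i64 values, laid out as 64 bytes (four vector registers).
values = [0x1122334455667700 + i for i in range(8)]
table = b"".join(v.to_bytes(8, "little") for v in values)
result = tbl(table, trunc_tbl_mask(8, 8))
```

With these definitions, `trunc_tbl_mask(8, 8)` reproduces the `lCPI4_0` bytes, `trunc_tbl_mask(16, 2)` the `lCPI7_0` bytes, and `trunc_tbl_mask(8, 2)` the `lCPI8_0` bytes; passing `big_endian=True` gives the corresponding `CHECK-BE` values.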


Revision Contents

Diff 477022

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll

llvm/test/CodeGen/AArch64/trunc-to-tbl.ll
