This is an archive of the discontinued LLVM Phabricator instance.


Differential D135229

[AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to use tbl instructions
Closed Public

Authored by nilanjana_basu on Oct 4 2022, 5:12 PM.


Details

Reviewers

fhahn
t.p.northover
paquette

Commits

rG02d09ffc1b09: [AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to…

Summary

[AArch64] D133495 added lowering of trunc instructions to 'tbl' for (8|16) x i32 -> (8|16) x i8 conversions. This patch extends that lowering to (8|16) x i64 -> (8|16) x i8 conversions.

A runtime microbenchmark covering all of these cases has been added in D136274.
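To make the idea in the summary concrete, here is a minimal standalone sketch of how a byte table lookup can implement the truncate. It is a software model, not the LLVM code: `buildTruncMask`, `tblLookup` and `truncV8I64ViaTbl` are hypothetical names chosen for illustration, and the model assumes that out-of-range indices produce 0, as the AArch64 TBL instruction does.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Build a 16-byte index vector that selects one byte per source lane.
// Lanes beyond NumElements get 255, which is out of range for any table of
// at most 64 bytes (four 128-bit registers). Hypothetical helper.
std::array<uint8_t, 16> buildTruncMask(unsigned NumElements, unsigned SrcBits,
                                       bool IsLittleEndian) {
  assert(SrcBits % 8 == 0 && NumElements <= 16);
  const unsigned TruncFactor = SrcBits / 8; // bytes per source lane
  std::array<uint8_t, 16> Mask{};
  for (unsigned I = 0; I < 16; ++I) {
    if (I < NumElements)
      Mask[I] = static_cast<uint8_t>(IsLittleEndian
                                         ? I * TruncFactor
                                         : I * TruncFactor + (TruncFactor - 1));
    else
      Mask[I] = 255;
  }
  return Mask;
}

// Software model of a byte table lookup: out-of-range indices yield 0.
std::array<uint8_t, 16> tblLookup(const std::vector<uint8_t> &Table,
                                  const std::array<uint8_t, 16> &Mask) {
  std::array<uint8_t, 16> Out{};
  for (unsigned I = 0; I < 16; ++I)
    Out[I] = Mask[I] < Table.size() ? Table[Mask[I]] : 0;
  return Out;
}

// Truncate eight i64 lanes to i8 using the model (little-endian byte layout).
std::array<uint8_t, 16> truncV8I64ViaTbl(const std::array<uint64_t, 8> &Src) {
  std::vector<uint8_t> Table; // 64 bytes, i.e. four 128-bit table registers
  for (uint64_t V : Src)
    for (unsigned B = 0; B < 8; ++B)
      Table.push_back(static_cast<uint8_t>(V >> (8 * B)));
  return tblLookup(Table, buildTruncMask(8, 64, /*IsLittleEndian=*/true));
}
```

Under these assumptions, `buildTruncMask(8, 64, true)` yields 0, 8, ..., 56 followed by eight 255 entries, which matches the little-endian constant-pool bytes checked in trunc-to-tbl.ll for the 8 x i64 case.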

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

nilanjana_basu created this revision.Oct 4 2022, 5:12 PM

Herald added a project: Restricted Project. · View Herald TranscriptOct 4 2022, 5:12 PM

Herald added a subscriber: hiraditya. · View Herald Transcript

nilanjana_basu requested review of this revision.Oct 4 2022, 5:12 PM

Herald added a project: Restricted Project. · View Herald TranscriptOct 4 2022, 5:12 PM

Herald added a subscriber: llvm-commits. · View Herald Transcript

nilanjana_basu set the repository for this revision to rG LLVM Github Monorepo.Oct 4 2022, 5:47 PM

nilanjana_basu retitled this revision from Extending trunc lowering of 'trunc <8 x i64> %x to <8 x i8>' to use tbl.4 instruction to [AArch64] Extending trunc lowering of 'trunc <8 x i64> %x to <8 x i8>' to use tbl.4 instruction.Oct 4 2022, 6:01 PM

Herald added a subscriber: kristof.beyls. · View Herald TranscriptOct 4 2022, 6:01 PM

Ran git-clang-format & made minor change to reduce LoC.

nilanjana_basu added reviewers: fhahn, t.p.northover.Oct 4 2022, 6:05 PM

Removed an unused variable warning

Harbormaster completed remote builds in B190384: Diff 465246.Oct 4 2022, 7:56 PM

Extended the trunc lowering for other types like 16xi64, 16xi16, 8xi16

Harbormaster completed remote builds in B193763: Diff 469926.Oct 22 2022, 1:19 PM

The automated build tests failed for the previous patch because it was based on a previous commit for a unit test that isn't submitted yet. This patch fixes it by squashing the previous commit, removing the dependency & showing the final update.

Harbormaster completed remote builds in B193990: Diff 470224.Oct 24 2022, 1:00 PM

Ran clang-format since it was failing in the build report at https://buildkite.com/llvm-project/diff-checks/builds/133184

Harbormaster completed remote builds in B194302: Diff 470661.Oct 25 2022, 6:32 PM

nilanjana_basu added a reviewer: paquette.Oct 28 2022, 4:41 PM

nilanjana_basu edited the summary of this revision. (Show Details)Oct 28 2022, 4:57 PM

nilanjana_basu mentioned this in D137221: [MicroBenchmarks] Add benchmarks to check runtime of truncate or zero-extend vector operations in AArch64.Nov 1 2022, 6:31 PM

t.p.northover added inline comments.Nov 2 2022, 8:08 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

13924–13925

I think these are guaranteed to succeed by checks in the caller (and essential here), so cast<...> is probably better. Applies to some of the later dyn_casts too.

14004–14007

There's a lot of duplication in this switch, but it is pretty easy to eyeball for correctness because of that once you get what it's trying to do. So I'm torn, a loop like this would probably be shorter overall:

int ShuffleCount = 128/SrcElemSize;
SmallVector<int> ShuffleLanes;
for (int i = 0; i < ShuffleCount; ++i)
  ShuffleLanes.push_back(i);

SmallVector<Value *> Results;
while (ShuffleLanes.back() < NumElements) {
  Parts.push_back(Builder.CreateBitCast(Builder.CreateShuffleVector(TI->getOperand(0), ShuffleLanes), VecTy));
  for (int i = 0; i < ShuffleCount; ++i)
    ShuffleLanes[i] += ShuffleCount;
  if (Parts.size() == 4) {
    // Call tbl4, push result into Results, clear Parts.
  }
}

// Choose correct tbl (3 now a valid option) and call for rest of Parts, push to Results

// Shuffle-merge all of Results.

and allow the code to apply to a wider range of truncates. What are your views on the implementation?
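As a rough illustration of what the suggested loop computes, the following standalone sketch counts how many 128-bit table registers each TBL call would receive for a given truncate. It models only the partitioning logic; `planTblCalls` is a hypothetical name, and the sketch assumes the source vector fills a whole number of 128-bit registers.

```cpp
#include <cassert>
#include <vector>

// For a truncate of NumElements lanes of SrcElemBits-wide integers, return the
// number of 128-bit table registers consumed by each TBL call, in order.
std::vector<int> planTblCalls(int NumElements, int SrcElemBits) {
  assert(SrcElemBits > 0 && 128 % SrcElemBits == 0);
  const int LanesPerReg = 128 / SrcElemBits; // "ShuffleCount" in the comment
  assert(NumElements % LanesPerReg == 0 && "model assumes whole registers");
  std::vector<int> Calls;
  int Parts = 0;
  for (int FirstLane = 0; FirstLane < NumElements; FirstLane += LanesPerReg) {
    if (++Parts == 4) { // four registers is the most a single TBL can take
      Calls.push_back(4);
      Parts = 0;
    }
  }
  if (Parts != 0)
    Calls.push_back(Parts); // residual tbl1, tbl2 or tbl3
  return Calls;
}
```

With this model, 16 x i64 splits into two tbl4 calls, 8 x i64 and 16 x i32 each need one tbl4, 8 x i32 needs one tbl2, and 12 x i32 would need one tbl3.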

nilanjana_basu mentioned this in rT3b44b6bdd3e8: [MicroBenchmarks] Add benchmarks to check runtime of truncate or zero-extend….Nov 2 2022, 2:05 PM

nilanjana_basu added a parent revision: D137293: [AArch64] Extra unit tests for trunc lowering of vectors.Nov 2 2022, 3:01 PM

Addressed comments by t.p.northover - refactored code to remove redundancy

nilanjana_basu marked an inline comment as done.Nov 3 2022, 7:39 PM

nilanjana_basu added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
14004–14007	I refactored the code as you suggested, so it can now apply to a few extra cases like 12xi32 or 4xi32. However, I haven't modified the old set of allowable cases since I don't know how relevant these few are. In my understanding, we get better performance when tbl2-tbl4 get triggered, as the number of generated instructions decreases. So, I need your opinion on whether we should allow 8xi16 conversions, since they generate a single tbl1 instruction.

Ran clang-format

Harbormaster completed remote builds in B196052: Diff 473110.Nov 3 2022, 8:44 PM

nilanjana_basu removed a parent revision: D137293: [AArch64] Extra unit tests for trunc lowering of vectors.Nov 4 2022, 5:22 PM

nilanjana_basu mentioned this in D138059: [MicroBenchmarks,AArch64] Added correctness test & other performance tests for truncate or zero-extend vector operations.Nov 15 2022, 1:13 PM

Added comments

Harbormaster completed remote builds in B198876: Diff 477022.Nov 22 2022, 12:18 AM

fhahn added inline comments.Nov 22 2022, 3:57 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13924	Is this guaranteed to be a fixed vector type? Could you add a variant of a test with truncates of scalable vectors (`<vscale x 16 x i8>` or something like that)?
13942–13943	It would be great if you could add a brief comment here explaining what kind of masks/shuffles are prepared here.
13967	"store" seems ambiguous here, as we won't emit a store instruction, right?
14036	SmallVector?
14041	SmallVector?
llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
676 ↗	(On Diff #477022)	Similar to D136722, it is likely not profitable to do this when converting to/from the next power-of-2.

fhahn added inline comments.Nov 22 2022, 10:44 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13924	I think it should be fine, I added a test in 4783345426da

Updated comments as mentioned in the reviews. Rebased on tests for this change prior to applying this patch.

Harbormaster completed remote builds in B199088: Diff 477347.Nov 22 2022, 5:00 PM

nilanjana_basu added a parent revision: D137293: [AArch64] Extra unit tests for trunc lowering of vectors.Nov 22 2022, 5:02 PM

Rebasing on commit of test cases prior to application of this patch

Removed case for 'trunc <(8|16)xi16> %x to <(8|16)xi8>' since it was adding more instructions to loop header, while not improving loop instruction count

Updated a comment

nilanjana_basu retitled this revision from [AArch64] Extending lowering of 'trunc <(8|16) x (i16|i64)> %x to <(8|16) x i8>' to use tbl instructions to [AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to use tbl instructions.Nov 23 2022, 2:38 AM

nilanjana_basu edited the summary of this revision. (Show Details)

Removed (8|16)xi16 to (8|16)xi8 conversion because it wasn't showing benefits in instruction count, & additionally adding more instructions to the header. Updated comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13924	Since the source & destination types were checked for FixedVector once in the calling function optimizeExtendOrTruncateConversion(), I didn't check it here again.
13924	Since both the zext & trunc tests for scalable vectors go through the optimizeExtendOrTruncateConversion() function, the zext test in 4783345426da should suffice for the trunc too. Let me know if you think it needs to be replicated for trunc vector too.
13967	I replaced the "store" with "save" to indicate that it is being stored in the compiler's internal vector data structure. Added a comment at the place of combining these results.
llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
676 ↗	(On Diff #477022)	Since we only support a handful of vector type truncates in this implementation, only Yxi16->Yxi8 was a supported next power-of-2 conversion. Removed it.
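The step that combines the saved TBL results can be sketched in isolation. The function below is a standalone model of the final shuffle mask, assuming at most two 16-byte TBL results as in the patch; `finalShuffleMask` is a hypothetical name, and unlike the patch it also returns an identity mask for the case where no shuffle is needed.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Build the lane indices that extract the truncated vector from one or two
// 16-byte TBL results. Indices 16..31 refer to lanes of the second result.
std::vector<int> finalShuffleMask(int ElemsPerTbl, int NumResults) {
  assert(ElemsPerTbl >= 1 && ElemsPerTbl <= 16);
  assert(NumResults == 1 || NumResults == 2);
  std::vector<int> Mask(ElemsPerTbl * NumResults);
  if (NumResults == 1 || ElemsPerTbl == 16) {
    // Single result, or two full results: consecutive lanes starting at 0.
    std::iota(Mask.begin(), Mask.end(), 0);
    return Mask;
  }
  // Two partial results: low lanes of the first, then low lanes of the second.
  std::iota(Mask.begin(), Mask.begin() + ElemsPerTbl, 0);
  std::iota(Mask.begin() + ElemsPerTbl, Mask.end(), 16);
  return Mask;
}
```

For 16 x i64, where each tbl4 produces 8 valid bytes, this gives lanes 0..7 followed by 16..23, which is consistent with the `mov.d v1[1], v2[0]` seen in the updated test output.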

Harbormaster completed remote builds in B199151: Diff 477432.Nov 23 2022, 3:32 AM

Rebasing on parent patch for tests

Harbormaster completed remote builds in B199602: Diff 478030.Nov 25 2022, 3:08 PM

nilanjana_basu edited the summary of this revision. (Show Details)Nov 25 2022, 3:54 PM

Trying to fix rebasing error

Harbormaster completed remote builds in B199605: Diff 478033.Nov 25 2022, 4:35 PM

nilanjana_basu removed a parent revision: D137293: [AArch64] Extra unit tests for trunc lowering of vectors.Nov 28 2022, 4:17 AM

nilanjana_basu mentioned this in rT08de51078b0a: [MicroBenchmarks,AArch64] Added correctness test & other performance tests for….Dec 1 2022, 10:09 PM

LGTM with the inline suggestions. Please wait a day or so with committing in case there are additional comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13933	could you add an assert to make sure the division happens without remainder?
13938	IIUC the only case that can happen here is that `Parts == 4`, right? Might be good to update the check.
13945	Could use `Builder.getInt8(....)`?
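One possible shape for the requested remainder check, written as a standalone sketch rather than the code from the follow-up commit; `truncFactor` is a hypothetical helper name.

```cpp
#include <cassert>

// Ratio between source and destination element widths. The assert documents
// the assumption that the division is exact.
unsigned truncFactor(unsigned SrcElemBits, unsigned DstElemBits) {
  assert(DstElemBits != 0 && SrcElemBits % DstElemBits == 0 &&
         "source element width must be a multiple of the destination width");
  return SrcElemBits / DstElemBits;
}
```

For the cases this revision supports, the factor is 4 for i32 to i8 and 8 for i64 to i8.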

This revision is now accepted and ready to land.Dec 14 2022, 1:47 PM

Closed by commit rG02d09ffc1b09: [AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to… (authored by nilanjana_basu). · Explain WhyDec 15 2022, 7:21 AM

This revision was automatically updated to reflect the committed changes.

nilanjana_basu added a commit: rG02d09ffc1b09: [AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to….

nilanjana_basu mentioned this in rG795868285db9: [AArch64] Minor changes and sanity checks in relation to https://reviews.llvm..Dec 15 2022, 12:09 PM

Addressed the final comments in a separate commit.

Revision Contents

Path

Size

llvm/

lib/

Target/

AArch64/

AArch64ISelLowering.cpp

146 lines

test/

CodeGen/

AArch64/

trunc-to-tbl.ll

379 lines

Diff 483181

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp


Show First 20 Lines • Show All 13,914 Lines • ▼ Show 20 Lines	static void createTblShuffleForZExt(ZExtInst *ZExt, bool IsLittleEndian) {
Result = Builder.CreateBitCast(Result, DstTy);		Result = Builder.CreateBitCast(Result, DstTy);
ZExt->replaceAllUsesWith(Result);		ZExt->replaceAllUsesWith(Result);
ZExt->eraseFromParent();		ZExt->eraseFromParent();
}		}

static void createTblForTrunc(TruncInst *TI, bool IsLittleEndian) {		static void createTblForTrunc(TruncInst *TI, bool IsLittleEndian) {
IRBuilder<> Builder(TI);		IRBuilder<> Builder(TI);
SmallVector<Value *> Parts;		SmallVector<Value *> Parts;
		int NumElements = cast<FixedVectorType>(TI->getType())->getNumElements();
		auto *SrcTy = cast<FixedVectorType>(TI->getOperand(0)->getType());
		auto *DstTy = cast<FixedVectorType>(TI->getType());
		assert(SrcTy->getElementType()->isIntegerTy() &&
		"Non-integer type source vector element is not supported");
		assert(DstTy->getElementType()->isIntegerTy(8) &&
		"Unsupported destination vector element type");
		unsigned SrcElemTySz =
		cast<IntegerType>(SrcTy->getElementType())->getBitWidth();
		unsigned TruncFactor =
		SrcElemTySz / cast<IntegerType>(DstTy->getElementType())->getBitWidth();
		assert((SrcElemTySz == 16 || SrcElemTySz == 32 || SrcElemTySz == 64) &&
		"Unsupported source vector element type size");
Type *VecTy = FixedVectorType::get(Builder.getInt8Ty(), 16);		Type *VecTy = FixedVectorType::get(Builder.getInt8Ty(), 16);
Parts.push_back(Builder.CreateBitCast(
Builder.CreateShuffleVector(TI->getOperand(0), {0, 1, 2, 3}), VecTy));
Parts.push_back(Builder.CreateBitCast(
Builder.CreateShuffleVector(TI->getOperand(0), {4, 5, 6, 7}), VecTy));

Intrinsic::ID TblID = Intrinsic::aarch64_neon_tbl2;		// Create a mask to choose every nth byte from the source vector table of
unsigned NumElements = cast<FixedVectorType>(TI->getType())->getNumElements();		// bytes to create the truncated destination vector, where 'n' is the truncate
if (NumElements == 16) {		// ratio. For example, for a truncate from Yxi64 to Yxi8, choose
Parts.push_back(Builder.CreateBitCast(		// 0,8,16,..Y*8th bytes for the little-endian format
Builder.CreateShuffleVector(TI->getOperand(0), {8, 9, 10, 11}), VecTy));		SmallVector<Constant *, 16> MaskConst;
		for (int Itr = 0; Itr < 16; Itr++) {
		if (Itr < NumElements)
		MaskConst.push_back(ConstantInt::get(
		Builder.getInt8Ty(), IsLittleEndian
		? Itr * TruncFactor
		: Itr * TruncFactor + (TruncFactor - 1)));
		else
		MaskConst.push_back(ConstantInt::get(Builder.getInt8Ty(), 255));
		}

		int MaxTblSz = 128 * 4;
		int MaxSrcSz = SrcElemTySz * NumElements;
		int ElemsPerTbl =
		(MaxTblSz > MaxSrcSz) ? NumElements : (MaxTblSz / SrcElemTySz);
		assert(ElemsPerTbl <= 16 &&
		"Maximum elements selected using TBL instruction cannot exceed 16!");

		int ShuffleCount = 128 / SrcElemTySz;
		SmallVector<int> ShuffleLanes;
		for (int i = 0; i < ShuffleCount; ++i)
		ShuffleLanes.push_back(i);

		// Create TBL's table of bytes in 1,2,3 or 4 FP/SIMD registers using shuffles
		// over the source vector. If TBL's maximum 4 FP/SIMD registers are saturated,
		// call TBL & save the result in a vector of TBL results for combining later.
		SmallVector<Value *> Results;
		while (ShuffleLanes.back() < NumElements) {
Parts.push_back(Builder.CreateBitCast(		Parts.push_back(Builder.CreateBitCast(
Builder.CreateShuffleVector(TI->getOperand(0), {12, 13, 14, 15}),		Builder.CreateShuffleVector(TI->getOperand(0), ShuffleLanes), VecTy));
VecTy));
TblID = Intrinsic::aarch64_neon_tbl4;		if (Parts.size() >= 4) {
		auto *F = Intrinsic::getDeclaration(TI->getModule(),
		Intrinsic::aarch64_neon_tbl4, VecTy);
		Parts.push_back(ConstantVector::get(MaskConst));
		Results.push_back(Builder.CreateCall(F, Parts));
		Parts.clear();
}		}
SmallVector<Constant *, 16> MaskConst;
for (unsigned Idx = 0; Idx < NumElements * 4; Idx += 4)
MaskConst.push_back(
ConstantInt::get(Builder.getInt8Ty(), IsLittleEndian ? Idx : Idx + 3));

for (unsigned Idx = NumElements * 4; Idx < 64; Idx += 4)		for (int i = 0; i < ShuffleCount; ++i)
MaskConst.push_back(ConstantInt::get(Builder.getInt8Ty(), 255));		ShuffleLanes[i] += ShuffleCount;
		}

		assert((Parts.empty() || Results.empty()) &&
		"Lowering trunc for vectors requiring different TBL instructions is "
		"not supported!");
		// Call TBL for the residual table bytes present in 1,2, or 3 FP/SIMD
		// registers
		if (!Parts.empty()) {
		Intrinsic::ID TblID;
		switch (Parts.size()) {
		case 1:
		TblID = Intrinsic::aarch64_neon_tbl1;
		break;
		case 2:
		TblID = Intrinsic::aarch64_neon_tbl2;
		break;
		case 3:
		TblID = Intrinsic::aarch64_neon_tbl3;
		break;
		}

		auto *F = Intrinsic::getDeclaration(TI->getModule(), TblID, VecTy);
Parts.push_back(ConstantVector::get(MaskConst));		Parts.push_back(ConstantVector::get(MaskConst));
auto *F =		Results.push_back(Builder.CreateCall(F, Parts));
Intrinsic::getDeclaration(TI->getModule(), TblID, Parts[0]->getType());		}
Value *Res = Builder.CreateCall(F, Parts);
		// Extract the destination vector from TBL result(s) after combining them
if (NumElements == 8)		// where applicable. Currently, at most two TBLs are supported.
Res = Builder.CreateShuffleVector(Res, {0, 1, 2, 3, 4, 5, 6, 7});		assert(Results.size() <= 2 && "Trunc lowering does not support generation of "
TI->replaceAllUsesWith(Res);		"more than 2 tbl instructions!");
		Value *FinalResult = Results[0];
		if (Results.size() == 1) {
		if (ElemsPerTbl < 16) {
		SmallVector<int> FinalMask(ElemsPerTbl);
		std::iota(FinalMask.begin(), FinalMask.end(), 0);
		FinalResult = Builder.CreateShuffleVector(Results[0], FinalMask);
		}
		} else {
		SmallVector<int> FinalMask(ElemsPerTbl * Results.size());
		if (ElemsPerTbl < 16) {
		std::iota(FinalMask.begin(), FinalMask.begin() + ElemsPerTbl, 0);
		std::iota(FinalMask.begin() + ElemsPerTbl, FinalMask.end(), 16);
		} else {
		std::iota(FinalMask.begin(), FinalMask.end(), 0);
		}
		FinalResult =
		Builder.CreateShuffleVector(Results[0], Results[1], FinalMask);
		}

		TI->replaceAllUsesWith(FinalResult);
TI->eraseFromParent();		TI->eraseFromParent();
}		}

bool AArch64TargetLowering::optimizeExtendOrTruncateConversion(Instruction *I,		bool AArch64TargetLowering::optimizeExtendOrTruncateConversion(Instruction *I,
Loop *L) const {		Loop *L) const {
// Try to optimize conversions using tbl. This requires materializing constant		// Try to optimize conversions using tbl. This requires materializing constant
// index vectors, which can increase code size and add loads. Skip the		// index vectors, which can increase code size and add loads. Skip the
// transform unless the conversion is in a loop block guaranteed to execute		// transform unless the conversion is in a loop block guaranteed to execute
// and we are not optimizing for size.		// and we are not optimizing for size.
Function *F = I->getParent()->getParent();		Function *F = I->getParent()->getParent();
if (!L || L->getHeader() != I->getParent() || F->hasMinSize() ||		if (!L || L->getHeader() != I->getParent() || F->hasMinSize() ||
F->hasOptSize())		F->hasOptSize())
return false;		return false;

auto *SrcTy = dyn_cast<FixedVectorType>(I->getOperand(0)->getType());		auto *SrcTy = dyn_cast<FixedVectorType>(I->getOperand(0)->getType());
auto *DstTy = dyn_cast<FixedVectorType>(I->getType());		auto *DstTy = dyn_cast<FixedVectorType>(I->getType());
if (!SrcTy || !DstTy)		if (!SrcTy || !DstTy)
Show All 36 Lines	auto *WideConv = Builder.CreateFPToUI(FPToUI->getOperand(0),
VectorType::getInteger(SrcTy));		VectorType::getInteger(SrcTy));
auto *TruncI = Builder.CreateTrunc(WideConv, DstTy);		auto *TruncI = Builder.CreateTrunc(WideConv, DstTy);
I->replaceAllUsesWith(TruncI);		I->replaceAllUsesWith(TruncI);
I->eraseFromParent();		I->eraseFromParent();
createTblForTrunc(cast<TruncInst>(TruncI), Subtarget->isLittleEndian());		createTblForTrunc(cast<TruncInst>(TruncI), Subtarget->isLittleEndian());
return true;		return true;
}		}

// Convert 'trunc <(8|16) x i32> %x to <(8|16) x i8>' to a single tbl.4		// Convert 'trunc <(8|16) x (i32|i64)> %x to <(8|16) x i8>' to an appropriate
// instruction selecting the lowest 8 bits per lane of the input interpreted		// tbl instruction selecting the lowest/highest (little/big endian) 8 bits
// as 2 or 4 <4 x i32> vectors.		// per lane of the input that is represented using 1,2,3 or 4 128-bit table
		// registers
auto *TI = dyn_cast<TruncInst>(I);		auto *TI = dyn_cast<TruncInst>(I);
if (TI && (SrcTy->getNumElements() == 8 || SrcTy->getNumElements() == 16) &&		if (TI && DstTy->getElementType()->isIntegerTy(8) &&
SrcTy->getElementType()->isIntegerTy(32) &&		((SrcTy->getElementType()->isIntegerTy(32) ||
DstTy->getElementType()->isIntegerTy(8)) {		SrcTy->getElementType()->isIntegerTy(64)) &&
		(SrcTy->getNumElements() == 16 || SrcTy->getNumElements() == 8))) {
createTblForTrunc(TI, Subtarget->isLittleEndian());		createTblForTrunc(TI, Subtarget->isLittleEndian());
return true;		return true;
}		}

return false;		return false;
}		}

bool AArch64TargetLowering::hasPairedLoad(EVT LoadedType,		bool AArch64TargetLowering::hasPairedLoad(EVT LoadedType,

llvm/test/CodeGen/AArch64/trunc-to-tbl.ll

Show First 20 Lines • Show All 229 Lines • ▼ Show 20 Lines	loop:
%iv.next = add i64 %iv, 1		%iv.next = add i64 %iv, 1
%ec = icmp eq i64 %iv.next, 1000		%ec = icmp eq i64 %iv.next, 1000
br i1 %ec, label %loop, label %exit		br i1 %ec, label %loop, label %exit

exit:		exit:
ret void		ret void
}		}

		; CHECK-LABEL: lCPI3_0:
		; CHECK-NEXT: .byte 0 ; 0x0
		; CHECK-NEXT: .byte 8 ; 0x8
		; CHECK-NEXT: .byte 16 ; 0x10
		; CHECK-NEXT: .byte 24 ; 0x18
		; CHECK-NEXT: .byte 32 ; 0x20
		; CHECK-NEXT: .byte 40 ; 0x28
		; CHECK-NEXT: .byte 48 ; 0x30
		; CHECK-NEXT: .byte 56 ; 0x38
		; CHECK-NEXT: .byte 64 ; 0x40
		; CHECK-NEXT: .byte 72 ; 0x48
		; CHECK-NEXT: .byte 80 ; 0x50
		; CHECK-NEXT: .byte 88 ; 0x58
		; CHECK-NEXT: .byte 96 ; 0x60
		; CHECK-NEXT: .byte 104 ; 0x68
		; CHECK-NEXT: .byte 112 ; 0x70
		; CHECK-NEXT: .byte 120 ; 0x78

		; CHECK-BE-LABEL: .LCPI3_0:
		; CHECK-BE-NEXT: .byte 7 // 0x7
		; CHECK-BE-NEXT: .byte 15 // 0xf
		; CHECK-BE-NEXT: .byte 23 // 0x17
		; CHECK-BE-NEXT: .byte 31 // 0x1f
		; CHECK-BE-NEXT: .byte 39 // 0x27
		; CHECK-BE-NEXT: .byte 47 // 0x2f
		; CHECK-BE-NEXT: .byte 55 // 0x37
		; CHECK-BE-NEXT: .byte 63 // 0x3f
		; CHECK-BE-NEXT: .byte 71 // 0x47
		; CHECK-BE-NEXT: .byte 79 // 0x4f
		; CHECK-BE-NEXT: .byte 87 // 0x57
		; CHECK-BE-NEXT: .byte 95 // 0x5f
		; CHECK-BE-NEXT: .byte 103 // 0x67
		; CHECK-BE-NEXT: .byte 111 // 0x6f
		; CHECK-BE-NEXT: .byte 119 // 0x77
		; CHECK-BE-NEXT: .byte 127 // 0x7f
define void @trunc_v16i64_to_v16i8_in_loop(ptr %A, ptr %dst) {		define void @trunc_v16i64_to_v16i8_in_loop(ptr %A, ptr %dst) {
; CHECK-LABEL: trunc_v16i64_to_v16i8_in_loop:		; CHECK-LABEL: trunc_v16i64_to_v16i8_in_loop:
; CHECK: ; %bb.0: ; %entry		; CHECK: ; %bb.0: ; %entry
		; CHECK-NEXT: Lloh4:
		; CHECK-NEXT: adrp x9, lCPI3_0@PAGE
; CHECK-NEXT: mov x8, xzr		; CHECK-NEXT: mov x8, xzr
		; CHECK-NEXT: Lloh5:
		; CHECK-NEXT: ldr q0, [x9, lCPI3_0@PAGEOFF]
; CHECK-NEXT: LBB3_1: ; %loop		; CHECK-NEXT: LBB3_1: ; %loop
; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1		; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
; CHECK-NEXT: add x9, x0, x8, lsl #7		; CHECK-NEXT: add x9, x0, x8, lsl #7
; CHECK-NEXT: ldp q3, q2, [x9, #96]		; CHECK-NEXT: ldp q1, q2, [x9]
; CHECK-NEXT: ldp q1, q0, [x9, #32]		; CHECK-NEXT: ldp q3, q4, [x9, #32]
; CHECK-NEXT: uzp1.4s v2, v3, v2		; CHECK-NEXT: ldp q16, q17, [x9, #64]
; CHECK-NEXT: ldp q5, q4, [x9, #64]		; CHECK-NEXT: tbl.16b v1, { v1, v2, v3, v4 }, v0
; CHECK-NEXT: uzp1.4s v0, v1, v0		; CHECK-NEXT: ldp q18, q19, [x9, #96]
; CHECK-NEXT: ldp q3, q6, [x9]		; CHECK-NEXT: tbl.16b v2, { v16, v17, v18, v19 }, v0
; CHECK-NEXT: uzp1.4s v4, v5, v4		; CHECK-NEXT: mov.d v1[1], v2[0]
; CHECK-NEXT: uzp1.8h v2, v4, v2		; CHECK-NEXT: str q1, [x1, x8, lsl #4]
; CHECK-NEXT: uzp1.4s v1, v3, v6
; CHECK-NEXT: uzp1.8h v0, v1, v0
; CHECK-NEXT: uzp1.16b v0, v0, v2
; CHECK-NEXT: str q0, [x1, x8, lsl #4]
; CHECK-NEXT: add x8, x8, #1		; CHECK-NEXT: add x8, x8, #1
; CHECK-NEXT: cmp x8, #1000		; CHECK-NEXT: cmp x8, #1000
; CHECK-NEXT: b.eq LBB3_1		; CHECK-NEXT: b.eq LBB3_1
; CHECK-NEXT: ; %bb.2: ; %exit		; CHECK-NEXT: ; %bb.2: ; %exit
; CHECK-NEXT: ret		; CHECK-NEXT: ret
;
; CHECK-BE-LABEL: trunc_v16i64_to_v16i8_in_loop:		; CHECK-BE-LABEL: trunc_v16i64_to_v16i8_in_loop:
; CHECK-BE: // %bb.0: // %entry		; CHECK-BE: // %bb.0: // %entry
		; CHECK-BE-NEXT: adrp x8, .LCPI3_0
		; CHECK-BE-NEXT: add x8, x8, :lo12:.LCPI3_0
		; CHECK-BE-NEXT: ld1 { v0.16b }, [x8]
; CHECK-BE-NEXT: mov x8, xzr		; CHECK-BE-NEXT: mov x8, xzr
; CHECK-BE-NEXT: .LBB3_1: // %loop		; CHECK-BE-NEXT: .LBB3_1: // %loop
; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1		; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-BE-NEXT: add x9, x0, x8, lsl #7		; CHECK-BE-NEXT: add x9, x0, x8, lsl #7
; CHECK-BE-NEXT: add x10, x9, #48		; CHECK-BE-NEXT: add x10, x9, #16
; CHECK-BE-NEXT: add x11, x9, #32		; CHECK-BE-NEXT: add x11, x9, #32
; CHECK-BE-NEXT: ld1 { v5.2d }, [x9]		; CHECK-BE-NEXT: ld1 { v1.16b }, [x9]
; CHECK-BE-NEXT: ld1 { v0.2d }, [x10]		; CHECK-BE-NEXT: ld1 { v2.16b }, [x10]
		; CHECK-BE-NEXT: add x10, x9, #48
		; CHECK-BE-NEXT: ld1 { v3.16b }, [x11]
		; CHECK-BE-NEXT: add x11, x9, #64
		; CHECK-BE-NEXT: ld1 { v4.16b }, [x10]
; CHECK-BE-NEXT: add x10, x9, #80		; CHECK-BE-NEXT: add x10, x9, #80
; CHECK-BE-NEXT: ld1 { v1.2d }, [x11]		; CHECK-BE-NEXT: ld1 { v16.16b }, [x11]
; CHECK-BE-NEXT: add x11, x9, #112		; CHECK-BE-NEXT: add x11, x9, #96
; CHECK-BE-NEXT: ld1 { v2.2d }, [x10]		; CHECK-BE-NEXT: add x9, x9, #112
; CHECK-BE-NEXT: add x10, x9, #96		; CHECK-BE-NEXT: ld1 { v17.16b }, [x10]
; CHECK-BE-NEXT: ld1 { v3.2d }, [x11]		; CHECK-BE-NEXT: tbl v1.16b, { v1.16b, v2.16b, v3.16b, v4.16b }, v0.16b
; CHECK-BE-NEXT: uzp1 v0.4s, v1.4s, v0.4s		; CHECK-BE-NEXT: ld1 { v18.16b }, [x11]
; CHECK-BE-NEXT: ld1 { v4.2d }, [x10]		; CHECK-BE-NEXT: ld1 { v19.16b }, [x9]
; CHECK-BE-NEXT: add x10, x9, #64
; CHECK-BE-NEXT: add x9, x9, #16
; CHECK-BE-NEXT: ld1 { v6.2d }, [x10]
; CHECK-BE-NEXT: ld1 { v7.2d }, [x9]
; CHECK-BE-NEXT: add x9, x1, x8, lsl #4		; CHECK-BE-NEXT: add x9, x1, x8, lsl #4
; CHECK-BE-NEXT: uzp1 v3.4s, v4.4s, v3.4s
; CHECK-BE-NEXT: add x8, x8, #1		; CHECK-BE-NEXT: add x8, x8, #1
; CHECK-BE-NEXT: cmp x8, #1000		; CHECK-BE-NEXT: cmp x8, #1000
; CHECK-BE-NEXT: uzp1 v2.4s, v6.4s, v2.4s		; CHECK-BE-NEXT: tbl v2.16b, { v16.16b, v17.16b, v18.16b, v19.16b }, v0.16b
; CHECK-BE-NEXT: uzp1 v1.4s, v5.4s, v7.4s		; CHECK-BE-NEXT: mov v1.d[1], v2.d[0]
; CHECK-BE-NEXT: uzp1 v2.8h, v2.8h, v3.8h		; CHECK-BE-NEXT: st1 { v1.16b }, [x9]
; CHECK-BE-NEXT: uzp1 v0.8h, v1.8h, v0.8h
; CHECK-BE-NEXT: uzp1 v0.16b, v0.16b, v2.16b
; CHECK-BE-NEXT: st1 { v0.16b }, [x9]
; CHECK-BE-NEXT: b.eq .LBB3_1		; CHECK-BE-NEXT: b.eq .LBB3_1
; CHECK-BE-NEXT: // %bb.2: // %exit		; CHECK-BE-NEXT: // %bb.2: // %exit
; CHECK-BE-NEXT: ret		; CHECK-BE-NEXT: ret

entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
%gep.A = getelementptr inbounds <16 x i64>, ptr %A, i64 %iv		%gep.A = getelementptr inbounds <16 x i64>, ptr %A, i64 %iv
%l.A = load <16 x i64>, ptr %gep.A		%l.A = load <16 x i64>, ptr %gep.A
%trunc = trunc <16 x i64> %l.A to <16 x i8>		%trunc = trunc <16 x i64> %l.A to <16 x i8>
%gep.dst = getelementptr inbounds <16 x i8>, ptr %dst, i64 %iv		%gep.dst = getelementptr inbounds <16 x i8>, ptr %dst, i64 %iv
store <16 x i8> %trunc, ptr %gep.dst		store <16 x i8> %trunc, ptr %gep.dst
%iv.next = add i64 %iv, 1		%iv.next = add i64 %iv, 1
%ec = icmp eq i64 %iv.next, 1000		%ec = icmp eq i64 %iv.next, 1000
br i1 %ec, label %loop, label %exit		br i1 %ec, label %loop, label %exit

exit:		exit:
ret void		ret void
}		}

		; CHECK-LABEL: lCPI4_0:
		; CHECK-NEXT: .byte 0 ; 0x0
		; CHECK-NEXT: .byte 8 ; 0x8
		; CHECK-NEXT: .byte 16 ; 0x10
		; CHECK-NEXT: .byte 24 ; 0x18
		; CHECK-NEXT: .byte 32 ; 0x20
		; CHECK-NEXT: .byte 40 ; 0x28
		; CHECK-NEXT: .byte 48 ; 0x30
		; CHECK-NEXT: .byte 56 ; 0x38
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff

		; CHECK-BE-LABEL: .LCPI4_0:
		; CHECK-BE-NEXT: .byte 7 // 0x7
		; CHECK-BE-NEXT: .byte 15 // 0xf
		; CHECK-BE-NEXT: .byte 23 // 0x17
		; CHECK-BE-NEXT: .byte 31 // 0x1f
		; CHECK-BE-NEXT: .byte 39 // 0x27
		; CHECK-BE-NEXT: .byte 47 // 0x2f
		; CHECK-BE-NEXT: .byte 55 // 0x37
		; CHECK-BE-NEXT: .byte 63 // 0x3f
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
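
The two constant-pool blocks above hold the byte indices that `tbl` uses to pick the least-significant byte of each `i64` element: offsets 0, 8, ..., 56 for little-endian and 7, 15, ..., 63 for big-endian, padded with 255. The following is a minimal Python sketch of that selection, not the code in the patch. The helper names (`tbl_mask`, `tbl`, `trunc_v8i64_to_v8i8`) are hypothetical, and the table is modelled as the 64 source bytes in memory order, with any out-of-range index producing 0 as `tbl` does.

```python
def tbl_mask(num_elts=8, src_bytes=8, big_endian=False, mask_len=16):
    """Byte indices selecting the least-significant byte of each source element."""
    # Little-endian stores the low byte first in each element, big-endian last.
    offset = src_bytes - 1 if big_endian else 0
    mask = [i * src_bytes + offset for i in range(num_elts)]
    # Pad with 255: an out-of-range index makes tbl write 0 to that lane.
    return mask + [255] * (mask_len - len(mask))


def tbl(table, mask):
    """Model of a table lookup: out-of-range indices yield 0."""
    return [table[i] if i < len(table) else 0 for i in mask]


def trunc_v8i64_to_v8i8(values, big_endian=False):
    """Truncate eight 64-bit values to bytes with a single table lookup."""
    order = "big" if big_endian else "little"
    # 8 x 8 bytes = 64 bytes, i.e. the four 16-byte table registers.
    table = b"".join(v.to_bytes(8, order) for v in values)
    # Only the low 8 result bytes are stored (the d-register store).
    return tbl(table, tbl_mask(big_endian=big_endian))[:8]
```

Calling `tbl_mask()` and `tbl_mask(big_endian=True)` reproduces the two byte lists checked in the test, which is one way to see why the same `tbl` instruction serves both endiannesses with different constants.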
define void @trunc_v8i64_to_v8i8_in_loop(ptr %A, ptr %dst) {		define void @trunc_v8i64_to_v8i8_in_loop(ptr %A, ptr %dst) {
; CHECK-LABEL: trunc_v8i64_to_v8i8_in_loop:		; CHECK-LABEL: trunc_v8i64_to_v8i8_in_loop:
; CHECK: ; %bb.0: ; %entry		; CHECK: ; %bb.0: ; %entry
		; CHECK-NEXT: Lloh6:
		; CHECK-NEXT: adrp x9, lCPI4_0@PAGE
; CHECK-NEXT: mov x8, xzr		; CHECK-NEXT: mov x8, xzr
		; CHECK-NEXT: Lloh7:
		; CHECK-NEXT: ldr q0, [x9, lCPI4_0@PAGEOFF]
; CHECK-NEXT: LBB4_1: ; %loop		; CHECK-NEXT: LBB4_1: ; %loop
; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1		; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
; CHECK-NEXT: add x9, x0, x8, lsl #6		; CHECK-NEXT: add x9, x0, x8, lsl #6
; CHECK-NEXT: ldp q1, q0, [x9, #32]		; CHECK-NEXT: ldp q1, q2, [x9]
; CHECK-NEXT: ldp q3, q2, [x9]		; CHECK-NEXT: ldp q3, q4, [x9, #32]
; CHECK-NEXT: uzp1.4s v0, v1, v0		; CHECK-NEXT: tbl.16b v1, { v1, v2, v3, v4 }, v0
; CHECK-NEXT: uzp1.4s v1, v3, v2		; CHECK-NEXT: str d1, [x1, x8, lsl #3]
; CHECK-NEXT: uzp1.8h v0, v1, v0
; CHECK-NEXT: xtn.8b v0, v0
; CHECK-NEXT: str d0, [x1, x8, lsl #3]
; CHECK-NEXT: add x8, x8, #1		; CHECK-NEXT: add x8, x8, #1
; CHECK-NEXT: cmp x8, #1000		; CHECK-NEXT: cmp x8, #1000
; CHECK-NEXT: b.eq LBB4_1		; CHECK-NEXT: b.eq LBB4_1
; CHECK-NEXT: ; %bb.2: ; %exit		; CHECK-NEXT: ; %bb.2: ; %exit
; CHECK-NEXT: ret		; CHECK-NEXT: ret
;		; CHECK-NEXT: .loh AdrpLdr Lloh6, Lloh7

; CHECK-BE-LABEL: trunc_v8i64_to_v8i8_in_loop:		; CHECK-BE-LABEL: trunc_v8i64_to_v8i8_in_loop:
; CHECK-BE: // %bb.0: // %entry		; CHECK-BE: // %bb.0: // %entry
		; CHECK-BE-NEXT: adrp x8, .LCPI4_0
		; CHECK-BE-NEXT: add x8, x8, :lo12:.LCPI4_0
		; CHECK-BE-NEXT: ld1 { v0.16b }, [x8]
; CHECK-BE-NEXT: mov x8, xzr		; CHECK-BE-NEXT: mov x8, xzr
; CHECK-BE-NEXT: .LBB4_1: // %loop		; CHECK-BE-NEXT: .LBB4_1: // %loop
; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1		; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-BE-NEXT: add x9, x0, x8, lsl #6		; CHECK-BE-NEXT: add x9, x0, x8, lsl #6
; CHECK-BE-NEXT: add x10, x9, #48		; CHECK-BE-NEXT: add x10, x9, #16
; CHECK-BE-NEXT: ld1 { v1.2d }, [x9]		; CHECK-BE-NEXT: add x11, x9, #32
; CHECK-BE-NEXT: ld1 { v0.2d }, [x10]		; CHECK-BE-NEXT: ld1 { v1.16b }, [x9]
; CHECK-BE-NEXT: add x10, x9, #32		; CHECK-BE-NEXT: add x9, x9, #48
; CHECK-BE-NEXT: add x9, x9, #16		; CHECK-BE-NEXT: ld1 { v2.16b }, [x10]
; CHECK-BE-NEXT: ld1 { v2.2d }, [x10]		; CHECK-BE-NEXT: ld1 { v3.16b }, [x11]
; CHECK-BE-NEXT: ld1 { v3.2d }, [x9]		; CHECK-BE-NEXT: ld1 { v4.16b }, [x9]
; CHECK-BE-NEXT: add x9, x1, x8, lsl #3		; CHECK-BE-NEXT: add x9, x1, x8, lsl #3
; CHECK-BE-NEXT: add x8, x8, #1		; CHECK-BE-NEXT: add x8, x8, #1
; CHECK-BE-NEXT: cmp x8, #1000		; CHECK-BE-NEXT: cmp x8, #1000
; CHECK-BE-NEXT: uzp1 v0.4s, v2.4s, v0.4s		; CHECK-BE-NEXT: tbl v1.16b, { v1.16b, v2.16b, v3.16b, v4.16b }, v0.16b
; CHECK-BE-NEXT: uzp1 v1.4s, v1.4s, v3.4s		; CHECK-BE-NEXT: st1 { v1.8b }, [x9]
; CHECK-BE-NEXT: uzp1 v0.8h, v1.8h, v0.8h
; CHECK-BE-NEXT: xtn v0.8b, v0.8h
; CHECK-BE-NEXT: st1 { v0.8b }, [x9]
; CHECK-BE-NEXT: b.eq .LBB4_1		; CHECK-BE-NEXT: b.eq .LBB4_1
; CHECK-BE-NEXT: // %bb.2: // %exit		; CHECK-BE-NEXT: // %bb.2: // %exit
; CHECK-BE-NEXT: ret		; CHECK-BE-NEXT: ret

entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
%gep.A = getelementptr inbounds <8 x i64>, ptr %A, i64 %iv		%gep.A = getelementptr inbounds <8 x i64>, ptr %A, i64 %iv
%l.A = load <8 x i64>, ptr %gep.A		%l.A = load <8 x i64>, ptr %gep.A
[... 184 lines not shown ...]

exit:		exit:
ret void		ret void
}		}

define void @trunc_v16i16_to_v16i8_in_loop(ptr %A, ptr %dst) {		define void @trunc_v16i16_to_v16i8_in_loop(ptr %A, ptr %dst) {
; CHECK-LABEL: trunc_v16i16_to_v16i8_in_loop:		; CHECK-LABEL: trunc_v16i16_to_v16i8_in_loop:
; CHECK: ; %bb.0: ; %entry		; CHECK: ; %bb.0: ; %entry
; CHECK-NEXT: mov x8, xzr		; CHECK-NEXT: mov x8, xzr
; CHECK-NEXT: LBB7_1: ; %loop		; CHECK-NEXT: LBB7_1: ; %loop
; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1		; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
; CHECK-NEXT: add x9, x0, x8, lsl #5		; CHECK-NEXT: add x9, x0, x8, lsl #5
; CHECK-NEXT: ldp q1, q0, [x9]		; CHECK-NEXT: ldp q1, q0, [x9]
; CHECK-NEXT: uzp1.16b v0, v1, v0		; CHECK-NEXT: uzp1.16b v0, v1, v0
; CHECK-NEXT: str q0, [x1, x8, lsl #4]		; CHECK-NEXT: str q0, [x1, x8, lsl #4]
; CHECK-NEXT: add x8, x8, #1		; CHECK-NEXT: add x8, x8, #1
; CHECK-NEXT: cmp x8, #1000		; CHECK-NEXT: cmp x8, #1000
; CHECK-NEXT: b.eq LBB7_1		; CHECK-NEXT: b.eq LBB7_1
; CHECK-NEXT: ; %bb.2: ; %exit		; CHECK-NEXT: ; %bb.2: ; %exit
; CHECK-NEXT: ret		; CHECK-NEXT: ret


; CHECK-BE-LABEL: trunc_v16i16_to_v16i8_in_loop:		; CHECK-BE-LABEL: trunc_v16i16_to_v16i8_in_loop:
; CHECK-BE: // %bb.0: // %entry		; CHECK-BE: // %bb.0: // %entry
; CHECK-BE-NEXT: mov x8, xzr		; CHECK-BE-NEXT: mov x8, xzr
; CHECK-BE-NEXT: .LBB7_1: // %loop		; CHECK-BE-NEXT: .LBB7_1: // %loop
; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1		; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-BE-NEXT: add x9, x0, x8, lsl #5		; CHECK-BE-NEXT: add x9, x0, x8, lsl #5
; CHECK-BE-NEXT: add x10, x9, #16		; CHECK-BE-NEXT: add x10, x9, #16
; CHECK-BE-NEXT: ld1 { v0.8h }, [x9]		; CHECK-BE-NEXT: ld1 { v0.8h }, [x9]
; CHECK-BE-NEXT: add x9, x1, x8, lsl #4		; CHECK-BE-NEXT: add x9, x1, x8, lsl #4
; CHECK-BE-NEXT: add x8, x8, #1		; CHECK-BE-NEXT: add x8, x8, #1
; CHECK-BE-NEXT: ld1 { v1.8h }, [x10]		; CHECK-BE-NEXT: ld1 { v1.8h }, [x10]
; CHECK-BE-NEXT: cmp x8, #1000		; CHECK-BE-NEXT: cmp x8, #1000
; CHECK-BE-NEXT: uzp1 v0.16b, v0.16b, v1.16b		; CHECK-BE-NEXT: uzp1 v0.16b, v0.16b, v1.16b
; CHECK-BE-NEXT: st1 { v0.16b }, [x9]		; CHECK-BE-NEXT: st1 { v0.16b }, [x9]
; CHECK-BE-NEXT: b.eq .LBB7_1		; CHECK-BE-NEXT: b.eq .LBB7_1
; CHECK-BE-NEXT: // %bb.2: // %exit		; CHECK-BE-NEXT: // %bb.2: // %exit
; CHECK-BE-NEXT: ret		; CHECK-BE-NEXT: ret


entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
%gep.A = getelementptr inbounds <16 x i16>, ptr %A, i64 %iv		%gep.A = getelementptr inbounds <16 x i16>, ptr %A, i64 %iv
%l.A = load <16 x i16>, ptr %gep.A		%l.A = load <16 x i16>, ptr %gep.A
%trunc = trunc <16 x i16> %l.A to <16 x i8>		%trunc = trunc <16 x i16> %l.A to <16 x i8>
%gep.dst = getelementptr inbounds <16 x i8>, ptr %dst, i64 %iv		%gep.dst = getelementptr inbounds <16 x i8>, ptr %dst, i64 %iv
store <16 x i8> %trunc, ptr %gep.dst		store <16 x i8> %trunc, ptr %gep.dst
%iv.next = add i64 %iv, 1		%iv.next = add i64 %iv, 1
%ec = icmp eq i64 %iv.next, 1000		%ec = icmp eq i64 %iv.next, 1000
br i1 %ec, label %loop, label %exit		br i1 %ec, label %loop, label %exit

exit:		exit:
ret void		ret void
}		}
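
The `i16` to `i8` case above is identical on both sides of the diff: it still lowers to a single `uzp1`, which keeps the even-numbered byte lanes of its two source registers. The following is a small Python model of that instruction for the little-endian layout only. The function names are hypothetical and the registers are modelled as plain byte lists.

```python
def uzp1(vn, vm):
    """Model of uzp1: even-indexed lanes of vn followed by even-indexed lanes of vm."""
    return vn[0::2] + vm[0::2]


def trunc_v16i16_to_v16i8(values):
    """Truncate sixteen 16-bit values to bytes (little-endian memory layout)."""
    raw = b"".join(v.to_bytes(2, "little") for v in values)
    # The 32 source bytes fill two 16-byte registers, as loaded by ldp.
    first, second = list(raw[:16]), list(raw[16:])
    # In little-endian layout the even byte lanes are the low bytes of each i16.
    return uzp1(first, second)
```

Since one `uzp1` already covers a 32-byte source, a four-register `tbl` has nothing to save here, which is consistent with this function's checks being unchanged.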

define void @trunc_v8i16_to_v8i8_in_loop(ptr %A, ptr %dst) {		define void @trunc_v8i16_to_v8i8_in_loop(ptr %A, ptr %dst) {
; CHECK-LABEL: trunc_v8i16_to_v8i8_in_loop:		; CHECK-LABEL: trunc_v8i16_to_v8i8_in_loop:
; CHECK: ; %bb.0: ; %entry		; CHECK: ; %bb.0: ; %entry
; CHECK-NEXT: mov x8, xzr		; CHECK-NEXT: mov x8, xzr
; CHECK-NEXT: LBB8_1: ; %loop		; CHECK-NEXT: LBB8_1: ; %loop
; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1		; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
; CHECK-NEXT: ldr q0, [x0, x8, lsl #4]		; CHECK-NEXT: ldr q0, [x0, x8, lsl #4]
; CHECK-NEXT: xtn.8b v0, v0		; CHECK-NEXT: xtn.8b v0, v0
; CHECK-NEXT: str d0, [x1, x8, lsl #3]		; CHECK-NEXT: str d0, [x1, x8, lsl #3]
; CHECK-NEXT: add x8, x8, #1		; CHECK-NEXT: add x8, x8, #1
; CHECK-NEXT: cmp x8, #1000		; CHECK-NEXT: cmp x8, #1000
; CHECK-NEXT: b.eq LBB8_1		; CHECK-NEXT: b.eq LBB8_1
; CHECK-NEXT: ; %bb.2: ; %exit		; CHECK-NEXT: ; %bb.2: ; %exit
; CHECK-NEXT: ret		; CHECK-NEXT: ret


; CHECK-BE-LABEL: trunc_v8i16_to_v8i8_in_loop:		; CHECK-BE-LABEL: trunc_v8i16_to_v8i8_in_loop:
; CHECK-BE: // %bb.0: // %entry		; CHECK-BE: // %bb.0: // %entry
; CHECK-BE-NEXT: mov x8, xzr		; CHECK-BE-NEXT: mov x8, xzr
; CHECK-BE-NEXT: .LBB8_1: // %loop		; CHECK-BE-NEXT: .LBB8_1: // %loop
; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1		; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-BE-NEXT: add x9, x0, x8, lsl #4		; CHECK-BE-NEXT: add x9, x0, x8, lsl #4
; CHECK-BE-NEXT: ld1 { v0.8h }, [x9]		; CHECK-BE-NEXT: ld1 { v0.8h }, [x9]
; CHECK-BE-NEXT: add x9, x1, x8, lsl #3		; CHECK-BE-NEXT: add x9, x1, x8, lsl #3
; CHECK-BE-NEXT: add x8, x8, #1		; CHECK-BE-NEXT: add x8, x8, #1
; CHECK-BE-NEXT: cmp x8, #1000		; CHECK-BE-NEXT: cmp x8, #1000
; CHECK-BE-NEXT: xtn v0.8b, v0.8h		; CHECK-BE-NEXT: xtn v0.8b, v0.8h
; CHECK-BE-NEXT: st1 { v0.8b }, [x9]		; CHECK-BE-NEXT: st1 { v0.8b }, [x9]
; CHECK-BE-NEXT: b.eq .LBB8_1		; CHECK-BE-NEXT: b.eq .LBB8_1
; CHECK-BE-NEXT: // %bb.2: // %exit		; CHECK-BE-NEXT: // %bb.2: // %exit
; CHECK-BE-NEXT: ret		; CHECK-BE-NEXT: ret


entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
%gep.A = getelementptr inbounds <8 x i16>, ptr %A, i64 %iv		%gep.A = getelementptr inbounds <8 x i16>, ptr %A, i64 %iv
%l.A = load <8 x i16>, ptr %gep.A		%l.A = load <8 x i16>, ptr %gep.A
%trunc = trunc <8 x i16> %l.A to <8 x i8>		%trunc = trunc <8 x i16> %l.A to <8 x i8>
[... 10 lines not shown ...]

Revision Contents

Diff 483181

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

llvm/test/CodeGen/AArch64/trunc-to-tbl.ll
