This is an archive of the discontinued LLVM Phabricator instance.

Differential D135229

[AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to use tbl instructions
Closed, Public

Authored by nilanjana_basu on Oct 4 2022, 5:12 PM.

Details

Reviewers

fhahn
t.p.northover
paquette

Commits

rG02d09ffc1b09: [AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to…

Summary

[AArch64] D133495 added lowering of trunc instructions to 'tbl' for (8|16) x i32 -> (8|16) x i8 conversions. This patch extends that lowering to (8|16) x i64 -> (8|16) x i8 conversions.

A runtime microbenchmark covering all of these cases has been added in D136274.
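
To make the summary concrete, here is a minimal standalone C++ sketch of how a byte table lookup can perform the truncate. It does not use LLVM. The names `buildTruncMask`, `tblLookup` and `truncViaTbl` are hypothetical and chosen for illustration only. The sketch assumes that an out-of-range index such as 255 yields a zero byte, and that a single lookup covers at most 64 table bytes (four 128-bit registers), so it models 8 x i64 but not 16 x i64, which would need more than one lookup. The index pattern mirrors the `.byte` constants checked in trunc-to-tbl.ll.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Build the 16-byte index vector for truncating NumElements lanes of
// SrcBits-wide integers to i8. Unused trailing entries are 255 (out of range).
std::array<uint8_t, 16> buildTruncMask(unsigned NumElements, unsigned SrcBits,
                                       bool IsLittleEndian) {
  assert(SrcBits % 8 == 0 && "source width must be a whole number of bytes");
  unsigned TruncFactor = SrcBits / 8; // bytes per source lane
  assert(NumElements <= 16 && "at most 16 result bytes");
  assert(NumElements * TruncFactor <= 64 && "sketch models one 4-register lookup");
  std::array<uint8_t, 16> Mask;
  for (unsigned I = 0; I < 16; ++I) {
    if (I < NumElements) {
      // Pick the byte holding the low 8 bits of lane I.
      unsigned Idx = IsLittleEndian ? I * TruncFactor
                                    : I * TruncFactor + (TruncFactor - 1);
      Mask[I] = static_cast<uint8_t>(Idx);
    } else {
      Mask[I] = 255;
    }
  }
  return Mask;
}

// Model of a byte table lookup: each result byte is Table[index], or 0 when
// the index lies outside the table.
std::array<uint8_t, 16> tblLookup(const std::vector<uint8_t> &Table,
                                  const std::array<uint8_t, 16> &Mask) {
  std::array<uint8_t, 16> Result{};
  for (std::size_t I = 0; I < 16; ++I)
    Result[I] = Mask[I] < Table.size() ? Table[Mask[I]] : 0;
  return Result;
}

// Lay out i64 lanes as bytes in the requested byte order, then truncate each
// lane to i8 with a single modelled lookup.
std::array<uint8_t, 16> truncViaTbl(const std::vector<uint64_t> &Lanes,
                                    bool IsLittleEndian) {
  std::vector<uint8_t> Table;
  for (uint64_t Lane : Lanes)
    for (unsigned B = 0; B < 8; ++B) {
      unsigned Shift = IsLittleEndian ? 8 * B : 8 * (7 - B);
      Table.push_back(static_cast<uint8_t>(Lane >> Shift));
    }
  return tblLookup(
      Table, buildTruncMask(static_cast<unsigned>(Lanes.size()), 64, IsLittleEndian));
}
```

Under these assumptions, both byte orders give the same truncated lanes; only the index vector differs (0, 8, 16, ... for little-endian and 7, 15, 23, ... for big-endian).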

Diff Detail

Event Timeline

nilanjana_basu created this revision. Oct 4 2022, 5:12 PM

Herald added a project: Restricted Project. Oct 4 2022, 5:12 PM

Herald added a subscriber: hiraditya.

nilanjana_basu requested review of this revision. Oct 4 2022, 5:12 PM

Herald added a project: Restricted Project. Oct 4 2022, 5:12 PM

Herald added a subscriber: llvm-commits.

nilanjana_basu set the repository for this revision to rG LLVM Github Monorepo. Oct 4 2022, 5:47 PM

nilanjana_basu retitled this revision from Extending trunc lowering of 'trunc <8 x i64> %x to <8 x i8>' to use tbl.4 instruction to [AArch64] Extending trunc lowering of 'trunc <8 x i64> %x to <8 x i8>' to use tbl.4 instruction. Oct 4 2022, 6:01 PM

Herald added a subscriber: kristof.beyls. Oct 4 2022, 6:01 PM

Ran git-clang-format & made minor change to reduce LoC.

nilanjana_basu added reviewers: fhahn, t.p.northover. Oct 4 2022, 6:05 PM

Removed an unused variable warning

Harbormaster completed remote builds in B190384: Diff 465246. Oct 4 2022, 7:56 PM

Extended the trunc lowering for other types like 16xi64, 16xi16, 8xi16

Harbormaster completed remote builds in B193763: Diff 469926. Oct 22 2022, 1:19 PM

The automated build tests failed for the previous patch because it was based on a previous commit for a unit test that isn't submitted yet. This patch fixes it by squashing the previous commit, removing the dependency & showing the final update.

Harbormaster completed remote builds in B193990: Diff 470224. Oct 24 2022, 1:00 PM

Ran clang-format since it was failing in the build report at https://buildkite.com/llvm-project/diff-checks/builds/133184

Harbormaster completed remote builds in B194302: Diff 470661. Oct 25 2022, 6:32 PM

nilanjana_basu added a reviewer: paquette. Oct 28 2022, 4:41 PM

nilanjana_basu edited the summary of this revision. Oct 28 2022, 4:57 PM

nilanjana_basu mentioned this in D137221: [MicroBenchmarks] Add benchmarks to check runtime of truncate or zero-extend vector operations in AArch64. Nov 1 2022, 6:31 PM

t.p.northover added inline comments. Nov 2 2022, 8:08 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

13405–13406

I think these are guaranteed to succeed by checks in the caller (and essential here), so cast<...> is probably better. Applies to some of the later dyn_casts too.

13453–13463

There's a lot of duplication in this switch, but it is pretty easy to eyeball for correctness because of that once you get what it's trying to do. So I'm torn, a loop like this would probably be shorter overall:

int ShuffleCount = 128/SrcElemSize;
SmallVector<int> ShuffleLanes;
for (int i = 0; i < ShuffleCount; ++i)
  ShuffleLanes.push_back(i);

SmallVector<Value *> Results;
while (ShuffleLanes.back() < NumElements) {
  Parts.push_back(Builder.CreateBitCast(Builder.CreateShuffleVector(TI->getOperand(0), ShuffleLanes), VecTy));
  for (int i = 0; i < ShuffleCount; ++i)
    ShuffleLanes[i] += ShuffleCount;
  if (Parts.size() == 4) {
    // Call tbl4, push result into Results, clear Parts.
  }
}

// Choose correct tbl (3 now a valid option) and call for rest of Parts, push to Results

// Shuffle-merge all of Results.

and allow the code to apply to a wider range of truncates. What are your views on the implementation?
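
The grouping idea in the loop suggested above can be sketched without LLVM as a small planning function. This is only an illustration under stated assumptions: `planTblCalls` is a hypothetical name, each part stands for one 128-bit slice of the source vector, and the returned list gives the number of table registers each tbl call would take (4 for a full tbl4 group, fewer for the remainder). It does not build any IR or merge results.

```cpp
#include <cassert>
#include <vector>

// Returns, in order, the number of 128-bit table registers consumed by each
// tbl call needed to cover NumElements lanes of SrcElemSize bits each.
std::vector<unsigned> planTblCalls(unsigned NumElements, unsigned SrcElemSize) {
  assert(SrcElemSize != 0 && 128 % SrcElemSize == 0 &&
         "element size must divide a 128-bit register");
  unsigned ShuffleCount = 128 / SrcElemSize; // lanes per 128-bit part
  assert(NumElements % ShuffleCount == 0 &&
         "sketch assumes the source splits into whole 128-bit parts");
  unsigned NumParts = NumElements / ShuffleCount;

  std::vector<unsigned> Calls;
  unsigned Pending = 0; // parts collected for the next call
  for (unsigned Part = 0; Part < NumParts; ++Part) {
    if (++Pending == 4) { // a full group: one tbl4, then start over
      Calls.push_back(4);
      Pending = 0;
    }
  }
  if (Pending != 0) // remainder: tbl1, tbl2 or tbl3
    Calls.push_back(Pending);
  return Calls;
}
```

Under these assumptions, 8 x i32 maps to one two-register lookup, 12 x i32 to one three-register lookup, and 16 x i64 to two four-register lookups whose results would still need to be merged.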

nilanjana_basu mentioned this in rT3b44b6bdd3e8: [MicroBenchmarks] Add benchmarks to check runtime of truncate or zero-extend…. Nov 2 2022, 2:05 PM

nilanjana_basu added a parent revision: D137293: [AArch64] Extra unit tests for trunc lowering of vectors. Nov 2 2022, 3:01 PM

Addressed comments by t.p.northover - refactored code to remove redundancy

nilanjana_basu marked an inline comment as done. Nov 3 2022, 7:39 PM

nilanjana_basu added inline comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13453–13463	I refactored the code as you suggested, which can now apply to a few extra cases like 12xi32 or 4xi32. However, I haven't modified the old set of allowable cases since I don't know how relevant these few are. In my understanding, we get better performance when tbl2-tbl4 get triggered, as the number of generated instructions decreases. So, I need your opinion on whether we should allow 8xi16 conversions, since they generate a single tbl1 instruction?

Ran clang-format

Harbormaster completed remote builds in B196052: Diff 473110. Nov 3 2022, 8:44 PM

nilanjana_basu removed a parent revision: D137293: [AArch64] Extra unit tests for trunc lowering of vectors. Nov 4 2022, 5:22 PM

nilanjana_basu mentioned this in D138059: [MicroBenchmarks,AArch64] Added correctness test & other performance tests for truncate or zero-extend vector operations. Nov 15 2022, 1:13 PM

Added comments

Harbormaster completed remote builds in B198876: Diff 477022. Nov 22 2022, 12:18 AM

fhahn added inline comments. Nov 22 2022, 3:57 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13405	Is this guaranteed to be a fixed vector type? Could you add a variant of a test with truncates of scalable vectors (`<vscale x 16 x i8>` or something like that)?
13451–13452	It would be great if you could add a brief comment here explaining what kind of masks/shuffles are prepared here.
13475	"store" seems ambiguous here, as we won't emit a store instruction, right?
13495	SmallVector?
13500	SmallVector?
llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
676 ↗	(On Diff #477022)	Similar to D136722, it is likely not profitable to do this when converting to/from the next power-of-2.

fhahn added inline comments. Nov 22 2022, 10:44 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13405	I think it should be fine, I added a test in 4783345426da

Updated comments as mentioned in the reviews. Rebased on tests for this change prior to applying this patch.

Harbormaster completed remote builds in B199088: Diff 477347. Nov 22 2022, 5:00 PM

nilanjana_basu added a parent revision: D137293: [AArch64] Extra unit tests for trunc lowering of vectors. Nov 22 2022, 5:02 PM

Rebasing on commit of test cases prior to application of this patch

Removed case for 'trunc <(8|16)xi16> %x to <(8|16)xi8>' since it was adding more instructions to loop header, while not improving loop instruction count

Updated a comment

nilanjana_basu retitled this revision from [AArch64] Extending lowering of 'trunc <(8|16) x (i16|i64)> %x to <(8|16) x i8>' to use tbl instructions to [AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to use tbl instructions. Nov 23 2022, 2:38 AM

nilanjana_basu edited the summary of this revision.

Removed the (8|16)xi16 to (8|16)xi8 conversion because it showed no benefit in instruction count and added more instructions to the loop header. Updated comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13405	Since the source & destination types were checked for FixedVector once in the calling function optimizeExtendOrTruncateConversion(), I didn't check it here again.
13405	Since both the zext & trunc tests for scalable vectors go through the optimizeExtendOrTruncateConversion() function, the zext test in 4783345426da should suffice for the trunc too. Let me know if you think it needs to be replicated for trunc vectors too.
13475	I replaced the "store" with "save" to indicate that it is being stored in the compiler's internal vector data structure. Added a comment at the place of combining these results.
llvm/test/CodeGen/AArch64/aarch64-matrix-umull-smull.ll
676 ↗	(On Diff #477022)	Since we only support a handful of vector type truncates in this implementation, only Yxi16->Yxi8 was a supported next power-of-2 conversion. Removed it.
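
As a rough summary of which truncates remain candidates after the i16 case was dropped, going only by the revision title and summary: the destination element type is i8, and the source has 8 or 16 lanes of i32 or i64. The sketch below states that as a plain predicate. `isTblTruncCandidate` is a hypothetical name, not a function in the patch; the actual check is made on LLVM types inside optimizeExtendOrTruncateConversion() and also depends on loop and size-optimization conditions that are not modelled here.

```cpp
// Hypothetical predicate over plain integers; an assumption-based summary of
// the supported shapes, not the code from the patch.
bool isTblTruncCandidate(unsigned SrcElemBits, unsigned NumElements,
                         unsigned DstElemBits) {
  bool LaneCountOk = NumElements == 8 || NumElements == 16;
  bool SrcOk = SrcElemBits == 32 || SrcElemBits == 64; // i16 sources were removed
  return DstElemBits == 8 && LaneCountOk && SrcOk;
}
```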

Harbormaster completed remote builds in B199151: Diff 477432. Nov 23 2022, 3:32 AM

Rebasing on parent patch for tests

Harbormaster completed remote builds in B199602: Diff 478030. Nov 25 2022, 3:08 PM

nilanjana_basu edited the summary of this revision. Nov 25 2022, 3:54 PM

Trying to fix rebasing error

Harbormaster completed remote builds in B199605: Diff 478033. Nov 25 2022, 4:35 PM

nilanjana_basu removed a parent revision: D137293: [AArch64] Extra unit tests for trunc lowering of vectors. Nov 28 2022, 4:17 AM

nilanjana_basu mentioned this in rT08de51078b0a: [MicroBenchmarks,AArch64] Added correctness test & other performance tests for…. Dec 1 2022, 10:09 PM

LGTM with the inline suggestions. Please wait a day or so with committing in case there are additional comments.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
13414	could you add an assert to make sure the division happens without remainder?
13436	Could use `Builder.getInt8(....)`?
13443	IIUC the only case that can happen here is that `Parts == 4`, right? Might be good to update the check.

This revision is now accepted and ready to land. Dec 14 2022, 1:47 PM

Closed by commit rG02d09ffc1b09: [AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to… (authored by nilanjana_basu). Dec 15 2022, 7:21 AM

This revision was automatically updated to reflect the committed changes.

nilanjana_basu added a commit: rG02d09ffc1b09: [AArch64] Extending lowering of 'trunc <(8|16) x i64> %x to <(8|16) x i8>' to….

nilanjana_basu mentioned this in rG795868285db9: [AArch64] Minor changes and sanity checks in relation to https://reviews.llvm.. Dec 15 2022, 12:09 PM

Addressed the final comments in a separate commit.

Revision Contents

Path

Size

llvm/

lib/

Target/

AArch64/

AArch64ISelLowering.cpp

81 lines

test/

CodeGen/

AArch64/

trunc-to-tbl.ll

113 lines

Diff 465244

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp

static void createTblShuffleForZExt(ZExtInst *ZExt, bool IsLittleEndian) {
Result = Builder.CreateBitCast(Result, DstTy);		Result = Builder.CreateBitCast(Result, DstTy);
ZExt->replaceAllUsesWith(Result);		ZExt->replaceAllUsesWith(Result);
ZExt->eraseFromParent();		ZExt->eraseFromParent();
}		}

static void createTblForTrunc(TruncInst *TI, bool IsLittleEndian) {		static void createTblForTrunc(TruncInst *TI, bool IsLittleEndian) {
IRBuilder<> Builder(TI);		IRBuilder<> Builder(TI);
SmallVector<Value *> Parts;		SmallVector<Value *> Parts;
		unsigned NumElements = cast<FixedVectorType>(TI->getType())->getNumElements();
		auto *SrcTy = dyn_cast<FixedVectorType>(TI->getOperand(0)->getType());
		auto *DstTy = dyn_cast<FixedVectorType>(TI->getType());
		assert(SrcTy->getElementType()->isIntegerTy() &&
		"Non-integer type source vector element is not supported");
		assert(DstTy->getElementType()->isIntegerTy(8) &&
		"Unsupported destination vector element type");
		unsigned SrcElemTySz =
		dyn_cast<IntegerType>(SrcTy->getElementType())->getBitWidth();
		unsigned TruncFactor =
		SrcElemTySz /
		dyn_cast<IntegerType>(DstTy->getElementType())->getBitWidth();
		assert((SrcElemTySz == 32 || SrcElemTySz == 64) &&
		"Unsupported source vector element type size");
Type *VecTy = FixedVectorType::get(Builder.getInt8Ty(), 16);		Type *VecTy = FixedVectorType::get(Builder.getInt8Ty(), 16);
		Intrinsic::ID TblID = Intrinsic::aarch64_neon_tbl2;

		if (SrcElemTySz == 64 || (SrcElemTySz == 32 && NumElements == 16))
		TblID = Intrinsic::aarch64_neon_tbl4;

		switch (SrcElemTySz) {
		case 32:
Parts.push_back(Builder.CreateBitCast(		Parts.push_back(Builder.CreateBitCast(
Builder.CreateShuffleVector(TI->getOperand(0), {0, 1, 2, 3}), VecTy));		Builder.CreateShuffleVector(TI->getOperand(0), {0, 1, 2, 3}), VecTy));
Parts.push_back(Builder.CreateBitCast(		Parts.push_back(Builder.CreateBitCast(
Builder.CreateShuffleVector(TI->getOperand(0), {4, 5, 6, 7}), VecTy));		Builder.CreateShuffleVector(TI->getOperand(0), {4, 5, 6, 7}), VecTy));

Intrinsic::ID TblID = Intrinsic::aarch64_neon_tbl2;
unsigned NumElements = cast<FixedVectorType>(TI->getType())->getNumElements();
if (NumElements == 16) {		if (NumElements == 16) {
		TblID = Intrinsic::aarch64_neon_tbl4;
Parts.push_back(Builder.CreateBitCast(		Parts.push_back(Builder.CreateBitCast(
Builder.CreateShuffleVector(TI->getOperand(0), {8, 9, 10, 11}), VecTy));		Builder.CreateShuffleVector(TI->getOperand(0), {8, 9, 10, 11}),
		VecTy));
Parts.push_back(Builder.CreateBitCast(		Parts.push_back(Builder.CreateBitCast(
Builder.CreateShuffleVector(TI->getOperand(0), {12, 13, 14, 15}),		Builder.CreateShuffleVector(TI->getOperand(0), {12, 13, 14, 15}),
VecTy));		VecTy));
		}
		break;
		case 64:
TblID = Intrinsic::aarch64_neon_tbl4;		TblID = Intrinsic::aarch64_neon_tbl4;
		Parts.push_back(Builder.CreateBitCast(
		Builder.CreateShuffleVector(TI->getOperand(0), {0, 1}), VecTy));
		Parts.push_back(Builder.CreateBitCast(
		Builder.CreateShuffleVector(TI->getOperand(0), {2, 3}), VecTy));
		Parts.push_back(Builder.CreateBitCast(
		Builder.CreateShuffleVector(TI->getOperand(0), {4, 5}), VecTy));
		Parts.push_back(Builder.CreateBitCast(
		Builder.CreateShuffleVector(TI->getOperand(0), {6, 7}), VecTy));
		break;
}		}
SmallVector<Constant *, 16> MaskConst;
for (unsigned Idx = 0; Idx < NumElements * 4; Idx += 4)
MaskConst.push_back(
ConstantInt::get(Builder.getInt8Ty(), IsLittleEndian ? Idx : Idx + 3));

for (unsigned Idx = NumElements * 4; Idx < 64; Idx += 4)		SmallVector<Constant *, 16> MaskConst;
		unsigned Idx = 0;
		for (unsigned Itr = 0; Itr < 16; Itr++) {
		if (Itr < NumElements)
		MaskConst.push_back(ConstantInt::get(
		Builder.getInt8Ty(), IsLittleEndian
		? Itr * TruncFactor
		: Itr * TruncFactor + (TruncFactor - 1)));
		else
MaskConst.push_back(ConstantInt::get(Builder.getInt8Ty(), 255));		MaskConst.push_back(ConstantInt::get(Builder.getInt8Ty(), 255));
		}

Parts.push_back(ConstantVector::get(MaskConst));		Parts.push_back(ConstantVector::get(MaskConst));
auto *F =		auto *F =
Intrinsic::getDeclaration(TI->getModule(), TblID, Parts[0]->getType());		Intrinsic::getDeclaration(TI->getModule(), TblID, Parts[0]->getType());
Value *Res = Builder.CreateCall(F, Parts);		Value *Res = Builder.CreateCall(F, Parts);

if (NumElements == 8)		if (NumElements == 8)
Res = Builder.CreateShuffleVector(Res, {0, 1, 2, 3, 4, 5, 6, 7});		Res = Builder.CreateShuffleVector(Res, {0, 1, 2, 3, 4, 5, 6, 7});
TI->replaceAllUsesWith(Res);		TI->replaceAllUsesWith(Res);
TI->eraseFromParent();		TI->eraseFromParent();
}		}

bool AArch64TargetLowering::optimizeExtendOrTruncateConversion(Instruction *I,		bool AArch64TargetLowering::optimizeExtendOrTruncateConversion(Instruction *I,
Loop *L) const {		Loop *L) const {
// Try to optimize conversions using tbl. This requires materializing constant		// Try to optimize conversions using tbl. This requires materializing constant
// index vectors, which can increase code size and add loads. Skip the		// index vectors, which can increase code size and add loads. Skip the
// transform unless the conversion is in a loop block guaranteed to execute		// transform unless the conversion is in a loop block guaranteed to execute
// and we are not optimizing for size.		// and we are not optimizing for size.
Function *F = I->getParent()->getParent();		Function *F = I->getParent()->getParent();
if (!L || L->getHeader() != I->getParent() || F->hasMinSize() ||		if (!L || L->getHeader() != I->getParent() || F->hasMinSize() ||
F->hasOptSize())		F->hasOptSize())
return false;		return false;

auto *SrcTy = dyn_cast<FixedVectorType>(I->getOperand(0)->getType());		auto *SrcTy = dyn_cast<FixedVectorType>(I->getOperand(0)->getType());
auto *DstTy = dyn_cast<FixedVectorType>(I->getType());		auto *DstTy = dyn_cast<FixedVectorType>(I->getType());
if (!SrcTy || !DstTy)		if (!SrcTy || !DstTy)
return false;		return false;

// Convert 'zext <(8|16) x i8> %x to <(8|16) x i32>' to a shuffle that can be		// Convert 'zext <(8|16) x i8> %x to <(8|16) x i32>' to a shuffle that can be
// lowered to either 2 or 4 tbl instructions to insert the original i8		// lowered to either 2 or 4 tbl instructions to insert the original i8
// elements into i32 lanes.		// elements into i32 lanes.
auto *ZExt = dyn_cast<ZExtInst>(I);		auto *ZExt = dyn_cast<ZExtInst>(I);
if (ZExt && (SrcTy->getNumElements() == 8 || SrcTy->getNumElements() == 16) &&		if (ZExt && (SrcTy->getNumElements() == 8 || SrcTy->getNumElements() == 16) &&
SrcTy->getElementType()->isIntegerTy(8) &&		SrcTy->getElementType()->isIntegerTy(8) &&
DstTy->getElementType()->isIntegerTy(32)) {		DstTy->getElementType()->isIntegerTy(32)) {
createTblShuffleForZExt(ZExt, Subtarget->isLittleEndian());		createTblShuffleForZExt(ZExt, Subtarget->isLittleEndian());
return true;		return true;
}		}

auto *UIToFP = dyn_cast<UIToFPInst>(I);		auto *UIToFP = dyn_cast<UIToFPInst>(I);
if (UIToFP &&		if (UIToFP &&
(SrcTy->getNumElements() == 8 || SrcTy->getNumElements() == 16) &&		(SrcTy->getNumElements() == 8 || SrcTy->getNumElements() == 16) &&
SrcTy->getElementType()->isIntegerTy(8) &&		SrcTy->getElementType()->isIntegerTy(8) &&
DstTy->getElementType()->isFloatTy()) {		DstTy->getElementType()->isFloatTy()) {
IRBuilder<> Builder(I);		IRBuilder<> Builder(I);
[18 lines not shown]
auto *WideConv = Builder.CreateFPToUI(FPToUI->getOperand(0),
VectorType::getInteger(SrcTy));		VectorType::getInteger(SrcTy));
auto *TruncI = Builder.CreateTrunc(WideConv, DstTy);		auto *TruncI = Builder.CreateTrunc(WideConv, DstTy);
I->replaceAllUsesWith(TruncI);		I->replaceAllUsesWith(TruncI);
I->eraseFromParent();		I->eraseFromParent();
createTblForTrunc(cast<TruncInst>(TruncI), Subtarget->isLittleEndian());		createTblForTrunc(cast<TruncInst>(TruncI), Subtarget->isLittleEndian());
return true;		return true;
}		}

// Convert 'trunc <(8|16) x i32> %x to <(8|16) x i8>' to a single tbl.4		// Convert 'trunc <(8|16) x i32> %x to <(8|16) x i8>'
		// or 'trunc <8 x i64> %x to <8 x i8> to a single tbl.4
// instruction selecting the lowest 8 bits per lane of the input interpreted		// instruction selecting the lowest 8 bits per lane of the input interpreted
// as 2 or 4 <4 x i32> vectors.		// as 2 or 4 <4 x i32> vectors.
auto *TI = dyn_cast<TruncInst>(I);		auto *TI = dyn_cast<TruncInst>(I);
if (TI && (SrcTy->getNumElements() == 8 || SrcTy->getNumElements() == 16) &&
SrcTy->getElementType()->isIntegerTy(32) &&		if (TI && DstTy->getElementType()->isIntegerTy(8) &&
DstTy->getElementType()->isIntegerTy(8)) {		((SrcTy->getElementType()->isIntegerTy(32) &&
		(SrcTy->getNumElements() == 8 || SrcTy->getNumElements() == 16)) ||
		(SrcTy->getElementType()->isIntegerTy(64) &&
		SrcTy->getNumElements() == 8))) {
createTblForTrunc(TI, Subtarget->isLittleEndian());		createTblForTrunc(TI, Subtarget->isLittleEndian());
return true;		return true;
}		}

return false;		return false;
}		}

bool AArch64TargetLowering::hasPairedLoad(EVT LoadedType,		bool AArch64TargetLowering::hasPairedLoad(EVT LoadedType,

llvm/test/CodeGen/AArch64/trunc-to-tbl.ll

loop:
%iv.next = add i64 %iv, 1		%iv.next = add i64 %iv, 1
%ec = icmp eq i64 %iv.next, 1000		%ec = icmp eq i64 %iv.next, 1000
br i1 %ec, label %loop, label %exit		br i1 %ec, label %loop, label %exit

exit:		exit:
ret void		ret void
}		}

		; CHECK-LABEL: lCPI4_0:
		; CHECK-NEXT: .byte 0 ; 0x0
		; CHECK-NEXT: .byte 8 ; 0x8
		; CHECK-NEXT: .byte 16 ; 0x10
		; CHECK-NEXT: .byte 24 ; 0x18
		; CHECK-NEXT: .byte 32 ; 0x20
		; CHECK-NEXT: .byte 40 ; 0x28
		; CHECK-NEXT: .byte 48 ; 0x30
		; CHECK-NEXT: .byte 56 ; 0x38
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff
		; CHECK-NEXT: .byte 255 ; 0xff

		; CHECK-BE-LABEL: .LCPI4_0:
		; CHECK-BE-NEXT: .byte 7 // 0x7
		; CHECK-BE-NEXT: .byte 15 // 0xf
		; CHECK-BE-NEXT: .byte 23 // 0x17
		; CHECK-BE-NEXT: .byte 31 // 0x1f
		; CHECK-BE-NEXT: .byte 39 // 0x27
		; CHECK-BE-NEXT: .byte 47 // 0x2f
		; CHECK-BE-NEXT: .byte 55 // 0x37
		; CHECK-BE-NEXT: .byte 63 // 0x3f
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
		; CHECK-BE-NEXT: .byte 255 // 0xff
define void @trunc_v8i64_to_v8i8_in_loop(ptr %A, ptr %dst) {		define void @trunc_v8i64_to_v8i8_in_loop(ptr %A, ptr %dst) {
; CHECK-LABEL: trunc_v8i64_to_v8i8_in_loop:		; CHECK-LABEL: trunc_v8i64_to_v8i8_in_loop:
; CHECK: ; %bb.0: ; %entry		; CHECK: ; %bb.0: ; %entry
		; CHECK-NEXT: Lloh4:
		; CHECK-NEXT: adrp x9, lCPI4_0@PAGE
; CHECK-NEXT: mov x8, xzr		; CHECK-NEXT: mov x8, xzr
		; CHECK-NEXT: Lloh5:
		; CHECK-NEXT: ldr q0, [x9, lCPI4_0@PAGEOFF]
; CHECK-NEXT: LBB4_1: ; %loop		; CHECK-NEXT: LBB4_1: ; %loop
; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1		; CHECK-NEXT: ; =>This Inner Loop Header: Depth=1
; CHECK-NEXT: add x9, x0, x8, lsl #6		; CHECK-NEXT: add x9, x0, x8, lsl #6
; CHECK-NEXT: ldp q1, q0, [x9, #32]		; CHECK-NEXT: ldp q1, q2, [x9]
; CHECK-NEXT: ldp q3, q2, [x9]		; CHECK-NEXT: ldp q3, q4, [x9, #32]
; CHECK-NEXT: uzp1.4s v0, v1, v0		; CHECK-NEXT: tbl.16b v1, { v1, v2, v3, v4 }, v0
; CHECK-NEXT: uzp1.4s v1, v3, v2		; CHECK-NEXT: str d1, [x1, x8, lsl #3]
; CHECK-NEXT: uzp1.8h v0, v1, v0
; CHECK-NEXT: xtn.8b v0, v0
; CHECK-NEXT: str d0, [x1, x8, lsl #3]
; CHECK-NEXT: add x8, x8, #1		; CHECK-NEXT: add x8, x8, #1
; CHECK-NEXT: cmp x8, #1000		; CHECK-NEXT: cmp x8, #1000
; CHECK-NEXT: b.eq LBB4_1		; CHECK-NEXT: b.eq LBB4_1
; CHECK-NEXT: ; %bb.2: ; %exit		; CHECK-NEXT: ; %bb.2: ; %exit
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; CHECK-NEXT: .loh AdrpLdr Lloh4, Lloh5

; CHECK-BE-LABEL: trunc_v8i64_to_v8i8_in_loop:		; CHECK-BE-LABEL: trunc_v8i64_to_v8i8_in_loop:
; CHECK-BE: // %bb.0: // %entry		; CHECK-BE: // %bb.0: // %entry
		; CHECK-BE-NEXT: adrp x8, .LCPI4_0
		; CHECK-BE-NEXT: add x8, x8, :lo12:.LCPI4_0
		; CHECK-BE-NEXT: ld1 { v0.16b }, [x8]
; CHECK-BE-NEXT: mov x8, xzr		; CHECK-BE-NEXT: mov x8, xzr
; CHECK-BE-NEXT: .LBB4_1: // %loop		; CHECK-BE-NEXT: .LBB4_1: // %loop
; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1		; CHECK-BE-NEXT: // =>This Inner Loop Header: Depth=1
; CHECK-BE-NEXT: add x9, x0, x8, lsl #6		; CHECK-BE-NEXT: add x9, x0, x8, lsl #6
; CHECK-BE-NEXT: add x10, x9, #48		; CHECK-BE-NEXT: add x10, x9, #16
; CHECK-BE-NEXT: ld1 { v1.2d }, [x9]		; CHECK-BE-NEXT: add x11, x9, #32
; CHECK-BE-NEXT: ld1 { v0.2d }, [x10]		; CHECK-BE-NEXT: ld1 { v1.16b }, [x9]
; CHECK-BE-NEXT: add x10, x9, #32		; CHECK-BE-NEXT: add x9, x9, #48
; CHECK-BE-NEXT: add x9, x9, #16		; CHECK-BE-NEXT: ld1 { v2.16b }, [x10]
; CHECK-BE-NEXT: ld1 { v2.2d }, [x10]		; CHECK-BE-NEXT: ld1 { v3.16b }, [x11]
; CHECK-BE-NEXT: ld1 { v3.2d }, [x9]		; CHECK-BE-NEXT: ld1 { v4.16b }, [x9]
; CHECK-BE-NEXT: add x9, x1, x8, lsl #3		; CHECK-BE-NEXT: add x9, x1, x8, lsl #3
; CHECK-BE-NEXT: add x8, x8, #1		; CHECK-BE-NEXT: add x8, x8, #1
; CHECK-BE-NEXT: cmp x8, #1000		; CHECK-BE-NEXT: cmp x8, #1000
; CHECK-BE-NEXT: uzp1 v0.4s, v2.4s, v0.4s		; CHECK-BE-NEXT: tbl v1.16b, { v1.16b, v2.16b, v3.16b, v4.16b }, v0.16b
; CHECK-BE-NEXT: uzp1 v1.4s, v1.4s, v3.4s		; CHECK-BE-NEXT: st1 { v1.8b }, [x9]
; CHECK-BE-NEXT: uzp1 v0.8h, v1.8h, v0.8h
; CHECK-BE-NEXT: xtn v0.8b, v0.8h
; CHECK-BE-NEXT: st1 { v0.8b }, [x9]
; CHECK-BE-NEXT: b.eq .LBB4_1		; CHECK-BE-NEXT: b.eq .LBB4_1
; CHECK-BE-NEXT: // %bb.2: // %exit		; CHECK-BE-NEXT: // %bb.2: // %exit
; CHECK-BE-NEXT: ret		; CHECK-BE-NEXT: ret

entry:		entry:
br label %loop		br label %loop

loop:		loop:
%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]		%iv = phi i64 [ 0, %entry ], [ %iv.next, %loop ]
%gep.A = getelementptr inbounds <8 x i64>, ptr %A, i64 %iv		%gep.A = getelementptr inbounds <8 x i64>, ptr %A, i64 %iv
%l.A = load <8 x i64>, ptr %gep.A		%l.A = load <8 x i64>, ptr %gep.A