This is an archive of the discontinued LLVM Phabricator instance.

Differential D116801

[PowerPC] Avoid perfect shuffle when mask has multiple uses
Abandoned · Public

Authored by qiucf on Jan 7 2022, 2:13 AM.

Details

Reviewers

nemanjai
shchenz
jsji

Group Reviewers

Restricted Project

Summary

Perfect shuffle lowering (currently enabled only on big-endian targets) may transform a vector shuffle into multiple merge/insert instructions. When the shuffle mask is shared between multiple shuffles, however, it is better to load the mask once and use it in multiple vperm instructions.

An obvious blocker is that the mask is not an operand of the vector_shuffle node in the DAG, so I have to record all masks and check the number of uses of each one.
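
As a rough illustration of the bookkeeping described in the summary, the sketch below counts how many shuffles use each mask. It is not the code from this revision: it uses standard containers instead of LLVM's `DenseMap` and `ArrayRef`, and `Mask`, `countMaskUses`, and `isSharedMask` are hypothetical names introduced only for this example.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical stand-in for a shuffle mask. In the patch the mask comes from
// ShuffleVectorSDNode::getMask() as an ArrayRef<int>.
using Mask = std::vector<int>;

// Count how many shuffles in one "DAG" (modelled here as a plain list of
// masks) use each mask. The keys own their data, so the map does not refer
// to storage that belongs to the nodes.
std::map<Mask, unsigned> countMaskUses(const std::vector<Mask> &Shuffles) {
  std::map<Mask, unsigned> Uses;
  for (const Mask &M : Shuffles)
    ++Uses[M];
  return Uses;
}

// A mask is "shared" when more than one shuffle uses it. In that case a
// single mask load could feed several vperm instructions.
bool isSharedMask(const std::map<Mask, unsigned> &Uses, const Mask &M) {
  auto It = Uses.find(M);
  return It != Uses.end() && It->second > 1;
}
```

The map here is returned to the caller and lives only as long as one query, which is a deliberate simplification compared with the function-local `static` map in the diff.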

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

qiucf created this revision. Jan 7 2022, 2:13 AM

Herald added subscribers: kbarton, hiraditya. Jan 7 2022, 2:13 AM

qiucf requested review of this revision. Jan 7 2022, 2:13 AM

Herald added a project: Restricted Project. Jan 7 2022, 2:13 AM

Herald added a subscriber: llvm-commits.

Harbormaster completed remote builds in B142042: Diff 398085. Jan 7 2022, 2:43 AM

shchenz added inline comments. Jan 9 2022, 6:44 PM

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
10050	Please fix all the Lint warnings.
10063	Can we first check `isLittleEndian` and then `isFourElementShuffle` and then `MaskMap[PermMask].second` to improve the compile time?

qiucf marked an inline comment as done. Jan 9 2022, 9:21 PM

qiucf added inline comments.

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
10063	`isFourElementShuffle` also has to be computed, so when we consider perfect shuffle unprofitable there is no point in computing it. Do you mean we should check endianness first and only calculate `MaskMap` on big endian? That may improve compile time, but only until little-endian perfect shuffles are implemented.

Fix clang-format warnings

Harbormaster completed remote builds in B142353: Diff 398502. Jan 9 2022, 10:04 PM

shchenz added inline comments. Jan 10 2022, 12:31 AM

llvm/lib/Target/PowerPC/PPCISelLowering.cpp
10063	Yes, right. I mean we should first check the expressions that have lower complexity.
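
The ordering discussed in this thread can be sketched as follows. This is only an illustration under assumptions: `shouldTryPerfectShuffle` is a hypothetical helper that does not exist in LLVM, and the expensive mask query is modelled as a callback so that it runs only after the cheap checks have passed.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: evaluate the cheap conditions first and run the
// expensive mask-sharing query only when both of them pass.
bool shouldTryPerfectShuffle(bool IsLittleEndian, bool IsFourElementShuffle,
                             const std::function<bool()> &MaskIsShared) {
  if (IsLittleEndian) // Constant time; perfect shuffle is skipped on LE.
    return false;
  if (!IsFourElementShuffle) // Derived from the 16 mask entries only.
    return false;
  return !MaskIsShared(); // Potentially walks every node in the DAG.
}
```

With this ordering, the DAG-wide query is never evaluated on little-endian targets or for masks that are not four-element shuffles.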

This method isn't sound: many other optimizations run before perfect shuffle and vperm lowering, and the shuffles they lower shouldn't be counted.

qiucf updated this revision to Diff 401508. Jan 19 2022, 9:51 PM

Harbormaster completed remote builds in B144485: Diff 401508. Jan 19 2022, 10:27 PM

qiucf updated this revision to Diff 401527. Jan 20 2022, 12:10 AM

Harbormaster completed remote builds in B144501: Diff 401527. Jan 20 2022, 1:02 AM

This adds a whole lot of computation on the DAG in addition to having thread safety issues and the gain is very small. I am not in favour of something similar to this.

qiucf mentioned this in D120072: [PowerPC] Add option to disable perfect shuffle. Feb 17 2022, 10:01 AM

Abandon this in favor of D121082.

Herald added a project: Restricted Project. Mar 6 2022, 7:15 PM

jsji mentioned this in D121082: [PowerPC] Disable perfect shuffle by default. Mar 13 2022, 8:00 PM

Revision Contents

Path	Size
llvm/lib/Target/PowerPC/PPCISelLowering.cpp	120 lines
llvm/test/CodeGen/PowerPC/perfect-shuffle.ll	11 lines

Diff 398085

llvm/lib/Target/PowerPC/PPCISelLowering.cpp

	PPC::isVMRGEOShuffleMask(SVOp, true, ShuffleKind, DAG) ||			PPC::isVMRGEOShuffleMask(SVOp, true, ShuffleKind, DAG) ||
	PPC::isVMRGEOShuffleMask(SVOp, false, ShuffleKind, DAG))))			PPC::isVMRGEOShuffleMask(SVOp, false, ShuffleKind, DAG))))
	return Op;			return Op;

	// Check to see if this is a shuffle of 4-byte values. If so, we can use our			// Check to see if this is a shuffle of 4-byte values. If so, we can use our
	// perfect shuffle table to emit an optimal matching sequence.			// perfect shuffle table to emit an optimal matching sequence.
	ArrayRef<int> PermMask = SVOp->getMask();			ArrayRef<int> PermMask = SVOp->getMask();

	unsigned PFIndexes[4];			// Record all shuffle masks in the DAG. The first integer in value part is
	bool isFourElementShuffle = true;			// occurance of the mask, and the second means whether it occured multiple
	for (unsigned i = 0; i != 4 && isFourElementShuffle; ++i) { // Element number			// times, because the integer will decrement and all entries should be removed
	unsigned EltNo = 8; // Start out undef.			// after visiting current DAG.
	for (unsigned j = 0; j != 4; ++j) { // Intra-element byte.			static DenseMap<ArrayRef<int>, std::pair<int, bool>> MaskMap;
	if (PermMask[i*4+j] < 0)
	continue; // Undef, ignore it.			if (MaskMap.find(PermMask) == MaskMap.end()) {
				for (const SDNode &Node: DAG.allnodes()) {
	unsigned ByteSource = PermMask[i*4+j];			if (const auto *Shuffle = dyn_cast<ShuffleVectorSDNode>(&Node)) {
	if ((ByteSource & 3) != j) {			ArrayRef<int> TheMask = Shuffle->getMask();
	isFourElementShuffle = false;			if (MaskMap.find(TheMask) == MaskMap.end())
	break;			MaskMap.insert(std::make_pair(TheMask, std::make_pair(0, false)));
				if (++MaskMap[TheMask].first > 1)
				MaskMap[TheMask].second = true;
	}			}
				}
				}

	if (EltNo == 8) {			// When there're multiple shuffles sharing the same mask, we think it's
	EltNo = ByteSource/4;			// cheaper to use load+vperm.
	} else if (EltNo != ByteSource/4) {			if (!MaskMap[PermMask].second) {
	isFourElementShuffle = false;			unsigned PFIndexes[4];
	break;			bool isFourElementShuffle = true;
				for (unsigned i = 0; i != 4 && isFourElementShuffle; ++i) { // Element number
				unsigned EltNo = 8; // Start out undef.
				for (unsigned j = 0; j != 4; ++j) { // Intra-element byte.
				if (PermMask[i*4+j] < 0)
				continue; // Undef, ignore it.

				unsigned ByteSource = PermMask[i*4+j];
				if ((ByteSource & 3) != j) {
				isFourElementShuffle = false;
				break;
				}

				if (EltNo == 8) {
				EltNo = ByteSource/4;
				} else if (EltNo != ByteSource/4) {
				isFourElementShuffle = false;
				break;
				}
	}			}
				PFIndexes[i] = EltNo;
				}

				// Remove entry of this mask from map since it's unique in current DAG.
				MaskMap.erase(PermMask);

				// If this shuffle can be expressed as a shuffle of 4-byte elements, use the
				// perfect shuffle vector to determine if it is cost effective to do this as
				// discrete instructions, or whether we should use a vperm.
				// For now, we skip this for little endian until such time as we have a
				// little-endian perfect shuffle table.
				if (isFourElementShuffle && !isLittleEndian) {
				// Compute the index in the perfect shuffle table.
				unsigned PFTableIndex =
				PFIndexes[0]*9*9*9+PFIndexes[1]*9*9+PFIndexes[2]*9+PFIndexes[3];

				unsigned PFEntry = PerfectShuffleTable[PFTableIndex];
				unsigned Cost = (PFEntry >> 30);

				// Determining when to avoid vperm is tricky. Many things affect the cost
				// of vperm, particularly how many times the perm mask needs to be computed.
				// For example, if the perm mask can be hoisted out of a loop or is already
				// used (perhaps because there are multiple permutes with the same shuffle
				// mask?) the vperm has a cost of 1. OTOH, hoisting the permute mask out of
				// the loop requires an extra register.
				//
				// As a compromise, we only emit discrete instructions if the shuffle can be
				// generated in 3 or fewer operations. When we have loop information
				// available, if this block is within a loop, we should avoid using vperm
				// for 3-operation perms and use a constant pool load instead.
				if (Cost < 3)
				return GeneratePerfectShuffle(PFEntry, V1, V2, DAG, dl);
	}			}
	PFIndexes[i] = EltNo;
	}

	// If this shuffle can be expressed as a shuffle of 4-byte elements, use the
	// perfect shuffle vector to determine if it is cost effective to do this as
	// discrete instructions, or whether we should use a vperm.
	// For now, we skip this for little endian until such time as we have a
	// little-endian perfect shuffle table.
	if (isFourElementShuffle && !isLittleEndian) {
	// Compute the index in the perfect shuffle table.
	unsigned PFTableIndex =
	PFIndexes[0]*9*9*9+PFIndexes[1]*9*9+PFIndexes[2]*9+PFIndexes[3];

	unsigned PFEntry = PerfectShuffleTable[PFTableIndex];
	unsigned Cost = (PFEntry >> 30);

	// Determining when to avoid vperm is tricky. Many things affect the cost
	// of vperm, particularly how many times the perm mask needs to be computed.
	// For example, if the perm mask can be hoisted out of a loop or is already
	// used (perhaps because there are multiple permutes with the same shuffle
	// mask?) the vperm has a cost of 1. OTOH, hoisting the permute mask out of
	// the loop requires an extra register.
	//
	// As a compromise, we only emit discrete instructions if the shuffle can be
	// generated in 3 or fewer operations. When we have loop information
	// available, if this block is within a loop, we should avoid using vperm
	// for 3-operation perms and use a constant pool load instead.
	if (Cost < 3)
	return GeneratePerfectShuffle(PFEntry, V1, V2, DAG, dl);
	}			}

				if (--MaskMap[PermMask].first == 0)
				MaskMap.erase(PermMask);

	// Lower this to a VPERM(V1, V2, V3) expression, where V3 is a constant			// Lower this to a VPERM(V1, V2, V3) expression, where V3 is a constant
	// vector that will get spilled to the constant pool.			// vector that will get spilled to the constant pool.
	if (V2.isUndef()) V2 = V1;			if (V2.isUndef()) V2 = V1;

	// The SHUFFLE_VECTOR mask is almost exactly what we want for vperm, except			// The SHUFFLE_VECTOR mask is almost exactly what we want for vperm, except
	// that it is in input element units, not in bytes. Convert now.			// that it is in input element units, not in bytes. Convert now.

	// For little endian, the order of the input vectors is reversed, and			// For little endian, the order of the input vectors is reversed, and
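
The four-element check and the table index computed in the hunk above can be restated as a standalone function. This is a simplified sketch assuming a 16-entry byte mask; `perfectShuffleIndex` is a hypothetical name, the return value of -1 for "not a four-element shuffle" is a convention chosen for this example, and the perfect shuffle table itself is omitted.

```cpp
#include <array>
#include <cassert>

// Returns the perfect shuffle table index for a 16-byte mask, or -1 when the
// mask does not move whole, aligned 4-byte words. Negative mask entries are
// undef; a word whose bytes are all undef is encoded as 8.
int perfectShuffleIndex(const std::array<int, 16> &Mask) {
  unsigned Index = 0;
  for (unsigned I = 0; I != 4; ++I) {   // Word position in the result.
    unsigned EltNo = 8;                 // Start out undef.
    for (unsigned J = 0; J != 4; ++J) { // Byte within the word.
      int Byte = Mask[I * 4 + J];
      if (Byte < 0)
        continue; // Undef byte, ignore it.
      unsigned Src = static_cast<unsigned>(Byte);
      if ((Src & 3) != J)
        return -1; // Byte would change position within its word.
      if (EltNo == 8)
        EltNo = Src / 4;
      else if (EltNo != Src / 4)
        return -1; // Bytes of one word come from different source words.
    }
    Index = Index * 9 + EltNo; // One base-9 digit per result word.
  }
  return static_cast<int>(Index);
}
```

For a mask from the test file in this revision (bytes 4-7, 12-15, 20-23, 28-31), the source words are 1, 3, 5, 7, so the index is 1*9*9*9 + 3*9*9 + 5*9 + 7 = 1024. In the diff, the cost is then read from the top two bits of the table entry (`PFEntry >> 30`).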

llvm/test/CodeGen/PowerPC/perfect-shuffle.ll

; LE-NEXT: blr		; LE-NEXT: blr
%shuf = shufflevector <16 x i8> %v1, <16 x i8> %v2, <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 12, i32 13, i32 14, i32 15, i32 20, i32 21, i32 22, i32 23, i32 28, i32 29, i32 30, i32 31>		%shuf = shufflevector <16 x i8> %v1, <16 x i8> %v2, <16 x i32> <i32 4, i32 5, i32 6, i32 7, i32 12, i32 13, i32 14, i32 15, i32 20, i32 21, i32 22, i32 23, i32 28, i32 29, i32 30, i32 31>
%cast = bitcast <16 x i8> %shuf to <4 x float>		%cast = bitcast <16 x i8> %shuf to <4 x float>
ret <4 x float> %cast		ret <4 x float> %cast
}		}

define <4 x float> @shuffle3(<16 x i8> %v1, <16 x i8> %v2, <16 x i8> %v3, <16 x i8> %v4) {		define <4 x float> @shuffle3(<16 x i8> %v1, <16 x i8> %v2, <16 x i8> %v3, <16 x i8> %v4) {
; BE-LABEL: shuffle3:		; BE-LABEL: shuffle3:
; BE: # %bb.0:		; BE: # %bb.0:
; BE-NEXT: vmrglw 0, 2, 3		; BE-NEXT: addis 3, 2, .LCPI2_0@toc@ha
; BE-NEXT: vmrghw 2, 2, 3		; BE-NEXT: addi 3, 3, .LCPI2_0@toc@l
; BE-NEXT: vmrglw 3, 4, 5		; BE-NEXT: lxv 32, 0(3)
; BE-NEXT: vmrghw 4, 4, 5		; BE-NEXT: vperm 2, 2, 3, 0
; BE-NEXT: vmrghw 2, 2, 0		; BE-NEXT: vperm 3, 4, 5, 0
; BE-NEXT: vmrghw 3, 4, 3
; BE-NEXT: xvaddsp 34, 34, 35		; BE-NEXT: xvaddsp 34, 34, 35
; BE-NEXT: blr		; BE-NEXT: blr
;		;
; LE-LABEL: shuffle3:		; LE-LABEL: shuffle3:
; LE: # %bb.0:		; LE: # %bb.0:
; LE-NEXT: vpkudum 2, 3, 2		; LE-NEXT: vpkudum 2, 3, 2
; LE-NEXT: vpkudum 3, 5, 4		; LE-NEXT: vpkudum 3, 5, 4
; LE-NEXT: xvaddsp 34, 34, 35		; LE-NEXT: xvaddsp 34, 34, 35