This is an archive of the discontinued LLVM Phabricator instance.

Differential D12032

Vector element extraction without stack operations on Power 8
ClosedPublic

Authored by nemanjai on Aug 14 2015, 6:03 AM.

Details

Reviewers

wschmidt
kbarton
seurer
hfinkel

Summary

This patch builds on the patch that provided scalar to vector conversions without stack operations (D11471).
Included in this patch:

Vector element extraction for all vector types with constant element number (both LE and BE)
Vector element extraction for v16i8 and v8i16 with variable element number (both LE and BE)
Removal of some unnecessary COPY_TO_REGCLASS operations that ended up unnecessarily moving things around between registers

Not included in this patch (will be in upcoming patch):

Vector element extraction for v4i32, v4f32, v2i64 and v2f64 with variable element number
Vector element insertion for variable/constant element number

Testing is provided for all extractions. The extractions that are not implemented yet are just placeholders.

Diff Detail

Repository: rL LLVM

Event Timeline

nemanjai updated this revision to Diff 32150. Aug 14 2015, 6:03 AM

nemanjai retitled this revision to Vector element extraction without stack operations on Power 8.

nemanjai updated this object.

nemanjai added reviewers: wschmidt, hfinkel, kbarton, seurer.

nemanjai set the repository for this revision to rL LLVM.

nemanjai added a subscriber: llvm-commits.

nemanjai added inline comments. Aug 14 2015, 6:26 AM

test/CodeGen/PowerPC/p8-scalar_vector_conversions.ll
663	I actually think that the regular expressions for VRA and VRB here are not needed as I think the source vector must be in VR 2. Similarly for all other instances of the vperm.

wschmidt added inline comments. Aug 18 2015, 12:05 PM

lib/Target/PowerPC/PPCInstrVSX.td
1309	This code concerns me, starting with LE_BYTE_2. I see from your test cases that it happens to work, but it looks fragile to me. If you have bytes 7 6 5 4 3 2 1 0 and apply RLDICL 48, 16, you will get X X X X X X 3 2 where the Xs have been cleared to zero. The correct answer is X X X X X X X 2. For RLDICL 40, 24, you will get X X X X X 5 4 3, which is incorrect for both byte and halfword extraction. The pattern you need is (RLDICL LE_DWORD_0 48, 8), followed by 40,8, etc. Then you need separate patterns for LE_HWORD that uses 16 instead of 8. Somehow in your tests you are ending up with extsb instructions that make this work, which I really don't understand based on the code you're specifying here (which just creates an i32, not an i8). I am concerned that those are an artifact that could disappear. Can you explain how those come to be generated?
1334	This looks wrong. If $Idx = 19, (ANDC8 (LI8 8), $Idx) will AND 0...001000 with 1...101100, giving 0...001000 = 8. Basically this expression will always produce either 0 or 8, right? That is surely not what you want. This same issue seems repeated several places below, so I've stopped trying to figure these expressions out for now. Have you done any execution testing to check that the code you're generating works? I'm skeptical that it can.

wschmidt added inline comments. Aug 18 2015, 1:28 PM

lib/Target/PowerPC/PPCInstrVSX.td
1309	Ah, I see why. Your tests are returning an i8, which forces the conversion from i32 to i8 that generates the extsb. Without that, the incorrect code generation would be exposed.

nemanjai added inline comments. Aug 18 2015, 4:11 PM

lib/Target/PowerPC/PPCInstrVSX.td
1309	I think this isn't clearly specified in the language reference, but as far as I can tell, the remainder of the wider register when a value is extracted from a vector is undefined and hence need not be cleared. Namely, when an i8 or i16 is extracted from a vector, it must be extended (sign or zero) since we have no such small registers. Also, I don't think this is necessarily restricted to vector extraction. However we happen to get an i8 or i16 into a register, the SDAG will have a sign_extend or a zero_extend. Even code like this:

    signed char f(signed char c) { return c; }

will end up with the following SDAG nodes:

    0x1003ec39830: i8,ch = load 0x1003ec39708, 0x1003ec39390, 0x1003ec395e0<LD1[%c.addr]> [ORD=4]
    0x1003ec3a8c0: i32 = sign_extend 0x1003ec39830 [ORD=5]

Similarly, with code like this:

    signed char f(vector signed char c) { return c[3]; }

we end up with a very similar pair of SDAG nodes:

    0x10009e69b58: i8 = extract_vector_elt 0x10009e697e0, 0x10009e69a30 [ORD=5]
    0x10009e69c80: i32 = sign_extend 0x10009e69b58 [ORD=6]

And the sign_extend then becomes an extsb. Of course, this does not apply with fast-isel, but then again, neither does any of this patch since it is all based on the SDAG. I used signed char here just as an illustration since you mentioned the extsb instruction, but the equivalent zero_extend applies with the unsigned variants (in both cases). So to summarize, I can add the clear instruction to each case, but I'm quite certain that it will be a redundant instruction. It will also make the table gen logic more complex because I'd need a clear for the unsigned cases and a sign-extend for the signed cases.
1334	I may be missing something here, but always getting a 0 or 8 is precisely what I was after. Basically, if the element is on the left half of the VSR, I don't need to shift it. If it is on the right side of the VSR, I do need to shift it and I need to shift it by EXACTLY 8 bytes. This applies on both LE and BE except that the element numbers that are on the left vs. right half of the VSR are exact opposites. For LE, the bytes 0-7 need to be shifted before the MFVSR (so the expression you mentioned returns 8). Subsequent to this shift, what is then on the left side of the VSR gets moved out to a GPR. Now that I have the right half of the VSR in the GPR, I need to shift to the right by the correct number of bytes (0-7). So element zero needs no shift at this point, whereas elements 1-7 need to be shifted by the equal number of bytes. Similarly, element 8 needs no shift, whereas elements 9-15 need to be shifted by $Idx & 7 bytes. Of course, the shift amount I need to specify to SRD is in bits, hence the reason for the left shift of 3. Here's the complete shift sequence for LE (bytes):

    Elem    LShift in VSR    RShift in GPR (bytes)
    0       8                0
    1       8                1
    2       8                2
    3       8                3
    4       8                4
    5       8                5
    6       8                6
    7       8                7
    8       0                0
    9       0                1
    10      0                2
    11      0                3
    12      0                4
    13      0                5
    14      0                6
    15      0                7

I have worked out similar sequences for the halfwords and for both on BE. Also, I have an execution test case that prints each element of each size vector both as constant element values and as variables. In my testing, the test case was compiled with clang and with gcc and the results compared with diff. I have run this on P8 systems, both LE and BE.
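
The LE byte shift sequence described above can be sketched as a small C++ model. This is only an illustration under stated assumptions: the `VSR` struct, `referenceByteLE`, `extractByteLE` and `matchesReferenceForAllElements` are hypothetical names, the register is modelled as two 64-bit halves of which only the left one can be moved to a GPR, and the final truncation to a byte exists only so results can be compared. It is not the TableGen code from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a 128-bit VSR as two doublewords. A direct move
// (MFVSRD) reads only the left, most significant doubleword.
struct VSR {
  uint64_t Left;
  uint64_t Right;
};

// Reference answer: LE byte element Idx (0-15), counted from the least
// significant byte of the whole register.
uint8_t referenceByteLE(const VSR &V, unsigned Idx) {
  return Idx < 8 ? static_cast<uint8_t>(V.Right >> (8 * Idx))
                 : static_cast<uint8_t>(V.Left >> (8 * (Idx - 8)));
}

// The two-step sequence: shift inside the VSR, move, then shift in the GPR.
uint8_t extractByteLE(const VSR &V, unsigned Idx) {
  // LShift in VSR: 8 bytes for elements 0-7, 0 bytes for elements 8-15.
  // This is the and-with-complement 8 & ~Idx, which is always 0 or 8.
  unsigned LShiftBytes = 8u & ~Idx;
  // After that shift the doubleword holding the element is on the left,
  // so this is what the direct move brings into the GPR.
  uint64_t Moved = (LShiftBytes == 8) ? V.Right : V.Left;
  // RShift in GPR: Idx & 7 bytes, turned into bits by a left shift of 3.
  unsigned RShiftBits = (Idx & 7u) << 3;
  // Truncating to 8 bits is a modelling shortcut for comparing results; it
  // says nothing about which instruction clears the upper bits.
  return static_cast<uint8_t>(Moved >> RShiftBits);
}

// Checks every element of a made-up vector whose byte i holds the value i.
bool matchesReferenceForAllElements() {
  const VSR V{0x0F0E0D0C0B0A0908ULL, 0x0706050403020100ULL};
  for (unsigned Idx = 0; Idx < 16; ++Idx)
    if (extractByteLE(V, Idx) != referenceByteLE(V, Idx) ||
        extractByteLE(V, Idx) != Idx)
      return false;
  return true;
}
```

Note that `8u & ~19u` also evaluates to 8, which is the value worked out by hand earlier in the review for `$Idx = 19`.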

nemanjai added inline comments. Aug 18 2015, 4:32 PM

lib/Target/PowerPC/PPCInstrVSX.td
1309	This code:

    int f(vector signed char c, int i) { return c[3]; }

has the same extsb, which is then followed in this case by an extsw which is redundant, but understandable given the integral promotion.
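
To illustrate why a following extend hides whatever is left in the upper bits of the register, here is a minimal C++ sketch. The helpers `extsb` and `zext8` are hypothetical models of a sign extend and a zero extend from the low byte of a 64-bit register, and the register images `Clean` and `Dirty` are made up for the example; none of this is taken from the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of extsb: sign-extend the low 8 bits of a 64-bit
// register. Whatever the upper 56 bits held is ignored.
constexpr int64_t extsb(uint64_t Reg) {
  return (Reg & 0x80) ? static_cast<int64_t>(Reg & 0xFF) - 256
                      : static_cast<int64_t>(Reg & 0xFF);
}

// Hypothetical model of a zero extend from i8: keep the low 8 bits only.
constexpr uint64_t zext8(uint64_t Reg) { return Reg & 0xFF; }

// Made-up register images: the extracted element is 0x9C in both, but Dirty
// still carries neighbouring vector bytes in its upper bits.
constexpr uint64_t Clean = 0x000000000000009CULL;
constexpr uint64_t Dirty = 0x112233445566779CULL;

// Either extend hides the leftover bits...
static_assert(extsb(Dirty) == extsb(Clean), "sign extend ignores high bits");
static_assert(zext8(Dirty) == zext8(Clean), "zero extend ignores high bits");
// ...but without an extend the two registers hold different values.
static_assert(Dirty != Clean, "unextended value exposes the leftover bits");
```

The last assertion is the case the review worries about: a consumer that reads the register without an intervening extend sees the leftover bits.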

wschmidt added inline comments. Aug 18 2015, 6:08 PM

lib/Target/PowerPC/PPCInstrVSX.td
1309	You are relying on context for your patterns to argue that they will always be correct. That is ok, but it is then incumbent on you to document this extremely well. "Clever tricks" should never be unaccompanied by diligent commentary. The fact is that if somebody were to use your "LE_BYTE_x" patterns in the (perfectly reasonable) belief that this would actually produce a right-adjusted zero- or sign-extended byte, they would be grossly disappointed. Outside of their contextual uses in the vector_extract patterns, these would not produce what their name argues they would produce. They would produce an i32 that is pretending to be an i8 but contains values not representable by an i8. So go ahead and do this, but document the bejeebus out of it. Then if something goes wrong with the contextual assumptions in the future, the poor maintainer will have half a chance of sorting it out. When creating a building block, you have to be aware that others may want to use it in a way you didn't foresee. I understand why you want to do this, but a more proper solution is to do it right and make sure that we have peephole optimizations later on that will remove redundant extend operations. We need those anyway. Therefore, go ahead and do what you're doing, but in addition to documenting it, add commentary indicating what ought to be done in the future. Hal, please feel free to weigh in...
1334	OK. Again, this needs to be better documented. Your commentary for the LShift portion is: "What we do is set up the index by masking off bits we don't need and shifting accordingly." You're not actually masking off any bits for this case, which is what confused me. Based on that commentary, this didn't appear to fulfill your intent. I note that your subsequent explanation to me doesn't have any mention of masking off bits either. So please document more clearly what is going on with these patterns. Too much commentary is never a crime...an expression like dag BE_VBYTE_SHIFT = (EXTRACT_SUBREG (RLDICR (ANDC8 (LI8 7), $Idx), 3, 60), sub_32); without its own line of commentary is kind of a travesty. If you don't mind, I would like to wait for a more heavily documented version before fully reviewing the variable extracts.

wschmidt added inline comments. Aug 18 2015, 6:11 PM

lib/Target/PowerPC/PPCInstrVSX.td
1334	(I guess that converting to 0 or 8 is "masking off bits," but that really isn't very descriptive of what you're doing. You're just isolating the containing doubleword.)
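
As a rough aid for reading the `BE_VBYTE_SHIFT` expression quoted above, the following C++ sketch evaluates `(RLDICR (ANDC8 (LI8 7), $Idx), 3, 60)` on plain integers. The helpers `rotl64`, `rldicr`, `andc` and `beVByteShift` are hypothetical models written for this note, following the usual rotate-then-mask reading of the instructions; the `EXTRACT_SUBREG ... sub_32` step is left out because taking the low 32 bits does not change values this small.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 64-bit value left by N bits (N taken modulo 64).
constexpr uint64_t rotl64(uint64_t X, unsigned N) {
  return (N & 63) == 0 ? X : (X << (N & 63)) | (X >> (64 - (N & 63)));
}

// Model of rldicr RS, SH, ME: rotate left by SH, then keep bits 0..ME in
// IBM numbering (bit 0 is the most significant), clearing the low 63-ME bits.
constexpr uint64_t rldicr(uint64_t RS, unsigned SH, unsigned ME) {
  return rotl64(RS, SH) & (~0ULL << (63 - ME));
}

// Model of andc: AND the first operand with the complement of the second.
constexpr uint64_t andc(uint64_t A, uint64_t B) { return A & ~B; }

// Scalar value of (RLDICR (ANDC8 (LI8 7), $Idx), 3, 60).
constexpr uint64_t beVByteShift(uint64_t Idx) {
  return rldicr(andc(7, Idx), 3, 60);
}

// The result is 8 * (7 - (Idx & 7)): a bit count that shrinks as the
// index within the doubleword grows.
static_assert(beVByteShift(0) == 56, "first byte of a doubleword");
static_assert(beVByteShift(7) == 0, "last byte of a doubleword");
static_assert(beVByteShift(13) == 16, "13 & 7 == 5, so 8 * (7 - 5)");
// The doubleword selector discussed earlier: 0 or 8, and 8 for Idx = 19.
static_assert(andc(8, 19) == 8, "bit 3 of 19 is clear");
```

In this model the value equals the right shift, in bits, that would right-justify big-endian byte `Idx & 7` of a doubleword. How the patch actually consumes the value is not shown in this part of the diff.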

wschmidt added inline comments. Aug 18 2015, 6:54 PM

lib/Target/PowerPC/PPCInstrVSX.td
1309	Actually, I'm going to stick to my guns on this one. I'm not asking you to add another extend instruction. I'm asking you to use the correct form of RLDICL for the LE_BYTE_x and LE_HWORD_x cases. If you change the last parameter to RLDICL to always be 8, you will isolate a byte. Do this for the LE_BYTE_x forms. If you change it to always be 16, you will isolate a halfword. Do this for the LE_HWORD_X forms. This does not cost any extra instructions, and the patterns will generate i32s that contain true i8s and i16s, respectively. Furthermore, this has an added benefit: We can later optimize away the sign- or zero-extend instruction that gets generated by the context. A peephole optimization can easily prove that a rldicl that has already isolated a byte does not need a subsequent extsb or second truncating rldicl. The same is true for isolation of a halfword. On the other hand, if you leave things as you have them, we can never get rid of the extsb, because the optimizer doesn't know that you left nonzero bits in there because you "knew" there would be an extend coming along to fix things up.

hfinkel added inline comments. Aug 19 2015, 1:11 AM

lib/Target/PowerPC/PPCInstrVSX.td
1309	First, Nemanja appears to be right, it is legal to leave the higher-order bits arbitrarily defined. The code in DAGTypeLegalizer::PromoteIntOp_EXTRACT_VECTOR_ELT ends with: // EXTRACT_VECTOR_ELT can return types which are wider than the incoming // element types. If this is the case then we need to expand the outgoing // value and not truncate it. return DAG.getAnyExtOrTrunc(Ext, dl, N->getValueType(0)); Second, Bill is right. We should zero-out the higher-order bits if we can do so without any extra instructions. We already have a peephole optimization in PPCISelDAGToDAG.cpp to eliminate unnecessary i32->i64 extensions (see the PeepholePPC64ZExt() function), and it should be easy to extend it to catch this case as well. Also, it makes "manual" debugging easier to have the values zero extended (and making debugging easier at no additional cost is certainly a good thing). And, for any pattern where you do leave some of the higher-order bits arbitrarily defined, please document that explicitly.

nemanjai added inline comments. Sep 21 2015, 2:12 AM

lib/Target/PowerPC/PPCInstrVSX.td
1309	OK, it appears that the source of my misunderstanding was that I thought there was no way to both shift right and clear high order bits with a single RLDICL instruction. Of course, it is clearly preferable for a moved i8, i16 or i32 to have the high order bits cleared rather than arbitrarily set. I will change the last parameter to the RLDICL instructions to the moved width, test it and update the patch. Thank you both for your feedback.

One major drawback identified by wschmidt with the previous patch was that the element being moved was just right justified in the GPR without clearing the high order bits. Basically, the only change is to use the version of the RLDICL instruction that will not only right justify the value, but will clear the other bits too.
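
The right-justify-and-clear behaviour can be checked with a small C++ model of RLDICL. This is a sketch for illustration: `rotl64` and `rldicl` are hypothetical helpers following the usual rotate-left-then-clear-left reading of the instruction, and `DWord` is a made-up doubleword whose LE byte i holds the value i. The operand pairs are the ones that appear in the `MovesFromVSR` patterns of Diff 35274, for example `(RLDICL LE_DWORD_0, 48, 56)`.

```cpp
#include <cassert>
#include <cstdint>

// Rotate a 64-bit value left by N bits (N taken modulo 64).
constexpr uint64_t rotl64(uint64_t X, unsigned N) {
  return (N & 63) == 0 ? X : (X << (N & 63)) | (X >> (64 - (N & 63)));
}

// Model of rldicl RS, SH, MB: rotate left by SH, then clear the MB most
// significant bits, keeping the low 64-MB bits.
constexpr uint64_t rldicl(uint64_t RS, unsigned SH, unsigned MB) {
  return rotl64(RS, SH) & (~0ULL >> MB);
}

// Made-up doubleword whose LE byte i holds the value i.
constexpr uint64_t DWord = 0x0706050403020100ULL;

// Byte patterns: MB = 56 keeps exactly 8 bits.
static_assert(rldicl(DWord, 0, 56) == 0x00, "LE_BYTE_0 operands");
static_assert(rldicl(DWord, 48, 56) == 0x02, "LE_BYTE_2 operands");
static_assert(rldicl(DWord, 8, 56) == 0x07, "LE_BYTE_7 operands");

// Halfword patterns: MB = 48 keeps exactly 16 bits.
static_assert(rldicl(DWord, 48, 48) == 0x0302, "LE_HALF_1 operands");
static_assert(rldicl(DWord, 16, 48) == 0x0706, "LE_HALF_3 operands");
```

Comparing the `48, 56` and `48, 48` results shows the concern raised earlier in the review: keeping 16 bits where only a byte is wanted leaves byte 3 sitting next to byte 2.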

Sorry, don't review this yet. Another update is coming very soon. I forgot to improve the documentation for the variable element number extract patterns. Stay tuned and sorry about this omission - I was too focused on the RLDICL patterns.

Documented the really complex patterns that extract a variable element number from a vector.

Otherwise, this LGTM. Thanks for addressing all my concerns!

lib/Target/PowerPC/PPCISelLowering.cpp
558	You don't want to reintroduce this restriction, right?
test/CodeGen/PowerPC/p8-scalar_vector_conversions.ll
64	Same here -- don't disable these checks.
96	Do we have an issue open for the unnecessary extsb? We should peephole this away at some point. Please open one if there is not.
735	Similarly, the extsh should be peepholed away eventually.

Bill,
I've opened a work item to track the redundant rldicl instructions for unsigned versions of these. We can optimize those away at some point.
Since Hal's OK with this patch and if you're satisfied with the changes and the follow-up work item, please accept this patch and I'll commit it.

lib/Target/PowerPC/PPCISelLowering.cpp
558	That's just the existing code since the bootstrap failure fix is not committed yet.
test/CodeGen/PowerPC/p8-scalar_vector_conversions.ll
64	Same, that will be part of the commit for the bootstrap failure fix.
735	We need the sign extends as we've discussed on IRC.

Go to it!

This revision is now accepted and ready to land. Sep 29 2015, 11:06 AM

Committed revision 249822.

Revision Contents

Path                                                    Size
lib/Target/PowerPC/PPCISelLowering.cpp                  9 lines
lib/Target/PowerPC/PPCInstrVSX.td                       382 lines
lib/Target/PowerPC/PPCVSXCopy.cpp                       1 line
test/CodeGen/PowerPC/p8-scalar_vector_conversions.ll    1380 lines

Diff 35274

lib/Target/PowerPC/PPCISelLowering.cpp

if (Subtarget.hasAltivec()) {
// Altivec does not contain unordered floating-point compare instructions		// Altivec does not contain unordered floating-point compare instructions
setCondCodeAction(ISD::SETUO, MVT::v4f32, Expand);		setCondCodeAction(ISD::SETUO, MVT::v4f32, Expand);
setCondCodeAction(ISD::SETUEQ, MVT::v4f32, Expand);		setCondCodeAction(ISD::SETUEQ, MVT::v4f32, Expand);
setCondCodeAction(ISD::SETO, MVT::v4f32, Expand);		setCondCodeAction(ISD::SETO, MVT::v4f32, Expand);
setCondCodeAction(ISD::SETONE, MVT::v4f32, Expand);		setCondCodeAction(ISD::SETONE, MVT::v4f32, Expand);

if (Subtarget.hasVSX()) {		if (Subtarget.hasVSX()) {
setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v2f64, Legal);		setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v2f64, Legal);
if (Subtarget.hasP8Vector())		setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v2f64, Legal);
		if (Subtarget.hasP8Vector()) {
setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v4f32, Legal);		setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v4f32, Legal);
		setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v4f32, Legal);
		}
if (Subtarget.hasDirectMove()) {		if (Subtarget.hasDirectMove()) {
setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v16i8, Legal);		setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v16i8, Legal);
setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v8i16, Legal);		setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v8i16, Legal);
setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v4i32, Legal);		setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v4i32, Legal);
// FIXME: this is causing bootstrap failures, disable temporarily		// FIXME: this is causing bootstrap failures, disable temporarily
//setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v2i64, Legal);		//setOperationAction(ISD::SCALAR_TO_VECTOR, MVT::v2i64, Legal);
		setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v16i8, Legal);
		setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v8i16, Legal);
		setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v4i32, Legal);
		setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v2i64, Legal);
}		}
setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v2f64, Legal);		setOperationAction(ISD::EXTRACT_VECTOR_ELT, MVT::v2f64, Legal);

setOperationAction(ISD::FFLOOR, MVT::v2f64, Legal);		setOperationAction(ISD::FFLOOR, MVT::v2f64, Legal);
setOperationAction(ISD::FCEIL, MVT::v2f64, Legal);		setOperationAction(ISD::FCEIL, MVT::v2f64, Legal);
setOperationAction(ISD::FTRUNC, MVT::v2f64, Legal);		setOperationAction(ISD::FTRUNC, MVT::v2f64, Legal);
setOperationAction(ISD::FNEARBYINT, MVT::v2f64, Legal);		setOperationAction(ISD::FNEARBYINT, MVT::v2f64, Legal);
setOperationAction(ISD::FROUND, MVT::v2f64, Legal);		setOperationAction(ISD::FROUND, MVT::v2f64, Legal);

lib/Target/PowerPC/PPCInstrVSX.td

let Predicates = [HasDirectMove, HasVSX] in {
def MTVSRWA : XX1_RS6_RD5_XO<31, 211, (outs vsfrc:$XT), (ins gprc:$rA),		def MTVSRWA : XX1_RS6_RD5_XO<31, 211, (outs vsfrc:$XT), (ins gprc:$rA),
"mtvsrwa $XT, $rA", IIC_VecGeneral,		"mtvsrwa $XT, $rA", IIC_VecGeneral,
[(set f64:$XT, (PPCmtvsra i32:$rA))]>;		[(set f64:$XT, (PPCmtvsra i32:$rA))]>;
def MTVSRWZ : XX1_RS6_RD5_XO<31, 243, (outs vsfrc:$XT), (ins gprc:$rA),		def MTVSRWZ : XX1_RS6_RD5_XO<31, 243, (outs vsfrc:$XT), (ins gprc:$rA),
"mtvsrwz $XT, $rA", IIC_VecGeneral,		"mtvsrwz $XT, $rA", IIC_VecGeneral,
[(set f64:$XT, (PPCmtvsrz i32:$rA))]>;		[(set f64:$XT, (PPCmtvsrz i32:$rA))]>;
} // HasDirectMove, HasVSX		} // HasDirectMove, HasVSX

/* Direct moves of various size entities from GPR's into VSR's. Each lines		/* Direct moves of various widths from GPR's into VSR's. Each move lines
the value up into element 0 (both BE and LE). Namely, entities smaller than		the value up into element 0 (both BE and LE). Namely, entities smaller than
a doubleword are shifted left and moved for BE. For LE, they're moved, then		a doubleword are shifted left and moved for BE. For LE, they're moved, then
swapped to go into the least significant element of the VSR.		swapped to go into the least significant element of the VSR.
*/		*/
def Moves {		def MovesToVSR {
dag BE_BYTE_0 = (MTVSRD		dag BE_BYTE_0 =
		(MTVSRD
(RLDICR		(RLDICR
(INSERT_SUBREG (i64 (IMPLICIT_DEF)), $A, sub_32), 56, 7));		(INSERT_SUBREG (i64 (IMPLICIT_DEF)), $A, sub_32), 56, 7));
dag BE_HALF_0 = (MTVSRD		dag BE_HALF_0 =
		(MTVSRD
(RLDICR		(RLDICR
(INSERT_SUBREG (i64 (IMPLICIT_DEF)), $A, sub_32), 48, 15));		(INSERT_SUBREG (i64 (IMPLICIT_DEF)), $A, sub_32), 48, 15));
dag BE_WORD_0 = (MTVSRD		dag BE_WORD_0 =
		(MTVSRD
(RLDICR		(RLDICR
(INSERT_SUBREG (i64 (IMPLICIT_DEF)), $A, sub_32), 32, 31));		(INSERT_SUBREG (i64 (IMPLICIT_DEF)), $A, sub_32), 32, 31));
dag BE_DWORD_0 = (MTVSRD $A);		dag BE_DWORD_0 = (MTVSRD $A);

dag LE_MTVSRW = (MTVSRD (INSERT_SUBREG (i64 (IMPLICIT_DEF)), $A, sub_32));		dag LE_MTVSRW = (MTVSRD (INSERT_SUBREG (i64 (IMPLICIT_DEF)), $A, sub_32));
dag LE_WORD_1 = (v2i64 (COPY_TO_REGCLASS LE_MTVSRW, VSRC));		dag LE_WORD_1 = (v2i64 (INSERT_SUBREG (v2i64 (IMPLICIT_DEF)),
		LE_MTVSRW, sub_64));
dag LE_WORD_0 = (XXPERMDI LE_WORD_1, LE_WORD_1, 2);		dag LE_WORD_0 = (XXPERMDI LE_WORD_1, LE_WORD_1, 2);
dag LE_DWORD_1 = (v2i64 (COPY_TO_REGCLASS BE_DWORD_0, VSRC));		dag LE_DWORD_1 = (v2i64 (INSERT_SUBREG (v2i64 (IMPLICIT_DEF)),
		BE_DWORD_0, sub_64));
dag LE_DWORD_0 = (XXPERMDI LE_DWORD_1, LE_DWORD_1, 2);		dag LE_DWORD_0 = (XXPERMDI LE_DWORD_1, LE_DWORD_1, 2);
}		}

		/* Direct moves of various widths from VSR's to GPR's. Each moves the
		respective element out of the VSR and ensures that it is lined up
		to the right side of the GPR. In addition to the extraction from positions
		specified by a constant, a pattern for extracting from a variable position
		is provided. This is useful when the element number is not known at
		compile time.
		The numbering for the DAG's is for LE, but when used on BE, the correct
		LE element can just be used (i.e. LE_BYTE_2 == BE_BYTE_13).
		*/
		def MovesFromVSR {
		// Doubleword extraction
		dag LE_DWORD_0 =
		(MFVSRD
		(EXTRACT_SUBREG
		(XXPERMDI (COPY_TO_REGCLASS $S, VSRC),
		(COPY_TO_REGCLASS $S, VSRC), 2), sub_64));
		dag LE_DWORD_1 = (MFVSRD
		(EXTRACT_SUBREG
		(v2i64 (COPY_TO_REGCLASS $S, VSRC)), sub_64));

		// Word extraction
		dag LE_WORD_0 = (MFVSRWZ (EXTRACT_SUBREG (XXSLDWI $S, $S, 2), sub_64));
		dag LE_WORD_1 = (MFVSRWZ (EXTRACT_SUBREG (XXSLDWI $S, $S, 1), sub_64));
		dag LE_WORD_2 = (MFVSRWZ (EXTRACT_SUBREG
		(v2i64 (COPY_TO_REGCLASS $S, VSRC)), sub_64));
		dag LE_WORD_3 = (MFVSRWZ (EXTRACT_SUBREG (XXSLDWI $S, $S, 3), sub_64));

		// Halfword extraction
		dag LE_HALF_0 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 0, 48), sub_32));
		dag LE_HALF_1 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 48, 48), sub_32));
		dag LE_HALF_2 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 32, 48), sub_32));
		dag LE_HALF_3 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 16, 48), sub_32));
		dag LE_HALF_4 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_1, 0, 48), sub_32));
		dag LE_HALF_5 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_1, 48, 48), sub_32));
		dag LE_HALF_6 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_1, 32, 48), sub_32));
		dag LE_HALF_7 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_1, 16, 48), sub_32));

		// Byte extraction
		dag LE_BYTE_0 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 0, 56), sub_32));
		dag LE_BYTE_1 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 56, 56), sub_32));
		dag LE_BYTE_2 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 48, 56), sub_32));
		dag LE_BYTE_3 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 40, 56), sub_32));
		dag LE_BYTE_4 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 32, 56), sub_32));
		dag LE_BYTE_5 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 24, 56), sub_32));
		dag LE_BYTE_6 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 16, 56), sub_32));
		dag LE_BYTE_7 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_0, 8, 56), sub_32));
		dag LE_BYTE_8 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_1, 0, 56), sub_32));
		dag LE_BYTE_9 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_1, 56, 56), sub_32));
		dag LE_BYTE_10 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_1, 48, 56), sub_32));
		dag LE_BYTE_11 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_1, 40, 56), sub_32));
		dag LE_BYTE_12 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_1, 32, 56), sub_32));
		dag LE_BYTE_13 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_1, 24, 56), sub_32));
		dag LE_BYTE_14 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_1, 16, 56), sub_32));
		dag LE_BYTE_15 = (i32 (EXTRACT_SUBREG (RLDICL LE_DWORD_1, 8, 56), sub_32));

		/* Variable element number (BE and LE patterns must be specified separately)
		wschmidtUnsubmitted Not Done Reply Inline Actions This looks wrong. If $Idx = 19, (ANDC8 (LI8 8), $Idx) will AND 0...001000 with 1...101100, giving 0...001000 = 8. Basically this expression will always produce either 0 or 8, right? That is surely not what you want. This same issue seems repeated several places below, so I've stopped trying to figure these expressions out for now. Have you done any execution testing to check that the code you're generating works? I'm skeptical that it can. wschmidt: This looks wrong. If $Idx = 19, (ANDC8 (LI8 8), $Idx) will AND 0...001000 with 1...101100…
		nemanjai (author): I may be missing something here, but always getting a 0 or 8 is precisely what I was after. Basically, if the element is on the left half of the VSR, I don't need to shift it. If it is on the right side of the VSR, I do need to shift it, and I need to shift it by EXACTLY 8 bytes. This applies on both LE and BE, except that the element numbers that are on the left vs. right half of the VSR are exact opposites.
		For LE, bytes 0-7 need to be shifted before the MFVSRD (so the expression you mentioned returns 8). Subsequent to this shift, what is then on the left side of the VSR gets moved out to a GPR. Now that I have the right half of the VSR in the GPR, I need to shift to the right by the correct number of bytes (0-7). So element zero needs no shift at this point, whereas elements 1-7 need to be shifted by the equal number of bytes. Similarly, element 8 needs no shift, whereas elements 9-15 need to be shifted by $Idx & 7 bytes. Of course, the shift amount I need to specify to SRD is in bits, hence the left shift by 3.
		Here's the complete shift sequence for LE (bytes):
		Elem    LShift in VSR    RShift in GPR (bytes)
		0       8                0
		1       8                1
		2       8                2
		3       8                3
		4       8                4
		5       8                5
		6       8                6
		7       8                7
		8       0                0
		9       0                1
		10      0                2
		11      0                3
		12      0                4
		13      0                5
		14      0                6
		15      0                7
		I have worked out similar sequences for the halfwords and for both on BE. Also, I have an execution test case that prints each element of each size vector, both with constant element numbers and with variable ones. In my testing, the test case was compiled with clang and with gcc and the results compared with diff. I have run this on P8 systems, both LE and BE.
		wschmidt: OK. Again, this needs to be better documented. Your commentary for the LShift portion is: "What we do is set up the index by masking off bits we don't need and shifting accordingly." You're not actually masking off any bits for this case, which is what confused me. Based on that commentary, this didn't appear to fulfill your intent. I note that your subsequent explanation to me doesn't have any mention of masking off bits either. So please document more clearly what is going on with these patterns. Too much commentary is never a crime...an expression like dag BE_VBYTE_SHIFT = (EXTRACT_SUBREG (RLDICR (ANDC8 (LI8 7), $Idx), 3, 60), sub_32); without its own line of commentary is kind of a travesty. If you don't mind, I would like to wait for a more heavily documented version before fully reviewing the variable extracts.
		wschmidt: (I guess that converting to 0 or 8 is "masking off bits," but that really isn't very descriptive of what you're doing. You're just isolating the containing doubleword.)
		This is a rather involved process.

		Conceptually, this is how the move is accomplished:
		1. Identify which doubleword contains the element
		2. Shift in the VMX register so that the correct doubleword is correctly
		lined up for the MFVSRD
		3. Perform the move so that the element (along with some extra stuff)
		is in the GPR
		4. Right shift within the GPR so that the element is right-justified

		Of course, the index is an element number which has a different meaning
		on LE/BE so the patterns have to be specified separately.

		Note: The final result will be the element right-justified with high
		order bits being arbitrarily defined (namely, whatever was in the
		vector register to the left of the value originally).
		*/
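
The point debated in the inline thread, that the doubleword selector only ever evaluates to 0 or 8, can be checked with plain integer arithmetic. The sketch below is a Python model rather than TableGen. The function name `le_byte_shifts` is hypothetical, and it assumes ANDC8 computes its first operand AND the complement of its second, as the reviewer's worked example does.

```python
def le_byte_shifts(idx):
    """Return (left shift in VSR, right shift in GPR), both in bytes, for LE v16i8."""
    lshift = 8 & ~idx   # models (ANDC8 (LI8 8), $Idx): 8 when bit 3 of idx is clear, else 0
    rshift = idx & 7    # models (AND8 (LI8 7), $Idx): position within the doubleword
    return lshift, rshift

# Print one row per element number: element, VSR shift, GPR shift.
for idx in range(16):
    print(idx, *le_byte_shifts(idx))
```

Out-of-range indices such as 19 are truncated by the two masks rather than rejected, which is the behaviour the author describes.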

		/* LE variable byte
		Number 1. above:
		- For elements 0-7, we shift left by 8 bytes since they're on the right
		- For elements 8-15, we need not shift (shift left by zero bytes)
		This is accomplished by inverting the bits of the index and AND-ing
		with 0x8 (i.e. clearing all bits of the index and inverting bit 60).
		*/
		dag LE_VBYTE_PERM_VEC = (LVSL ZERO8, (ANDC8 (LI8 8), $Idx));

		// Number 2. above:
		// - Now that we set up the shift amount, we shift in the VMX register
		dag LE_VBYTE_PERMUTE = (VPERM $S, $S, LE_VBYTE_PERM_VEC);

		// Number 3. above:
		// - The doubleword containing our element is moved to a GPR
		dag LE_MV_VBYTE = (MFVSRD
		(EXTRACT_SUBREG
		(v2i64 (COPY_TO_REGCLASS LE_VBYTE_PERMUTE, VSRC)),
		sub_64));

		/* Number 4. above:
		- Truncate the element number to the range 0-7 (8-15 are symmetrical
		and out of range values are truncated accordingly)
		- Multiply by 8 as we need to shift right by the number of bits, not bytes
		- Shift right in the GPR by the calculated value
		*/
		dag LE_VBYTE_SHIFT = (EXTRACT_SUBREG (RLDICR (AND8 (LI8 7), $Idx), 3, 60),
		sub_32);
		dag LE_VARIABLE_BYTE = (EXTRACT_SUBREG (SRD LE_MV_VBYTE, LE_VBYTE_SHIFT),
		sub_32);
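
A minimal end-to-end sketch of the LE variable byte sequence, written in Python as a register-level model. It assumes that LVSL with a zero base yields the bytes sh, sh+1, ..., sh+15, that VPERM indexes the 32-byte concatenation of its two sources, and that LE element i sits in register byte 15 - i. All function names are hypothetical.

```python
def lvsl(sh):
    """Permute control vector for a left shift by sh bytes (low 4 bits of sh)."""
    sh &= 0xF
    return [sh + j for j in range(16)]

def vperm(a, b, ctrl):
    """Select bytes from the concatenation a || b using the low 5 bits of each control byte."""
    src = a + b
    return [src[c & 0x1F] for c in ctrl]

def mfvsrd(reg):
    """Move doubleword 0 (register bytes 0-7) to a 64-bit integer."""
    return int.from_bytes(bytes(reg[:8]), "big")

def le_variable_byte(elems, idx):
    reg = list(reversed(elems))              # LE: element i lives in register byte 15 - i
    perm = vperm(reg, reg, lvsl(8 & ~idx))   # LE_VBYTE_PERM_VEC and LE_VBYTE_PERMUTE
    gpr = mfvsrd(perm)                       # LE_MV_VBYTE
    shift = (idx & 7) << 3                   # LE_VBYTE_SHIFT, in bits
    # The pattern leaves the high bits undefined; mask here only to compare.
    return (gpr >> shift) & 0xFF

elems = [0xA0 + i for i in range(16)]
print([hex(le_variable_byte(elems, i)) for i in range(16)])
```

The final mask is not part of the dag; in the patch the truncation or extension is left to whatever instruction consumes the i32 result.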

		/* BE variable byte
		The algorithm here is the same as the LE variable byte except:
		- The shift in the VMX register is by 0/8 for opposite element numbers so
		we simply AND the element number with 0x8
		- The order of elements after the move to GPR is reversed, so we invert
		the bits of the index prior to truncating to the range 0-7
		*/
		dag BE_VBYTE_PERM_VEC = (LVSL ZERO8, (ANDIo8 $Idx, 8));
		dag BE_VBYTE_PERMUTE = (VPERM $S, $S, BE_VBYTE_PERM_VEC);
		dag BE_MV_VBYTE = (MFVSRD
		(EXTRACT_SUBREG
		(v2i64 (COPY_TO_REGCLASS BE_VBYTE_PERMUTE, VSRC)),
		sub_64));
		dag BE_VBYTE_SHIFT = (EXTRACT_SUBREG (RLDICR (ANDC8 (LI8 7), $Idx), 3, 60),
		sub_32);
		dag BE_VARIABLE_BYTE = (EXTRACT_SUBREG (SRD BE_MV_VBYTE, BE_VBYTE_SHIFT),
		sub_32);
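
The two BE differences listed in the comment can be sketched the same way. This Python model treats the VSR as a 128-bit integer with register byte 0 most significant and treats the LVSL/VPERM pair as a byte rotate. Both are assumptions of the sketch, as are the function names.

```python
MASK128 = (1 << 128) - 1

def rotl_bytes(vsr, nbytes):
    """Rotate a 128-bit value left by nbytes bytes."""
    n = (nbytes % 16) * 8
    return ((vsr << n) | (vsr >> (128 - n))) & MASK128

def be_variable_byte(elems, idx):
    vsr = int.from_bytes(bytes(elems), "big")   # BE: element i lives in register byte i
    perm_shift = idx & 8                        # models (ANDIo8 $Idx, 8): 0 or 8
    gpr = rotl_bytes(vsr, perm_shift) >> 64     # MFVSRD reads the left doubleword
    bit_shift = (7 & ~idx) << 3                 # models RLDICR (ANDC8 (LI8 7), $Idx), 3, 60
    # High bits are undefined in the real pattern; mask only for comparison.
    return (gpr >> bit_shift) & 0xFF

be_elems = [0x30 + i for i in range(16)]
print([hex(be_variable_byte(be_elems, i)) for i in range(16)])
```

Compared with LE, the selector uses the index bit directly and the in-GPR shift uses the inverted low bits, because element 7 (or 15) is the one that ends up right-justified without any shift.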

		/* LE variable halfword
		Number 1. above:
		- For elements 0-3, we shift left by 8 since they're on the right
		- For elements 4-7, we need not shift (shift left by zero bytes)
		Similarly to the byte pattern, we invert the bits of the index, but we
		AND with 0x4 (i.e. clear all bits of the index and invert bit 61).
		Of course, the shift is still by 8 bytes, so we must multiply by 2.
		*/
		dag LE_VHALF_PERM_VEC = (LVSL ZERO8, (RLDICR (ANDC8 (LI8 4), $Idx), 1, 62));

		// Number 2. above:
		// - Now that we set up the shift amount, we shift in the VMX register
		dag LE_VHALF_PERMUTE = (VPERM $S, $S, LE_VHALF_PERM_VEC);

		// Number 3. above:
		// - The doubleword containing our element is moved to a GPR
		dag LE_MV_VHALF = (MFVSRD
		(EXTRACT_SUBREG
		(v2i64 (COPY_TO_REGCLASS LE_VHALF_PERMUTE, VSRC)),
		sub_64));

		/* Number 4. above:
		- Truncate the element number to the range 0-3 (4-7 are symmetrical
		and out of range values are truncated accordingly)
		- Multiply by 16 as we need to shift right by the number of bits
		- Shift right in the GPR by the calculated value
		*/
		dag LE_VHALF_SHIFT = (EXTRACT_SUBREG (RLDICR (AND8 (LI8 3), $Idx), 4, 59),
		sub_32);
		dag LE_VARIABLE_HALF = (EXTRACT_SUBREG (SRD LE_MV_VHALF, LE_VHALF_SHIFT),
		sub_32);

		/* BE variable halfword
		The algorithm here is the same as the LE variable halfword except:
		- The shift in the VMX register is by 0/8 for opposite element numbers so
		we simply AND the element number with 0x4 and multiply by 2
		- The order of elements after the move to GPR is reversed, so we invert
		the bits of the index prior to truncating to the range 0-3
		*/
		dag BE_VHALF_PERM_VEC = (LVSL ZERO8, (RLDICR (ANDIo8 $Idx, 4), 1, 62));
		dag BE_VHALF_PERMUTE = (VPERM $S, $S, BE_VHALF_PERM_VEC);
		dag BE_MV_VHALF = (MFVSRD
		(EXTRACT_SUBREG
		(v2i64 (COPY_TO_REGCLASS BE_VHALF_PERMUTE, VSRC)),
		sub_64));
		dag BE_VHALF_SHIFT = (EXTRACT_SUBREG (RLDICR (ANDC8 (LI8 3), $Idx), 4, 60),
		sub_32);
		dag BE_VARIABLE_HALF = (EXTRACT_SUBREG (SRD BE_MV_VHALF, BE_VHALF_SHIFT),
		sub_32);
		}
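
For the halfword variants, the selector bit moves from 0x8 to 0x4 and the per-element shift grows to 16 bits. The following Python sketch covers both endiannesses under the same assumptions as a register-level model (128-bit integer image of the VSR, LVSL/VPERM treated as a byte rotate). The names `pack_halfwords` and `variable_half` are hypothetical.

```python
MASK128 = (1 << 128) - 1

def rotl_bytes(vsr, nbytes):
    """Rotate a 128-bit value left by nbytes bytes."""
    n = (nbytes % 16) * 8
    return ((vsr << n) | (vsr >> (128 - n))) & MASK128

def pack_halfwords(elems, little_endian):
    """Build the 128-bit register image; register halfword 0 is most significant."""
    order = reversed(elems) if little_endian else elems
    vsr = 0
    for e in order:
        vsr = (vsr << 16) | (e & 0xFFFF)
    return vsr

def variable_half(elems, idx, little_endian):
    vsr = pack_halfwords(elems, little_endian)
    if little_endian:
        perm_shift = (4 & ~idx) << 1   # LE_VHALF_PERM_VEC: 8 for elements 0-3, else 0
        bit_shift = (idx & 3) << 4     # LE_VHALF_SHIFT: 16 bits per halfword
    else:
        perm_shift = (idx & 4) << 1    # BE_VHALF_PERM_VEC: 8 for elements 4-7, else 0
        bit_shift = (3 & ~idx) << 4    # BE_VHALF_SHIFT: element order in the GPR is reversed
    gpr = rotl_bytes(vsr, perm_shift) >> 64
    # High bits are undefined in the real patterns; mask only for comparison.
    return (gpr >> bit_shift) & 0xFFFF

halves = [0x1100 + i for i in range(8)]
print([hex(variable_half(halves, i, True)) for i in range(8)])
print([hex(variable_half(halves, i, False)) for i in range(8)])
```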

		// v4f32 scalar <-> vector conversions (BE)
let Predicates = [IsBigEndian, HasP8Vector] in {		let Predicates = [IsBigEndian, HasP8Vector] in {
def : Pat<(v4f32 (scalar_to_vector f32:$A)),		def : Pat<(v4f32 (scalar_to_vector f32:$A)),
(v4f32 (XSCVDPSPN $A))>;		(v4f32 (XSCVDPSPN $A))>;
		def : Pat<(f32 (vector_extract v4f32:$S, 0)),
		(f32 (XSCVSPDPN $S))>;
		def : Pat<(f32 (vector_extract v4f32:$S, 1)),
		(f32 (XSCVSPDPN (XXSLDWI $S, $S, 1)))>;
		def : Pat<(f32 (vector_extract v4f32:$S, 2)),
		(f32 (XSCVSPDPN (XXSLDWI $S, $S, 2)))>;
		def : Pat<(f32 (vector_extract v4f32:$S, 3)),
		(f32 (XSCVSPDPN (XXSLDWI $S, $S, 3)))>;
} // IsBigEndian, HasP8Vector		} // IsBigEndian, HasP8Vector

let Predicates = [IsBigEndian, HasDirectMove] in {		let Predicates = [IsBigEndian, HasDirectMove] in {
		// v16i8 scalar <-> vector conversions (BE)
def : Pat<(v16i8 (scalar_to_vector i32:$A)),		def : Pat<(v16i8 (scalar_to_vector i32:$A)),
(v16i8 (COPY_TO_REGCLASS Moves.BE_BYTE_0, VSRC))>;		(v16i8 (SUBREG_TO_REG (i64 1), MovesToVSR.BE_BYTE_0, sub_64))>;
def : Pat<(v8i16 (scalar_to_vector i32:$A)),		def : Pat<(v8i16 (scalar_to_vector i32:$A)),
(v8i16 (COPY_TO_REGCLASS Moves.BE_HALF_0, VSRC))>;		(v8i16 (SUBREG_TO_REG (i64 1), MovesToVSR.BE_HALF_0, sub_64))>;
def : Pat<(v4i32 (scalar_to_vector i32:$A)),		def : Pat<(v4i32 (scalar_to_vector i32:$A)),
(v4i32 (COPY_TO_REGCLASS Moves.BE_WORD_0, VSRC))>;		(v4i32 (SUBREG_TO_REG (i64 1), MovesToVSR.BE_WORD_0, sub_64))>;
def : Pat<(v2i64 (scalar_to_vector i64:$A)),		def : Pat<(v2i64 (scalar_to_vector i64:$A)),
(v2i64 (COPY_TO_REGCLASS Moves.BE_DWORD_0, VSRC))>;		(v2i64 (SUBREG_TO_REG (i64 1), MovesToVSR.BE_DWORD_0, sub_64))>;
		def : Pat<(i32 (vector_extract v16i8:$S, 0)),
		(i32 MovesFromVSR.LE_BYTE_15)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 1)),
		(i32 MovesFromVSR.LE_BYTE_14)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 2)),
		(i32 MovesFromVSR.LE_BYTE_13)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 3)),
		(i32 MovesFromVSR.LE_BYTE_12)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 4)),
		(i32 MovesFromVSR.LE_BYTE_11)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 5)),
		(i32 MovesFromVSR.LE_BYTE_10)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 6)),
		(i32 MovesFromVSR.LE_BYTE_9)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 7)),
		(i32 MovesFromVSR.LE_BYTE_8)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 8)),
		(i32 MovesFromVSR.LE_BYTE_7)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 9)),
		(i32 MovesFromVSR.LE_BYTE_6)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 10)),
		(i32 MovesFromVSR.LE_BYTE_5)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 11)),
		(i32 MovesFromVSR.LE_BYTE_4)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 12)),
		(i32 MovesFromVSR.LE_BYTE_3)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 13)),
		(i32 MovesFromVSR.LE_BYTE_2)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 14)),
		(i32 MovesFromVSR.LE_BYTE_1)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 15)),
		(i32 MovesFromVSR.LE_BYTE_0)>;
		def : Pat<(i32 (vector_extract v16i8:$S, i64:$Idx)),
		(i32 MovesFromVSR.BE_VARIABLE_BYTE)>;

		// v8i16 scalar <-> vector conversions (BE)
		def : Pat<(i32 (vector_extract v8i16:$S, 0)),
		(i32 MovesFromVSR.LE_HALF_7)>;
		def : Pat<(i32 (vector_extract v8i16:$S, 1)),
		(i32 MovesFromVSR.LE_HALF_6)>;
		def : Pat<(i32 (vector_extract v8i16:$S, 2)),
		(i32 MovesFromVSR.LE_HALF_5)>;
		def : Pat<(i32 (vector_extract v8i16:$S, 3)),
		(i32 MovesFromVSR.LE_HALF_4)>;
		def : Pat<(i32 (vector_extract v8i16:$S, 4)),
		(i32 MovesFromVSR.LE_HALF_3)>;
		def : Pat<(i32 (vector_extract v8i16:$S, 5)),
		(i32 MovesFromVSR.LE_HALF_2)>;
		def : Pat<(i32 (vector_extract v8i16:$S, 6)),
		(i32 MovesFromVSR.LE_HALF_1)>;
		def : Pat<(i32 (vector_extract v8i16:$S, 7)),
		(i32 MovesFromVSR.LE_HALF_0)>;
		def : Pat<(i32 (vector_extract v8i16:$S, i64:$Idx)),
		(i32 MovesFromVSR.BE_VARIABLE_HALF)>;

		// v4i32 scalar <-> vector conversions (BE)
		def : Pat<(i32 (vector_extract v4i32:$S, 0)),
		(i32 MovesFromVSR.LE_WORD_3)>;
		def : Pat<(i32 (vector_extract v4i32:$S, 1)),
		(i32 MovesFromVSR.LE_WORD_2)>;
		def : Pat<(i32 (vector_extract v4i32:$S, 2)),
		(i32 MovesFromVSR.LE_WORD_1)>;
		def : Pat<(i32 (vector_extract v4i32:$S, 3)),
		(i32 MovesFromVSR.LE_WORD_0)>;

		// v2i64 scalar <-> vector conversions (BE)
		def : Pat<(i64 (vector_extract v2i64:$S, 0)),
		(i64 MovesFromVSR.LE_DWORD_1)>;
		def : Pat<(i64 (vector_extract v2i64:$S, 1)),
		(i64 MovesFromVSR.LE_DWORD_0)>;
} // IsBigEndian, HasDirectMove		} // IsBigEndian, HasDirectMove

		// v4f32 scalar <-> vector conversions (LE)
let Predicates = [IsLittleEndian, HasP8Vector] in {		let Predicates = [IsLittleEndian, HasP8Vector] in {
def : Pat<(v4f32 (scalar_to_vector f32:$A)),		def : Pat<(v4f32 (scalar_to_vector f32:$A)),
(v4f32 (XXSLDWI (XSCVDPSPN $A), (XSCVDPSPN $A), 1))>;		(v4f32 (XXSLDWI (XSCVDPSPN $A), (XSCVDPSPN $A), 1))>;
		def : Pat<(f32 (vector_extract v4f32:$S, 0)),
		(f32 (XSCVSPDPN (XXSLDWI $S, $S, 3)))>;
		def : Pat<(f32 (vector_extract v4f32:$S, 1)),
		(f32 (XSCVSPDPN (XXSLDWI $S, $S, 2)))>;
		def : Pat<(f32 (vector_extract v4f32:$S, 2)),
		(f32 (XSCVSPDPN (XXSLDWI $S, $S, 1)))>;
		def : Pat<(f32 (vector_extract v4f32:$S, 3)),
		(f32 (XSCVSPDPN $S))>;
} // IsLittleEndian, HasP8Vector		} // IsLittleEndian, HasP8Vector

let Predicates = [IsLittleEndian, HasDirectMove] in {		let Predicates = [IsLittleEndian, HasDirectMove] in {
		// v16i8 scalar <-> vector conversions (LE)
def : Pat<(v16i8 (scalar_to_vector i32:$A)),		def : Pat<(v16i8 (scalar_to_vector i32:$A)),
(v16i8 (COPY_TO_REGCLASS Moves.LE_WORD_0, VSRC))>;		(v16i8 (COPY_TO_REGCLASS MovesToVSR.LE_WORD_0, VSRC))>;
def : Pat<(v8i16 (scalar_to_vector i32:$A)),		def : Pat<(v8i16 (scalar_to_vector i32:$A)),
(v8i16 (COPY_TO_REGCLASS Moves.LE_WORD_0, VSRC))>;		(v8i16 (COPY_TO_REGCLASS MovesToVSR.LE_WORD_0, VSRC))>;
def : Pat<(v4i32 (scalar_to_vector i32:$A)),		def : Pat<(v4i32 (scalar_to_vector i32:$A)),
(v4i32 (COPY_TO_REGCLASS Moves.LE_WORD_0, VSRC))>;		(v4i32 MovesToVSR.LE_WORD_0)>;
def : Pat<(v2i64 (scalar_to_vector i64:$A)),		def : Pat<(v2i64 (scalar_to_vector i64:$A)),
(v2i64 Moves.LE_DWORD_0)>;		(v2i64 MovesToVSR.LE_DWORD_0)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 0)),
		(i32 MovesFromVSR.LE_BYTE_0)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 1)),
		(i32 MovesFromVSR.LE_BYTE_1)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 2)),
		(i32 MovesFromVSR.LE_BYTE_2)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 3)),
		(i32 MovesFromVSR.LE_BYTE_3)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 4)),
		(i32 MovesFromVSR.LE_BYTE_4)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 5)),
		(i32 MovesFromVSR.LE_BYTE_5)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 6)),
		(i32 MovesFromVSR.LE_BYTE_6)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 7)),
		(i32 MovesFromVSR.LE_BYTE_7)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 8)),
		(i32 MovesFromVSR.LE_BYTE_8)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 9)),
		(i32 MovesFromVSR.LE_BYTE_9)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 10)),
		(i32 MovesFromVSR.LE_BYTE_10)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 11)),
		(i32 MovesFromVSR.LE_BYTE_11)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 12)),
		(i32 MovesFromVSR.LE_BYTE_12)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 13)),
		(i32 MovesFromVSR.LE_BYTE_13)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 14)),
		(i32 MovesFromVSR.LE_BYTE_14)>;
		def : Pat<(i32 (vector_extract v16i8:$S, 15)),
		(i32 MovesFromVSR.LE_BYTE_15)>;
		def : Pat<(i32 (vector_extract v16i8:$S, i64:$Idx)),
		(i32 MovesFromVSR.LE_VARIABLE_BYTE)>;

		// v8i16 scalar <-> vector conversions (LE)
		def : Pat<(i32 (vector_extract v8i16:$S, 0)),
		(i32 MovesFromVSR.LE_HALF_0)>;
		def : Pat<(i32 (vector_extract v8i16:$S, 1)),
		(i32 MovesFromVSR.LE_HALF_1)>;
		def : Pat<(i32 (vector_extract v8i16:$S, 2)),
		(i32 MovesFromVSR.LE_HALF_2)>;
		def : Pat<(i32 (vector_extract v8i16:$S, 3)),
		(i32 MovesFromVSR.LE_HALF_3)>;
		def : Pat<(i32 (vector_extract v8i16:$S, 4)),
		(i32 MovesFromVSR.LE_HALF_4)>;
		def : Pat<(i32 (vector_extract v8i16:$S, 5)),
		(i32 MovesFromVSR.LE_HALF_5)>;
		def : Pat<(i32 (vector_extract v8i16:$S, 6)),
		(i32 MovesFromVSR.LE_HALF_6)>;
		def : Pat<(i32 (vector_extract v8i16:$S, 7)),
		(i32 MovesFromVSR.LE_HALF_7)>;
		def : Pat<(i32 (vector_extract v8i16:$S, i64:$Idx)),
		(i32 MovesFromVSR.LE_VARIABLE_HALF)>;

		// v4i32 scalar <-> vector conversions (LE)
		def : Pat<(i32 (vector_extract v4i32:$S, 0)),
		(i32 MovesFromVSR.LE_WORD_0)>;
		def : Pat<(i32 (vector_extract v4i32:$S, 1)),
		(i32 MovesFromVSR.LE_WORD_1)>;
		def : Pat<(i32 (vector_extract v4i32:$S, 2)),
		(i32 MovesFromVSR.LE_WORD_2)>;
		def : Pat<(i32 (vector_extract v4i32:$S, 3)),
		(i32 MovesFromVSR.LE_WORD_3)>;

		// v2i64 scalar <-> vector conversions (LE)
		def : Pat<(i64 (vector_extract v2i64:$S, 0)),
		(i64 MovesFromVSR.LE_DWORD_0)>;
		def : Pat<(i64 (vector_extract v2i64:$S, 1)),
		(i64 MovesFromVSR.LE_DWORD_1)>;
} // IsLittleEndian, HasDirectMove		} // IsLittleEndian, HasDirectMove

lib/Target/PowerPC/PPCVSXCopy.cpp

[...]	bool processBlock(MachineBasicBlock &MBB) {
IsVSReg(SrcMO.getReg(), MRI)) {		IsVSReg(SrcMO.getReg(), MRI)) {
// This is a copy from a VSX register to a non-VSX register.		// This is a copy from a VSX register to a non-VSX register.
Changed = true;		Changed = true;

const TargetRegisterClass *DstRC =		const TargetRegisterClass *DstRC =
IsVRReg(DstMO.getReg(), MRI) ? &PPC::VSHRCRegClass :		IsVRReg(DstMO.getReg(), MRI) ? &PPC::VSHRCRegClass :
&PPC::VSLRCRegClass;		&PPC::VSLRCRegClass;
assert((IsF8Reg(DstMO.getReg(), MRI) ||		assert((IsF8Reg(DstMO.getReg(), MRI) ||
		IsVSFReg(DstMO.getReg(), MRI) ||
IsVRReg(DstMO.getReg(), MRI)) &&		IsVRReg(DstMO.getReg(), MRI)) &&
"Unknown destination for a VSX copy");		"Unknown destination for a VSX copy");

// Copy the VSX value into a new VSX register of the correct subclass.		// Copy the VSX value into a new VSX register of the correct subclass.
unsigned NewVReg = MRI.createVirtualRegister(DstRC);		unsigned NewVReg = MRI.createVirtualRegister(DstRC);
BuildMI(MBB, MI, MI->getDebugLoc(),		BuildMI(MBB, MI, MI->getDebugLoc(),
TII->get(TargetOpcode::COPY), NewVReg)		TII->get(TargetOpcode::COPY), NewVReg)
.addOperand(SrcMO);		.addOperand(SrcMO);

test/CodeGen/PowerPC/p8-scalar_vector_conversions.ll

[...]	entry:
%a.addr = alloca i64, align 8		%a.addr = alloca i64, align 8
store i64 %a, i64* %a.addr, align 8		store i64 %a, i64* %a.addr, align 8
%0 = load i64, i64* %a.addr, align 8		%0 = load i64, i64* %a.addr, align 8
%splat.splatinsert = insertelement <2 x i64> undef, i64 %0, i32 0		%splat.splatinsert = insertelement <2 x i64> undef, i64 %0, i32 0
%splat.splat = shufflevector <2 x i64> %splat.splatinsert, <2 x i64> undef, <2 x i32> zeroinitializer		%splat.splat = shufflevector <2 x i64> %splat.splatinsert, <2 x i64> undef, <2 x i32> zeroinitializer
ret <2 x i64> %splat.splat		ret <2 x i64> %splat.splat
; FIXME-CHECK: mtvsrd {{[0-9]+}}, 3		; FIXME-CHECK: mtvsrd {{[0-9]+}}, 3
; FIXME-CHECK-LE: mtvsrd [[REG1:[0-9]+]], 3		; FIXME-CHECK-LE: mtvsrd [[REG1:[0-9]+]], 3
; FIXME-CHECK-LE: xxswapd {{[0-9]+}}, [[REG1]]		; FIXME-CHECK-LE: xxswapd {{[0-9]+}}, [[REG1]]
		wschmidt: Same here -- don't disable these checks.
		nemanjai (author): Same, that will be part of the commit for the bootstrap failure fix.
}		}

; Function Attrs: nounwind		; Function Attrs: nounwind
define <4 x float> @buildf(float %a) {		define <4 x float> @buildf(float %a) {
entry:		entry:
%a.addr = alloca float, align 4		%a.addr = alloca float, align 4
store float %a, float* %a.addr, align 4		store float %a, float* %a.addr, align 4
%0 = load float, float* %a.addr, align 4		%0 = load float, float* %a.addr, align 4
%splat.splatinsert = insertelement <4 x float> undef, float %0, i32 0		%splat.splatinsert = insertelement <4 x float> undef, float %0, i32 0
%splat.splat = shufflevector <4 x float> %splat.splatinsert, <4 x float> undef, <4 x i32> zeroinitializer		%splat.splat = shufflevector <4 x float> %splat.splatinsert, <4 x float> undef, <4 x i32> zeroinitializer
ret <4 x float> %splat.splat		ret <4 x float> %splat.splat
; CHECK: xscvdpspn {{[0-9]+}}, 1		; CHECK: xscvdpspn {{[0-9]+}}, 1
; CHECK-LE: xscvdpspn [[REG1:[0-9]+]], 1		; CHECK-LE: xscvdpspn [[REG1:[0-9]+]], 1
; CHECK-LE: xxsldwi {{[0-9]+}}, [[REG1]], [[REG1]], 1		; CHECK-LE: xxsldwi {{[0-9]+}}, [[REG1]], [[REG1]], 1
}		}

		; Function Attrs: nounwind
		define signext i8 @getsc0(<16 x i8> %vsc) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 0
		ret i8 %vecext
		; CHECK-LABEL: @getsc0
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 8, 56
		; CHECK: extsb 3, 3
		; CHECK-LE-LABEL: @getsc0
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: clrldi 3, 3, 56
		; CHECK-LE: extsb 3, 3
		wschmidt: Do we have an issue open for the unnecessary extsb? We should peephole this away at some point. Please open one if there is not.
		}

		; Function Attrs: nounwind
		define signext i8 @getsc1(<16 x i8> %vsc) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 1
		ret i8 %vecext
		; CHECK-LABEL: @getsc1
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 16, 56
		; CHECK: extsb 3, 3
		; CHECK-LE-LABEL: @getsc1
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 56, 56
		; CHECK-LE: extsb 3, 3
		}

		; Function Attrs: nounwind
		define signext i8 @getsc2(<16 x i8> %vsc) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 2
		ret i8 %vecext
		; CHECK-LABEL: @getsc2
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 24, 56
		; CHECK: extsb 3, 3
		; CHECK-LE-LABEL: @getsc2
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 48, 56
		; CHECK-LE: extsb 3, 3
		}

		; Function Attrs: nounwind
		define signext i8 @getsc3(<16 x i8> %vsc) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 3
		ret i8 %vecext
		; CHECK-LABEL: @getsc3
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 32, 56
		; CHECK: extsb 3, 3
		; CHECK-LE-LABEL: @getsc3
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 40, 56
		; CHECK-LE: extsb 3, 3
		}

		; Function Attrs: nounwind
		define signext i8 @getsc4(<16 x i8> %vsc) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 4
		ret i8 %vecext
		; CHECK-LABEL: @getsc4
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 40, 56
		; CHECK: extsb 3, 3
		; CHECK-LE-LABEL: @getsc4
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 32, 56
		; CHECK-LE: extsb 3, 3
		}

		; Function Attrs: nounwind
		define signext i8 @getsc5(<16 x i8> %vsc) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 5
		ret i8 %vecext
		; CHECK-LABEL: @getsc5
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 48, 56
		; CHECK: extsb 3, 3
		; CHECK-LE-LABEL: @getsc5
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 24, 56
		; CHECK-LE: extsb 3, 3
		}

		; Function Attrs: nounwind
		define signext i8 @getsc6(<16 x i8> %vsc) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 6
		ret i8 %vecext
		; CHECK-LABEL: @getsc6
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 56, 56
		; CHECK: extsb 3, 3
		; CHECK-LE-LABEL: @getsc6
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 16, 56
		; CHECK-LE: extsb 3, 3
		}

		; Function Attrs: nounwind
		define signext i8 @getsc7(<16 x i8> %vsc) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 7
		ret i8 %vecext
		; CHECK-LABEL: @getsc7
		; CHECK: mfvsrd 3, 34
		; CHECK: clrldi 3, 3, 56
		; CHECK: extsb 3, 3
		; CHECK-LE-LABEL: @getsc7
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 8, 56
		; CHECK-LE: extsb 3, 3
		}

		; Function Attrs: nounwind
		define signext i8 @getsc8(<16 x i8> %vsc) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 8
		ret i8 %vecext
		; CHECK-LABEL: @getsc8
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 8, 56
		; CHECK: extsb 3, 3
		; CHECK-LE-LABEL: @getsc8
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: clrldi 3, 3, 56
		; CHECK-LE: extsb 3, 3
		}

		; Function Attrs: nounwind
		define signext i8 @getsc9(<16 x i8> %vsc) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 9
		ret i8 %vecext
		; CHECK-LABEL: @getsc9
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 16, 56
		; CHECK: extsb 3, 3
		; CHECK-LE-LABEL: @getsc9
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 56, 56
		; CHECK-LE: extsb 3, 3
		}

		; Function Attrs: nounwind
		define signext i8 @getsc10(<16 x i8> %vsc) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 10
		ret i8 %vecext
		; CHECK-LABEL: @getsc10
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 24, 56
		; CHECK: extsb 3, 3
		; CHECK-LE-LABEL: @getsc10
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 48, 56
		; CHECK-LE: extsb 3, 3
		}

		; Function Attrs: nounwind
		define signext i8 @getsc11(<16 x i8> %vsc) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 11
		ret i8 %vecext
		; CHECK-LABEL: @getsc11
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 32, 56
		; CHECK: extsb 3, 3
		; CHECK-LE-LABEL: @getsc11
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 40, 56
		; CHECK-LE: extsb 3, 3
		}

		; Function Attrs: nounwind
		define signext i8 @getsc12(<16 x i8> %vsc) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 12
		ret i8 %vecext
		; CHECK-LABEL: @getsc12
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 40, 56
		; CHECK: extsb 3, 3
		; CHECK-LE-LABEL: @getsc12
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 32, 56
		; CHECK-LE: extsb 3, 3
		}

		; Function Attrs: nounwind
		define signext i8 @getsc13(<16 x i8> %vsc) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 13
		ret i8 %vecext
		; CHECK-LABEL: @getsc13
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 48, 56
		; CHECK: extsb 3, 3
		; CHECK-LE-LABEL: @getsc13
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 24, 56
		; CHECK-LE: extsb 3, 3
		}

		; Function Attrs: nounwind
		define signext i8 @getsc14(<16 x i8> %vsc) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 14
		ret i8 %vecext
		; CHECK-LABEL: @getsc14
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 56, 56
		; CHECK: extsb 3, 3
		; CHECK-LE-LABEL: @getsc14
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 16, 56
		; CHECK-LE: extsb 3, 3
		}

		; Function Attrs: nounwind
		define signext i8 @getsc15(<16 x i8> %vsc) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 15
		ret i8 %vecext
		; CHECK-LABEL: @getsc15
		; CHECK: mfvsrd 3,
		; CHECK: extsb 3, 3
		; CHECK-LE-LABEL: @getsc15
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 8, 56
		; CHECK-LE: extsb 3, 3
		}

		; Function Attrs: nounwind
		define zeroext i8 @getuc0(<16 x i8> %vuc) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 0
		ret i8 %vecext
		; CHECK-LABEL: @getuc0
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 8, 56
		; CHECK: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getuc0
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: clrldi 3, 3, 56
		}

		; Function Attrs: nounwind
		define zeroext i8 @getuc1(<16 x i8> %vuc) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 1
		ret i8 %vecext
		; CHECK-LABEL: @getuc1
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 16, 56
		; CHECK: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getuc1
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 56, 56
		; CHECK-LE: clrldi 3, 3, 56
		}

		; Function Attrs: nounwind
		define zeroext i8 @getuc2(<16 x i8> %vuc) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 2
		ret i8 %vecext
		; CHECK-LABEL: @getuc2
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 24, 56
		; CHECK: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getuc2
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 48, 56
		; CHECK-LE: clrldi 3, 3, 56
		}

		; Function Attrs: nounwind
		define zeroext i8 @getuc3(<16 x i8> %vuc) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 3
		ret i8 %vecext
		; CHECK-LABEL: @getuc3
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 32, 56
		; CHECK: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getuc3
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 40, 56
		; CHECK-LE: clrldi 3, 3, 56
		}

		; Function Attrs: nounwind
		define zeroext i8 @getuc4(<16 x i8> %vuc) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 4
		ret i8 %vecext
		; CHECK-LABEL: @getuc4
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 40, 56
		; CHECK: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getuc4
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 32, 56
		; CHECK-LE: clrldi 3, 3, 56
		}

		; Function Attrs: nounwind
		define zeroext i8 @getuc5(<16 x i8> %vuc) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 5
		ret i8 %vecext
		; CHECK-LABEL: @getuc5
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 48, 56
		; CHECK: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getuc5
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 24, 56
		; CHECK-LE: clrldi 3, 3, 56
		}

		; Function Attrs: nounwind
		define zeroext i8 @getuc6(<16 x i8> %vuc) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 6
		ret i8 %vecext
		; CHECK-LABEL: @getuc6
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 56, 56
		; CHECK: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getuc6
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 16, 56
		; CHECK-LE: clrldi 3, 3, 56
		}

		; Function Attrs: nounwind
		define zeroext i8 @getuc7(<16 x i8> %vuc) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 7
		ret i8 %vecext
		; CHECK-LABEL: @getuc7
		; CHECK: mfvsrd 3, 34
		; CHECK: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getuc7
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 8, 56
		; CHECK-LE: clrldi 3, 3, 56
		}

		; Function Attrs: nounwind
		define zeroext i8 @getuc8(<16 x i8> %vuc) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 8
		ret i8 %vecext
		; CHECK-LABEL: @getuc8
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 8, 56
		; CHECK: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getuc8
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: clrldi 3, 3, 56
		}

		; Function Attrs: nounwind
		define zeroext i8 @getuc9(<16 x i8> %vuc) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 9
		ret i8 %vecext
		; CHECK-LABEL: @getuc9
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 16, 56
		; CHECK: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getuc9
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 56, 56
		; CHECK-LE: clrldi 3, 3, 56
		}

		; Function Attrs: nounwind
		define zeroext i8 @getuc10(<16 x i8> %vuc) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 10
		ret i8 %vecext
		; CHECK-LABEL: @getuc10
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 24, 56
		; CHECK: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getuc10
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 48, 56
		; CHECK-LE: clrldi 3, 3, 56
		}

		; Function Attrs: nounwind
		define zeroext i8 @getuc11(<16 x i8> %vuc) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 11
		ret i8 %vecext
		; CHECK-LABEL: @getuc11
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 32, 56
		; CHECK: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getuc11
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 40, 56
		; CHECK-LE: clrldi 3, 3, 56
		}

		; Function Attrs: nounwind
		define zeroext i8 @getuc12(<16 x i8> %vuc) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 12
		ret i8 %vecext
		; CHECK-LABEL: @getuc12
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 40, 56
		; CHECK: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getuc12
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 32, 56
		; CHECK-LE: clrldi 3, 3, 56
		}

		; Function Attrs: nounwind
		define zeroext i8 @getuc13(<16 x i8> %vuc) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 13
		ret i8 %vecext
		; CHECK-LABEL: @getuc13
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 48, 56
		; CHECK: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getuc13
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 24, 56
		; CHECK-LE: clrldi 3, 3, 56
		}

		; Function Attrs: nounwind
		define zeroext i8 @getuc14(<16 x i8> %vuc) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 14
		ret i8 %vecext
		; CHECK-LABEL: @getuc14
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 56, 56
		; CHECK: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getuc14
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 16, 56
		; CHECK-LE: clrldi 3, 3, 56
		}

		; Function Attrs: nounwind
		define zeroext i8 @getuc15(<16 x i8> %vuc) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%vecext = extractelement <16 x i8> %0, i32 15
		ret i8 %vecext
		; CHECK-LABEL: @getuc15
		; CHECK: mfvsrd 3,
		; CHECK: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getuc15
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 8, 56
		; CHECK-LE: clrldi 3, 3, 56
		}
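
The rotate amounts in the constant-index byte checks above follow a regular pattern. The sketch below is a small Python model, not part of the patch, that reproduces it. The helper names `rldicl` and `byte_rotate` are hypothetical, and the model assumes `mfvsrd` has already moved the doubleword holding the element into a GPR.

```python
MASK64 = (1 << 64) - 1


def rldicl(value, shift, mask_begin):
    """Model of rldicl: rotate a 64-bit value left, then clear the high mask_begin bits."""
    value &= MASK64
    rotated = ((value << shift) | (value >> (64 - shift))) & MASK64
    return rotated & (MASK64 >> mask_begin)


def byte_rotate(index, little_endian):
    """Rotate amount that brings byte element `index` into the low 8 bits (assumed model)."""
    lane = index % 8  # position of the element inside its doubleword
    if little_endian:
        # Element 0 of the doubleword is the least significant byte.
        return (64 - 8 * lane) % 64
    # Big endian: element 0 of the doubleword is the most significant byte.
    return (8 * (lane + 1)) % 64


def extract(doubleword, index, little_endian):
    """Apply the rotate-and-mask step to a doubleword already held in a GPR."""
    return rldicl(doubleword, byte_rotate(index, little_endian), 56)
```

A rotate amount of 0 corresponds to the cases where the checks expect only `clrldi` after `mfvsrd`, such as `getuc7` on big endian and `getuc8` on little endian.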

		; Function Attrs: nounwind
		define signext i8 @getvelsc(<16 x i8> %vsc, i32 signext %i) {
		entry:
		%vsc.addr = alloca <16 x i8>, align 16
		%i.addr = alloca i32, align 4
		store <16 x i8> %vsc, <16 x i8>* %vsc.addr, align 16
		store i32 %i, i32* %i.addr, align 4
		%0 = load <16 x i8>, <16 x i8>* %vsc.addr, align 16
		%1 = load i32, i32* %i.addr, align 4
		%vecext = extractelement <16 x i8> %0, i32 %1
		ret i8 %vecext
		; CHECK-LABEL: @getvelsc
		nemanjai (author): I actually think that the regular expressions for VRA and VRB here are not needed as I think the source vector must be in VR 2. Similarly for all other instances of the vperm.
		; CHECK-DAG: andi. [[ANDI:[0-9]+]], {{[0-9]+}}, 8
		; CHECK-DAG: lvsl [[SHMSK:[0-9]+]], 0, [[ANDI]]
		; CHECK-DAG: vperm [[PERMD:[0-9]+]], {{[0-9]+}}, {{[0-9]+}}, [[SHMSK]]
		; CHECK-DAG: mfvsrd [[MOV:[0-9]+]],
		; CHECK-DAG: li [[IMM7:[0-9]+]], 7
		; CHECK-DAG: andc [[ANDC:[0-9]+]], [[IMM7]]
		; CHECK-DAG: sldi [[SHL:[0-9]+]], [[ANDC]], 3
		; CHECK-DAG: srd 3, [[MOV]], [[SHL]]
		; CHECK-DAG: extsb 3, 3
		; CHECK-LE-LABEL: @getvelsc
		; CHECK-LE-DAG: li [[IMM8:[0-9]+]], 8
		; CHECK-LE-DAG: andc [[ANDC:[0-9]+]], [[IMM8]]
		; CHECK-LE-DAG: lvsl [[SHMSK:[0-9]+]], 0, [[ANDC]]
		; CHECK-LE-DAG: vperm [[PERMD:[0-9]+]], {{[0-9]+}}, {{[0-9]+}}, [[SHMSK]]
		; CHECK-LE-DAG: mfvsrd [[MOV:[0-9]+]],
		; CHECK-LE-DAG: li [[IMM7:[0-9]+]], 7
		; CHECK-LE-DAG: and [[AND:[0-9]+]], [[IMM7]]
		; CHECK-LE-DAG: sldi [[SHL:[0-9]+]], [[AND]], 3
		; CHECK-LE-DAG: srd 3, [[MOV]], [[SHL]]
		; CHECK-LE-DAG: extsb 3, 3
		}

		; Function Attrs: nounwind
		define zeroext i8 @getveluc(<16 x i8> %vuc, i32 signext %i) {
		entry:
		%vuc.addr = alloca <16 x i8>, align 16
		%i.addr = alloca i32, align 4
		store <16 x i8> %vuc, <16 x i8>* %vuc.addr, align 16
		store i32 %i, i32* %i.addr, align 4
		%0 = load <16 x i8>, <16 x i8>* %vuc.addr, align 16
		%1 = load i32, i32* %i.addr, align 4
		%vecext = extractelement <16 x i8> %0, i32 %1
		ret i8 %vecext
		; CHECK-LABEL: @getveluc
		; CHECK-DAG: andi. [[ANDI:[0-9]+]], {{[0-9]+}}, 8
		; CHECK-DAG: lvsl [[SHMSK:[0-9]+]], 0, [[ANDI]]
		; CHECK-DAG: vperm [[PERMD:[0-9]+]], {{[0-9]+}}, {{[0-9]+}}, [[SHMSK]]
		; CHECK-DAG: mfvsrd [[MOV:[0-9]+]],
		; CHECK-DAG: li [[IMM7:[0-9]+]], 7
		; CHECK-DAG: andc [[ANDC:[0-9]+]], [[IMM7]]
		; CHECK-DAG: sldi [[SHL:[0-9]+]], [[ANDC]], 3
		; CHECK-DAG: srd 3, [[MOV]], [[SHL]]
		; CHECK-DAG: clrldi 3, 3, 56
		; CHECK-LE-LABEL: @getveluc
		; CHECK-LE-DAG: li [[IMM8:[0-9]+]], 8
		; CHECK-LE-DAG: andc [[ANDC:[0-9]+]], [[IMM8]]
		; CHECK-LE-DAG: lvsl [[SHMSK:[0-9]+]], 0, [[ANDC]]
		; CHECK-LE-DAG: vperm [[PERMD:[0-9]+]], {{[0-9]+}}, {{[0-9]+}}, [[SHMSK]]
		; CHECK-LE-DAG: mfvsrd [[MOV:[0-9]+]],
		; CHECK-LE-DAG: li [[IMM7:[0-9]+]], 7
		; CHECK-LE-DAG: and [[AND:[0-9]+]], [[IMM7]]
		; CHECK-LE-DAG: sldi [[SHL:[0-9]+]], [[AND]], 3
		; CHECK-LE-DAG: srd 3, [[MOV]], [[SHL]]
		; CHECK-LE-DAG: clrldi 3, 3, 56
		}
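
The variable-index checks for `getvelsc` and `getveluc` describe a two-step selection: first bring the doubleword that holds the element into doubleword 0, then shift the element down to the low byte. The following Python model is only a sketch of that reading of the check lines. It assumes both `vperm` inputs are the source vector (so `lvsl` plus `vperm` acts as a byte rotate), and `extract_byte` is a hypothetical name.

```python
MASK128 = (1 << 128) - 1


def extract_byte(elements, index, little_endian):
    """Model the lvsl/vperm/mfvsrd/srd sequence for a 16-byte vector.

    `elements` lists the 16 bytes in element order (element 0 first).
    """
    # Register image: element 0 is the most significant byte on big endian
    # and the least significant byte on little endian.
    reg = int.from_bytes(bytes(elements), "little" if little_endian else "big")
    if little_endian:
        shift_bytes = 8 & ~index         # li 8; andc
        shift_bits = (index & 7) << 3    # li 7; and; sldi 3
    else:
        shift_bytes = index & 8          # andi. 8
        shift_bits = (7 & ~index) << 3   # li 7; andc; sldi 3
    rot = 8 * shift_bytes
    # lvsl + vperm with identical inputs: rotate left by shift_bytes bytes.
    rotated = ((reg << rot) | (reg >> (128 - rot))) & MASK128
    dw0 = rotated >> 64                  # mfvsrd reads doubleword 0
    return (dw0 >> shift_bits) & 0xFF    # srd, then clrldi 56 (unsigned case)
```

For the signed variant the final mask would be followed by a sign extension of the low byte, matching the `extsb` in the `getvelsc` checks.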

		; Function Attrs: nounwind
		define signext i16 @getss0(<8 x i16> %vss) {
		entry:
		%vss.addr = alloca <8 x i16>, align 16
		store <8 x i16> %vss, <8 x i16>* %vss.addr, align 16
		%0 = load <8 x i16>, <8 x i16>* %vss.addr, align 16
		%vecext = extractelement <8 x i16> %0, i32 0
		ret i16 %vecext
		; CHECK-LABEL: @getss0
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 16, 48
		; CHECK: extsh 3, 3
		; CHECK-LE-LABEL: @getss0
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: clrldi 3, 3, 48
		; CHECK-LE: extsh 3, 3
		wschmidt: Similarly, the extsh should be peepholed away eventually.
		nemanjai (author): We need the sign extends as we've discussed on IRC.
		}
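
As background to the exchange above about `extsh`: `rldicl` and `clrldi` with a mask begin of 48 leave the halfword zero-extended in the register, while the function is declared `signext i16`. The Python sketch below illustrates the difference for a negative element. It is an illustrative model only, and the helper names are hypothetical.

```python
def clrldi48(value):
    """Model of clrldi x, x, 48: keep only the low 16 bits (zero extension)."""
    return value & 0xFFFF


def extsh(value):
    """Model of extsh: sign-extend the low 16 bits, returned as a Python int."""
    low = value & 0xFFFF
    return low - 0x10000 if low & 0x8000 else low


# A vector element holding -1 occupies the halfword as 0xFFFF.
halfword_bits = 0xFFFF
zero_extended = clrldi48(halfword_bits)  # what the extraction alone leaves behind
sign_extended = extsh(zero_extended)     # what a signext i16 return value requires
```

Without the `extsh`, a caller relying on the `signext` attribute would see 65535 instead of -1 for this element.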

		; Function Attrs: nounwind
		define signext i16 @getss1(<8 x i16> %vss) {
		entry:
		%vss.addr = alloca <8 x i16>, align 16
		store <8 x i16> %vss, <8 x i16>* %vss.addr, align 16
		%0 = load <8 x i16>, <8 x i16>* %vss.addr, align 16
		%vecext = extractelement <8 x i16> %0, i32 1
		ret i16 %vecext
		; CHECK-LABEL: @getss1
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 32, 48
		; CHECK: extsh 3, 3
		; CHECK-LE-LABEL: @getss1
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 48, 48
		; CHECK-LE: extsh 3, 3
		}

		; Function Attrs: nounwind
		define signext i16 @getss2(<8 x i16> %vss) {
		entry:
		%vss.addr = alloca <8 x i16>, align 16
		store <8 x i16> %vss, <8 x i16>* %vss.addr, align 16
		%0 = load <8 x i16>, <8 x i16>* %vss.addr, align 16
		%vecext = extractelement <8 x i16> %0, i32 2
		ret i16 %vecext
		; CHECK-LABEL: @getss2
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 48, 48
		; CHECK: extsh 3, 3
		; CHECK-LE-LABEL: @getss2
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 32, 48
		; CHECK-LE: extsh 3, 3
		}

		; Function Attrs: nounwind
		define signext i16 @getss3(<8 x i16> %vss) {
		entry:
		%vss.addr = alloca <8 x i16>, align 16
		store <8 x i16> %vss, <8 x i16>* %vss.addr, align 16
		%0 = load <8 x i16>, <8 x i16>* %vss.addr, align 16
		%vecext = extractelement <8 x i16> %0, i32 3
		ret i16 %vecext
		; CHECK-LABEL: @getss3
		; CHECK: mfvsrd 3, 34
		; CHECK: clrldi 3, 3, 48
		; CHECK: extsh 3, 3
		; CHECK-LE-LABEL: @getss3
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 16, 48
		; CHECK-LE: extsh 3, 3
		}

		; Function Attrs: nounwind
		define signext i16 @getss4(<8 x i16> %vss) {
		entry:
		%vss.addr = alloca <8 x i16>, align 16
		store <8 x i16> %vss, <8 x i16>* %vss.addr, align 16
		%0 = load <8 x i16>, <8 x i16>* %vss.addr, align 16
		%vecext = extractelement <8 x i16> %0, i32 4
		ret i16 %vecext
		; CHECK-LABEL: @getss4
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 16, 48
		; CHECK: extsh 3, 3
		; CHECK-LE-LABEL: @getss4
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: clrldi 3, 3, 48
		; CHECK-LE: extsh 3, 3
		}

		; Function Attrs: nounwind
		define signext i16 @getss5(<8 x i16> %vss) {
		entry:
		%vss.addr = alloca <8 x i16>, align 16
		store <8 x i16> %vss, <8 x i16>* %vss.addr, align 16
		%0 = load <8 x i16>, <8 x i16>* %vss.addr, align 16
		%vecext = extractelement <8 x i16> %0, i32 5
		ret i16 %vecext
		; CHECK-LABEL: @getss5
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 32, 48
		; CHECK: extsh 3, 3
		; CHECK-LE-LABEL: @getss5
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 48, 48
		; CHECK-LE: extsh 3, 3
		}

		; Function Attrs: nounwind
		define signext i16 @getss6(<8 x i16> %vss) {
		entry:
		%vss.addr = alloca <8 x i16>, align 16
		store <8 x i16> %vss, <8 x i16>* %vss.addr, align 16
		%0 = load <8 x i16>, <8 x i16>* %vss.addr, align 16
		%vecext = extractelement <8 x i16> %0, i32 6
		ret i16 %vecext
		; CHECK-LABEL: @getss6
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 48, 48
		; CHECK: extsh 3, 3
		; CHECK-LE-LABEL: @getss6
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 32, 48
		; CHECK-LE: extsh 3, 3
		}

		; Function Attrs: nounwind
		define signext i16 @getss7(<8 x i16> %vss) {
		entry:
		%vss.addr = alloca <8 x i16>, align 16
		store <8 x i16> %vss, <8 x i16>* %vss.addr, align 16
		%0 = load <8 x i16>, <8 x i16>* %vss.addr, align 16
		%vecext = extractelement <8 x i16> %0, i32 7
		ret i16 %vecext
		; CHECK-LABEL: @getss7
		; CHECK: mfvsrd 3,
		; CHECK: extsh 3, 3
		; CHECK-LE-LABEL: @getss7
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 16, 48
		; CHECK-LE: extsh 3, 3
		}

		; Function Attrs: nounwind
		define zeroext i16 @getus0(<8 x i16> %vus) {
		entry:
		%vus.addr = alloca <8 x i16>, align 16
		store <8 x i16> %vus, <8 x i16>* %vus.addr, align 16
		%0 = load <8 x i16>, <8 x i16>* %vus.addr, align 16
		%vecext = extractelement <8 x i16> %0, i32 0
		ret i16 %vecext
		; CHECK-LABEL: @getus0
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 16, 48
		; CHECK: clrldi 3, 3, 48
		; CHECK-LE-LABEL: @getus0
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: clrldi 3, 3, 48
		}

		; Function Attrs: nounwind
		define zeroext i16 @getus1(<8 x i16> %vus) {
		entry:
		%vus.addr = alloca <8 x i16>, align 16
		store <8 x i16> %vus, <8 x i16>* %vus.addr, align 16
		%0 = load <8 x i16>, <8 x i16>* %vus.addr, align 16
		%vecext = extractelement <8 x i16> %0, i32 1
		ret i16 %vecext
		; CHECK-LABEL: @getus1
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 32, 48
		; CHECK: clrldi 3, 3, 48
		; CHECK-LE-LABEL: @getus1
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 48, 48
		; CHECK-LE: clrldi 3, 3, 48
		}

		; Function Attrs: nounwind
		define zeroext i16 @getus2(<8 x i16> %vus) {
		entry:
		%vus.addr = alloca <8 x i16>, align 16
		store <8 x i16> %vus, <8 x i16>* %vus.addr, align 16
		%0 = load <8 x i16>, <8 x i16>* %vus.addr, align 16
		%vecext = extractelement <8 x i16> %0, i32 2
		ret i16 %vecext
		; CHECK-LABEL: @getus2
		; CHECK: mfvsrd 3, 34
		; CHECK: rldicl 3, 3, 48, 48
		; CHECK: clrldi 3, 3, 48
		; CHECK-LE-LABEL: @getus2
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 32, 48
		; CHECK-LE: clrldi 3, 3, 48
		}

		; Function Attrs: nounwind
		define zeroext i16 @getus3(<8 x i16> %vus) {
		entry:
		%vus.addr = alloca <8 x i16>, align 16
		store <8 x i16> %vus, <8 x i16>* %vus.addr, align 16
		%0 = load <8 x i16>, <8 x i16>* %vus.addr, align 16
		%vecext = extractelement <8 x i16> %0, i32 3
		ret i16 %vecext
		; CHECK-LABEL: @getus3
		; CHECK: mfvsrd 3, 34
		; CHECK: clrldi 3, 3, 48
		; CHECK-LE-LABEL: @getus3
		; CHECK-LE: mfvsrd 3,
		; CHECK-LE: rldicl 3, 3, 16, 48
		; CHECK-LE: clrldi 3, 3, 48
		}

		; Function Attrs: nounwind
		define zeroext i16 @getus4(<8 x i16> %vus) {
		entry:
		%vus.addr = alloca <8 x i16>, align 16
		store <8 x i16> %vus, <8 x i16>* %vus.addr, align 16
		%0 = load <8 x i16>, <8 x i16>* %vus.addr, align 16
		%vecext = extractelement <8 x i16> %0, i32 4
		ret i16 %vecext
		; CHECK-LABEL: @getus4
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 16, 48
		; CHECK: clrldi 3, 3, 48
		; CHECK-LE-LABEL: @getus4
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: clrldi 3, 3, 48
		}

		; Function Attrs: nounwind
		define zeroext i16 @getus5(<8 x i16> %vus) {
		entry:
		%vus.addr = alloca <8 x i16>, align 16
		store <8 x i16> %vus, <8 x i16>* %vus.addr, align 16
		%0 = load <8 x i16>, <8 x i16>* %vus.addr, align 16
		%vecext = extractelement <8 x i16> %0, i32 5
		ret i16 %vecext
		; CHECK-LABEL: @getus5
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 32, 48
		; CHECK: clrldi 3, 3, 48
		; CHECK-LE-LABEL: @getus5
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 48, 48
		; CHECK-LE: clrldi 3, 3, 48
		}

		; Function Attrs: nounwind
		define zeroext i16 @getus6(<8 x i16> %vus) {
		entry:
		%vus.addr = alloca <8 x i16>, align 16
		store <8 x i16> %vus, <8 x i16>* %vus.addr, align 16
		%0 = load <8 x i16>, <8 x i16>* %vus.addr, align 16
		%vecext = extractelement <8 x i16> %0, i32 6
		ret i16 %vecext
		; CHECK-LABEL: @getus6
		; CHECK: mfvsrd 3,
		; CHECK: rldicl 3, 3, 48, 48
		; CHECK: clrldi 3, 3, 48
		; CHECK-LE-LABEL: @getus6
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 32, 48
		; CHECK-LE: clrldi 3, 3, 48
		}

		; Function Attrs: nounwind
		define zeroext i16 @getus7(<8 x i16> %vus) {
		entry:
		%vus.addr = alloca <8 x i16>, align 16
		store <8 x i16> %vus, <8 x i16>* %vus.addr, align 16
		%0 = load <8 x i16>, <8 x i16>* %vus.addr, align 16
		%vecext = extractelement <8 x i16> %0, i32 7
		ret i16 %vecext
		; CHECK-LABEL: @getus7
		; CHECK: mfvsrd 3,
		; CHECK: clrldi 3, 3, 48
		; CHECK-LE-LABEL: @getus7
		; CHECK-LE: mfvsrd 3, 34
		; CHECK-LE: rldicl 3, 3, 16, 48
		; CHECK-LE: clrldi 3, 3, 48
		}

		; Function Attrs: nounwind
		define signext i16 @getvelss(<8 x i16> %vss, i32 signext %i) {
		entry:
		%vss.addr = alloca <8 x i16>, align 16
		%i.addr = alloca i32, align 4
		store <8 x i16> %vss, <8 x i16>* %vss.addr, align 16
		store i32 %i, i32* %i.addr, align 4
		%0 = load <8 x i16>, <8 x i16>* %vss.addr, align 16
		%1 = load i32, i32* %i.addr, align 4
		%vecext = extractelement <8 x i16> %0, i32 %1
		ret i16 %vecext
		; CHECK-LABEL: @getvelss
		; CHECK-DAG: andi. [[ANDI:[0-9]+]], {{[0-9]+}}, 4
		; CHECK-DAG: sldi [[MUL2:[0-9]+]], [[ANDI]], 1
		; CHECK-DAG: lvsl [[SHMSK:[0-9]+]], 0, [[MUL2]]
		; CHECK-DAG: vperm [[PERMD:[0-9]+]], {{[0-9]+}}, {{[0-9]+}}, [[SHMSK]]
		; CHECK-DAG: mfvsrd [[MOV:[0-9]+]],
		; CHECK-DAG: li [[IMM3:[0-9]+]], 3
		; CHECK-DAG: andc [[ANDC:[0-9]+]], [[IMM3]]
		; CHECK-DAG: rldicr [[SHL:[0-9]+]], [[ANDC]], 4, 60
		; CHECK-DAG: srd 3, [[MOV]], [[SHL]]
		; CHECK-DAG: extsh 3, 3
		; CHECK-LE-LABEL: @getvelss
		; CHECK-LE-DAG: li [[IMM4:[0-9]+]], 4
		; CHECK-LE-DAG: andc [[ANDC:[0-9]+]], [[IMM4]]
		; CHECK-LE-DAG: sldi [[MUL2:[0-9]+]], [[ANDC]], 1
		; CHECK-LE-DAG: lvsl [[SHMSK:[0-9]+]], 0, [[MUL2]]
		; CHECK-LE-DAG: vperm [[PERMD:[0-9]+]], {{[0-9]+}}, {{[0-9]+}}, [[SHMSK]]
		; CHECK-LE-DAG: mfvsrd [[MOV:[0-9]+]],
		; CHECK-LE-DAG: li [[IMM3:[0-9]+]], 3
		; CHECK-LE-DAG: and [[AND:[0-9]+]], [[IMM3]]
		; CHECK-LE-DAG: sldi [[SHL:[0-9]+]], [[AND]], 4
		; CHECK-LE-DAG: srd 3, [[MOV]], [[SHL]]
		; CHECK-LE-DAG: extsh 3, 3
		}

		; Function Attrs: nounwind
		define zeroext i16 @getvelus(<8 x i16> %vus, i32 signext %i) {
		entry:
		%vus.addr = alloca <8 x i16>, align 16
		%i.addr = alloca i32, align 4
		store <8 x i16> %vus, <8 x i16>* %vus.addr, align 16
		store i32 %i, i32* %i.addr, align 4
		%0 = load <8 x i16>, <8 x i16>* %vus.addr, align 16
		%1 = load i32, i32* %i.addr, align 4
		%vecext = extractelement <8 x i16> %0, i32 %1
		ret i16 %vecext
		; CHECK-LABEL: @getvelus
		; CHECK-DAG: andi. [[ANDI:[0-9]+]], {{[0-9]+}}, 4
		; CHECK-DAG: sldi [[MUL2:[0-9]+]], [[ANDI]], 1
		; CHECK-DAG: lvsl [[SHMSK:[0-9]+]], 0, [[MUL2]]
		; CHECK-DAG: vperm [[PERMD:[0-9]+]], {{[0-9]+}}, {{[0-9]+}}, [[SHMSK]]
		; CHECK-DAG: mfvsrd [[MOV:[0-9]+]],
		; CHECK-DAG: li [[IMM3:[0-9]+]], 3
		; CHECK-DAG: andc [[ANDC:[0-9]+]], [[IMM3]]
		; CHECK-DAG: rldicr [[SHL:[0-9]+]], [[ANDC]], 4, 60
		; CHECK-DAG: srd 3, [[MOV]], [[SHL]]
		; CHECK-DAG: clrldi 3, 3, 48
		; CHECK-LE-LABEL: @getvelus
		; CHECK-LE-DAG: li [[IMM4:[0-9]+]], 4
		; CHECK-LE-DAG: andc [[ANDC:[0-9]+]], [[IMM4]]
		; CHECK-LE-DAG: sldi [[MUL2:[0-9]+]], [[ANDC]], 1
		; CHECK-LE-DAG: lvsl [[SHMSK:[0-9]+]], 0, [[MUL2]]
		; CHECK-LE-DAG: vperm [[PERMD:[0-9]+]], {{[0-9]+}}, {{[0-9]+}}, [[SHMSK]]
		; CHECK-LE-DAG: mfvsrd [[MOV:[0-9]+]],
		; CHECK-LE-DAG: li [[IMM3:[0-9]+]], 3
		; CHECK-LE-DAG: and [[AND:[0-9]+]], [[IMM3]]
		; CHECK-LE-DAG: sldi [[SHL:[0-9]+]], [[AND]], 4
		; CHECK-LE-DAG: srd 3, [[MOV]], [[SHL]]
		; CHECK-LE-DAG: clrldi 3, 3, 48
		}

		; Function Attrs: nounwind
		define signext i32 @getsi0(<4 x i32> %vsi) {
		entry:
		%vsi.addr = alloca <4 x i32>, align 16
		store <4 x i32> %vsi, <4 x i32>* %vsi.addr, align 16
		%0 = load <4 x i32>, <4 x i32>* %vsi.addr, align 16
		%vecext = extractelement <4 x i32> %0, i32 0
		ret i32 %vecext
		; CHECK-LABEL: @getsi0
		; CHECK: xxsldwi [[SHL:[0-9]+]], 34, 34, 3
		; CHECK: mfvsrwz 3, [[SHL]]
		; CHECK: extsw 3, 3
		; CHECK-LE-LABEL: @getsi0
		; CHECK-LE: xxsldwi [[SHL:[0-9]+]], 34, 34, 2
		; CHECK-LE: mfvsrwz 3, [[SHL]]
		; CHECK-LE: extsw 3, 3
		}

		; Function Attrs: nounwind
		define signext i32 @getsi1(<4 x i32> %vsi) {
		entry:
		%vsi.addr = alloca <4 x i32>, align 16
		store <4 x i32> %vsi, <4 x i32>* %vsi.addr, align 16
		%0 = load <4 x i32>, <4 x i32>* %vsi.addr, align 16
		%vecext = extractelement <4 x i32> %0, i32 1
		ret i32 %vecext
		; CHECK-LABEL: @getsi1
		; CHECK: mfvsrwz 3, 34
		; CHECK: extsw 3, 3
		; CHECK-LE-LABEL: @getsi1
		; CHECK-LE: xxsldwi [[SHL:[0-9]+]], 34, 34, 1
		; CHECK-LE: mfvsrwz 3, [[SHL]]
		; CHECK-LE: extsw 3, 3
		}

		; Function Attrs: nounwind
		define signext i32 @getsi2(<4 x i32> %vsi) {
		entry:
		%vsi.addr = alloca <4 x i32>, align 16
		store <4 x i32> %vsi, <4 x i32>* %vsi.addr, align 16
		%0 = load <4 x i32>, <4 x i32>* %vsi.addr, align 16
		%vecext = extractelement <4 x i32> %0, i32 2
		ret i32 %vecext
		; CHECK-LABEL: @getsi2
		; CHECK: xxsldwi [[SHL:[0-9]+]], 34, 34, 1
		; CHECK: mfvsrwz 3, [[SHL]]
		; CHECK: extsw 3, 3
		; CHECK-LE-LABEL: @getsi2
		; CHECK-LE: mfvsrwz 3, 34
		; CHECK-LE: extsw 3, 3
		}

		; Function Attrs: nounwind
		define signext i32 @getsi3(<4 x i32> %vsi) {
		entry:
		%vsi.addr = alloca <4 x i32>, align 16
		store <4 x i32> %vsi, <4 x i32>* %vsi.addr, align 16
		%0 = load <4 x i32>, <4 x i32>* %vsi.addr, align 16
		%vecext = extractelement <4 x i32> %0, i32 3
		ret i32 %vecext
		; CHECK-LABEL: @getsi3
		; CHECK: xxsldwi [[SHL:[0-9]+]], 34, 34, 2
		; CHECK: mfvsrwz 3, [[SHL]]
		; CHECK: extsw 3, 3
		; CHECK-LE-LABEL: @getsi3
		; CHECK-LE: xxsldwi [[SHL:[0-9]+]], 34, 34, 3
		; CHECK-LE: mfvsrwz 3, [[SHL]]
		; CHECK-LE: extsw 3, 3
		}

		; Function Attrs: nounwind
		define zeroext i32 @getui0(<4 x i32> %vui) {
		entry:
		%vui.addr = alloca <4 x i32>, align 16
		store <4 x i32> %vui, <4 x i32>* %vui.addr, align 16
		%0 = load <4 x i32>, <4 x i32>* %vui.addr, align 16
		%vecext = extractelement <4 x i32> %0, i32 0
		ret i32 %vecext
		; CHECK-LABEL: @getui0
		; CHECK: xxsldwi [[SHL:[0-9]+]], 34, 34, 3
		; CHECK: mfvsrwz 3, [[SHL]]
		; CHECK: clrldi 3, 3, 32
		; CHECK-LE-LABEL: @getui0
		; CHECK-LE: xxsldwi [[SHL:[0-9]+]], 34, 34, 2
		; CHECK-LE: mfvsrwz 3, [[SHL]]
		; CHECK-LE: clrldi 3, 3, 32
		}

		; Function Attrs: nounwind
		define zeroext i32 @getui1(<4 x i32> %vui) {
		entry:
		%vui.addr = alloca <4 x i32>, align 16
		store <4 x i32> %vui, <4 x i32>* %vui.addr, align 16
		%0 = load <4 x i32>, <4 x i32>* %vui.addr, align 16
		%vecext = extractelement <4 x i32> %0, i32 1
		ret i32 %vecext
		; CHECK-LABEL: @getui1
		; CHECK: mfvsrwz 3, 34
		; CHECK: clrldi 3, 3, 32
		; CHECK-LE-LABEL: @getui1
		; CHECK-LE: xxsldwi [[SHL:[0-9]+]], 34, 34, 1
		; CHECK-LE: mfvsrwz 3, [[SHL]]
		; CHECK-LE: clrldi 3, 3, 32
		}

		; Function Attrs: nounwind
		define zeroext i32 @getui2(<4 x i32> %vui) {
		entry:
		%vui.addr = alloca <4 x i32>, align 16
		store <4 x i32> %vui, <4 x i32>* %vui.addr, align 16
		%0 = load <4 x i32>, <4 x i32>* %vui.addr, align 16
		%vecext = extractelement <4 x i32> %0, i32 2
		ret i32 %vecext
		; CHECK-LABEL: @getui2
		; CHECK: xxsldwi [[SHL:[0-9]+]], 34, 34, 1
		; CHECK: mfvsrwz 3, [[SHL]]
		; CHECK: clrldi 3, 3, 32
		; CHECK-LE-LABEL: @getui2
		; CHECK-LE: mfvsrwz 3, 34
		; CHECK-LE: clrldi 3, 3, 32
		}

		; Function Attrs: nounwind
		define zeroext i32 @getui3(<4 x i32> %vui) {
		entry:
		%vui.addr = alloca <4 x i32>, align 16
		store <4 x i32> %vui, <4 x i32>* %vui.addr, align 16
		%0 = load <4 x i32>, <4 x i32>* %vui.addr, align 16
		%vecext = extractelement <4 x i32> %0, i32 3
		ret i32 %vecext
		; CHECK-LABEL: @getui3
		; CHECK: xxsldwi [[SHL:[0-9]+]], 34, 34, 2
		; CHECK: mfvsrwz 3, [[SHL]]
		; CHECK: clrldi 3, 3, 32
		; CHECK-LE-LABEL: @getui3
		; CHECK-LE: xxsldwi [[SHL:[0-9]+]], 34, 34, 3
		; CHECK-LE: mfvsrwz 3, [[SHL]]
		; CHECK-LE: clrldi 3, 3, 32
		}
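
The `xxsldwi` shift amounts in the word-extraction checks above can be derived from where the element sits in the register and from the fact that `mfvsrwz` reads word 1 (the low word of doubleword 0). The Python sketch below models this under those assumptions; `xxsldwi` and `word_shift` are hypothetical helper names, and words are numbered 0 to 3 from the most significant end of the register.

```python
def xxsldwi(a_words, b_words, shift):
    """Model of xxsldwi: concatenate two 4-word registers, shift left by `shift` words."""
    return (a_words + b_words)[shift:shift + 4]


def word_shift(index, little_endian):
    """Shift that moves word element `index` into register word 1 (assumed model)."""
    # On little endian, element 0 is the least significant word (register word 3).
    register_word = 3 - index if little_endian else index
    return (register_word - 1) % 4


def extract_word(elements, index, little_endian):
    """Place the elements in register order, shift, then read word 1 as mfvsrwz does."""
    reg = list(reversed(elements)) if little_endian else list(elements)
    return xxsldwi(reg, reg, word_shift(index, little_endian))[1]
```

A shift of 0 corresponds to the cases where the checks expect `mfvsrwz 3, 34` with no preceding `xxsldwi`, namely element 1 on big endian and element 2 on little endian.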

		; Function Attrs: nounwind
		define signext i32 @getvelsi(<4 x i32> %vsi, i32 signext %i) {
		entry:
		%vsi.addr = alloca <4 x i32>, align 16
		%i.addr = alloca i32, align 4
		store <4 x i32> %vsi, <4 x i32>* %vsi.addr, align 16
		store i32 %i, i32* %i.addr, align 4
		%0 = load <4 x i32>, <4 x i32>* %vsi.addr, align 16
		%1 = load i32, i32* %i.addr, align 4
		%vecext = extractelement <4 x i32> %0, i32 %1
		ret i32 %vecext
		; CHECK-LABEL: @getvelsi
		; CHECK-LE-LABEL: @getvelsi
		; FIXME: add check patterns when variable element extraction is implemented
		}

		; Function Attrs: nounwind
		define zeroext i32 @getvelui(<4 x i32> %vui, i32 signext %i) {
		entry:
		%vui.addr = alloca <4 x i32>, align 16
		%i.addr = alloca i32, align 4
		store <4 x i32> %vui, <4 x i32>* %vui.addr, align 16
		store i32 %i, i32* %i.addr, align 4
		%0 = load <4 x i32>, <4 x i32>* %vui.addr, align 16
		%1 = load i32, i32* %i.addr, align 4
		%vecext = extractelement <4 x i32> %0, i32 %1
		ret i32 %vecext
		; CHECK-LABEL: @getvelui
		; CHECK-LE-LABEL: @getvelui
		; FIXME: add check patterns when variable element extraction is implemented
		}

		; Function Attrs: nounwind
		define i64 @getsl0(<2 x i64> %vsl) {
		entry:
		%vsl.addr = alloca <2 x i64>, align 16
		store <2 x i64> %vsl, <2 x i64>* %vsl.addr, align 16
		%0 = load <2 x i64>, <2 x i64>* %vsl.addr, align 16
		%vecext = extractelement <2 x i64> %0, i32 0
		ret i64 %vecext
		; CHECK-LABEL: @getsl0
		; CHECK: mfvsrd 3, 34
		; CHECK-LE-LABEL: @getsl0
		; CHECK-LE: xxswapd [[SWP:[0-9]+]], 34
		; CHECK-LE: mfvsrd 3, [[SWP]]
		}

		; Function Attrs: nounwind
		define i64 @getsl1(<2 x i64> %vsl) {
		entry:
		%vsl.addr = alloca <2 x i64>, align 16
		store <2 x i64> %vsl, <2 x i64>* %vsl.addr, align 16
		%0 = load <2 x i64>, <2 x i64>* %vsl.addr, align 16
		%vecext = extractelement <2 x i64> %0, i32 1
		ret i64 %vecext
		; CHECK-LABEL: @getsl1
		; CHECK: xxswapd [[SWP:[0-9]+]], 34
		; CHECK: mfvsrd 3, [[SWP]]
		; CHECK-LE-LABEL: @getsl1
		; CHECK-LE: mfvsrd 3, 34
		}

		; Function Attrs: nounwind
		define i64 @getul0(<2 x i64> %vul) {
		entry:
		%vul.addr = alloca <2 x i64>, align 16
		store <2 x i64> %vul, <2 x i64>* %vul.addr, align 16
		%0 = load <2 x i64>, <2 x i64>* %vul.addr, align 16
		%vecext = extractelement <2 x i64> %0, i32 0
		ret i64 %vecext
		; CHECK-LABEL: @getul0
		; CHECK: mfvsrd 3, 34
		; CHECK-LE-LABEL: @getul0
		; CHECK-LE: xxswapd [[SWP:[0-9]+]], 34
		; CHECK-LE: mfvsrd 3, [[SWP]]
		}

		; Function Attrs: nounwind
		define i64 @getul1(<2 x i64> %vul) {
		entry:
		%vul.addr = alloca <2 x i64>, align 16
		store <2 x i64> %vul, <2 x i64>* %vul.addr, align 16
		%0 = load <2 x i64>, <2 x i64>* %vul.addr, align 16
		%vecext = extractelement <2 x i64> %0, i32 1
		ret i64 %vecext
		; CHECK-LABEL: @getul1
		; CHECK: xxswapd [[SWP:[0-9]+]], 34
		; CHECK: mfvsrd 3, [[SWP]]
		; CHECK-LE-LABEL: @getul1
		; CHECK-LE: mfvsrd 3, 34
		}

		; Function Attrs: nounwind
		define i64 @getvelsl(<2 x i64> %vsl, i32 signext %i) {
		entry:
		%vsl.addr = alloca <2 x i64>, align 16
		%i.addr = alloca i32, align 4
		store <2 x i64> %vsl, <2 x i64>* %vsl.addr, align 16
		store i32 %i, i32* %i.addr, align 4
		%0 = load <2 x i64>, <2 x i64>* %vsl.addr, align 16
		%1 = load i32, i32* %i.addr, align 4
		%vecext = extractelement <2 x i64> %0, i32 %1
		ret i64 %vecext
		; CHECK-LABEL: @getvelsl
		; CHECK-LE-LABEL: @getvelsl
		; FIXME: add check patterns when variable element extraction is implemented
		}

		; Function Attrs: nounwind
		define i64 @getvelul(<2 x i64> %vul, i32 signext %i) {
		entry:
		%vul.addr = alloca <2 x i64>, align 16
		%i.addr = alloca i32, align 4
		store <2 x i64> %vul, <2 x i64>* %vul.addr, align 16
		store i32 %i, i32* %i.addr, align 4
		%0 = load <2 x i64>, <2 x i64>* %vul.addr, align 16
		%1 = load i32, i32* %i.addr, align 4
		%vecext = extractelement <2 x i64> %0, i32 %1
		ret i64 %vecext
		; CHECK-LABEL: @getvelul
		; CHECK-LE-LABEL: @getvelul
		; FIXME: add check patterns when variable element extraction is implemented
		}

		; Function Attrs: nounwind
		define float @getf0(<4 x float> %vf) {
		entry:
		%vf.addr = alloca <4 x float>, align 16
		store <4 x float> %vf, <4 x float>* %vf.addr, align 16
		%0 = load <4 x float>, <4 x float>* %vf.addr, align 16
		%vecext = extractelement <4 x float> %0, i32 0
		ret float %vecext
		; CHECK-LABEL: @getf0
		; CHECK: xscvspdpn 1, 34
		; CHECK-LE-LABEL: @getf0
		; CHECK-LE: xxsldwi [[SHL:[0-9]+]], 34, 34, 3
		; CHECK-LE: xscvspdpn 1, [[SHL]]
		}

		; Function Attrs: nounwind
		define float @getf1(<4 x float> %vf) {
		entry:
		%vf.addr = alloca <4 x float>, align 16
		store <4 x float> %vf, <4 x float>* %vf.addr, align 16
		%0 = load <4 x float>, <4 x float>* %vf.addr, align 16
		%vecext = extractelement <4 x float> %0, i32 1
		ret float %vecext
		; CHECK-LABEL: @getf1
		; CHECK: xxsldwi [[SHL:[0-9]+]], 34, 34, 1
		; CHECK: xscvspdpn 1, [[SHL]]
		; CHECK-LE-LABEL: @getf1
		; CHECK-LE: xxsldwi [[SHL:[0-9]+]], 34, 34, 2
		; CHECK-LE: xscvspdpn 1, [[SHL]]
		}

		; Function Attrs: nounwind
		define float @getf2(<4 x float> %vf) {
		entry:
		%vf.addr = alloca <4 x float>, align 16
		store <4 x float> %vf, <4 x float>* %vf.addr, align 16
		%0 = load <4 x float>, <4 x float>* %vf.addr, align 16
		%vecext = extractelement <4 x float> %0, i32 2
		ret float %vecext
		; CHECK-LABEL: @getf2
		; CHECK: xxsldwi [[SHL:[0-9]+]], 34, 34, 2
		; CHECK: xscvspdpn 1, [[SHL]]
		; CHECK-LE-LABEL: @getf2
		; CHECK-LE: xxsldwi [[SHL:[0-9]+]], 34, 34, 1
		; CHECK-LE: xscvspdpn 1, [[SHL]]
		}

		; Function Attrs: nounwind
		define float @getf3(<4 x float> %vf) {
		entry:
		%vf.addr = alloca <4 x float>, align 16
		store <4 x float> %vf, <4 x float>* %vf.addr, align 16
		%0 = load <4 x float>, <4 x float>* %vf.addr, align 16
		%vecext = extractelement <4 x float> %0, i32 3
		ret float %vecext
		; CHECK-LABEL: @getf3
		; CHECK: xxsldwi [[SHL:[0-9]+]], 34, 34, 3
		; CHECK: xscvspdpn 1, [[SHL]]
		; CHECK-LE-LABEL: @getf3
		; CHECK-LE: xscvspdpn 1, 34
		}

		; Function Attrs: nounwind
		define float @getvelf(<4 x float> %vf, i32 signext %i) {
		entry:
		%vf.addr = alloca <4 x float>, align 16
		%i.addr = alloca i32, align 4
		store <4 x float> %vf, <4 x float>* %vf.addr, align 16
		store i32 %i, i32* %i.addr, align 4
		%0 = load <4 x float>, <4 x float>* %vf.addr, align 16
		%1 = load i32, i32* %i.addr, align 4
		%vecext = extractelement <4 x float> %0, i32 %1
		ret float %vecext
		; CHECK-LABEL: @getvelf
		; CHECK-LE-LABEL: @getvelf
		; FIXME: add check patterns when variable element extraction is implemented
		}

		; Function Attrs: nounwind
		define double @getd0(<2 x double> %vd) {
		entry:
		%vd.addr = alloca <2 x double>, align 16
		store <2 x double> %vd, <2 x double>* %vd.addr, align 16
		%0 = load <2 x double>, <2 x double>* %vd.addr, align 16
		%vecext = extractelement <2 x double> %0, i32 0
		ret double %vecext
		; CHECK-LABEL: @getd0
		; CHECK: xxlor 1, 34, 34
		; CHECK-LE-LABEL: @getd0
		; CHECK-LE: xxswapd 1, 34
		}

		; Function Attrs: nounwind
		define double @getd1(<2 x double> %vd) {
		entry:
		%vd.addr = alloca <2 x double>, align 16
		store <2 x double> %vd, <2 x double>* %vd.addr, align 16
		%0 = load <2 x double>, <2 x double>* %vd.addr, align 16
		%vecext = extractelement <2 x double> %0, i32 1
		ret double %vecext
		; CHECK-LABEL: @getd1
		; CHECK: xxswapd 1, 34
		; CHECK-LE-LABEL: @getd1
		; CHECK-LE: xxlor 1, 34, 34
		}

		; Function Attrs: nounwind
		define double @getveld(<2 x double> %vd, i32 signext %i) {
		entry:
		%vd.addr = alloca <2 x double>, align 16
		%i.addr = alloca i32, align 4
		store <2 x double> %vd, <2 x double>* %vd.addr, align 16
		store i32 %i, i32* %i.addr, align 4
		%0 = load <2 x double>, <2 x double>* %vd.addr, align 16
		%1 = load i32, i32* %i.addr, align 4
		%vecext = extractelement <2 x double> %0, i32 %1
		ret double %vecext
		; CHECK-LABEL: @getveld
		; CHECK-LE-LABEL: @getveld
		; FIXME: add check patterns when variable element extraction is implemented
		}