This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Increase the number of VLREPs
Closed, Public

Authored by jonpa on Nov 8 2018, 7:26 AM.

Details

Reviewers

Summary

As I experimented with making an fp-vector element 0 insert free (given that the "vector merge high" instruction is available, which takes two input vectors and combines them into one), I ran into a regression. In the end, it seemed that the SLP vectorizer was doing one more vectorization (4 instead of 3) in a function, which ended up making the vector operands significantly more numerous. This would probably be worth looking into at some point.

However, one of the things I also noticed was many VL64 -> VREPG instructions where this could have seemingly been just VLREPG. IIUC, SystemZ buildVector() creates a REPLICATE node if there is a loaded element, with the intent of this becoming VLREP. It seems that this folding does not happen if the load has more than one user.

To remedy this, I experimented with a patch that handles these cases by putting those other users of the load to use the REPLICATE 0-element instead of the load. This way, the load has only the REPLICATE node as user, and we get a VLREP.
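The rewrite described above can be illustrated outside of LLVM with a toy graph. This is only a sketch under simplifying assumptions: it does not use SelectionDAG, it ignores chains, and `Node`, `usersOf`, `rerouteThroughReplicate` and the demo helper are hypothetical names invented for the illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for a DAG node (hypothetical; not LLVM's SDNode).
struct Node {
  std::string Opcode;
  std::vector<Node *> Operands;
};

// Collect every node in Nodes that has N as an operand.
static std::vector<Node *> usersOf(const std::vector<Node *> &Nodes,
                                   const Node *N) {
  std::vector<Node *> Users;
  for (Node *U : Nodes)
    for (Node *Op : U->Operands)
      if (Op == N) {
        Users.push_back(U);
        break;
      }
  return Users;
}

// If Load feeds exactly one REPLICATE plus at least one other user, make the
// other users read element 0 of the REPLICATE (via Extract0) instead of the
// load. Returns true if the graph was rewritten.
static bool rerouteThroughReplicate(const std::vector<Node *> &Nodes,
                                    Node *Load, Node *Extract0) {
  Node *Replicate = nullptr;
  std::vector<Node *> OtherUses;
  for (Node *U : usersOf(Nodes, Load)) {
    if (U->Opcode == "REPLICATE") {
      if (Replicate)
        return false; // more than one REPLICATE: give up
      Replicate = U;
    } else {
      OtherUses.push_back(U);
    }
  }
  if (!Replicate || OtherUses.empty())
    return false;

  Extract0->Opcode = "EXTRACT_ELT0";
  Extract0->Operands = {Replicate};
  for (Node *U : OtherUses)
    for (Node *&Op : U->Operands)
      if (Op == Load)
        Op = Extract0;
  return true;
}

// Builds LOAD -> {REPLICATE, FMUL} and checks that after the rewrite the
// load's only user is the REPLICATE.
static bool demoLoadHasSingleUserAfterRewrite() {
  Node Load{"LOAD", {}};
  Node Other{"CONST", {}};
  Node Rep{"REPLICATE", {&Load}};
  Node FMul{"FMUL", {&Load, &Other}};
  Node Extract0;
  std::vector<Node *> Nodes = {&Load, &Other, &Rep, &FMul};
  if (!rerouteThroughReplicate(Nodes, &Load, &Extract0))
    return false;
  std::vector<Node *> Users = usersOf(Nodes, &Load);
  return Users.size() == 1 && Users[0] == &Rep &&
         FMul.Operands[0] == &Extract0;
}
```

Once the load has the REPLICATE as its only user, the load + replicate pair is in the shape that the summary says is needed for the fold into VLREP.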

So far, I have only looked at my test case, which was floating point (calculix/Utilities_DV), and I am not sure of all the implications. It just seems better to get VLREP in more cases... For the files that get a different opcode count compared to trunk on SPEC, see

vlrep_opcounts.txt (26 KB)

I think the regression I saw during experiments disappeared mostly with this fix. Not sure about the performance effects generally yet.

Diff Detail

Event Timeline

jonpa created this revision. Nov 8 2018, 7:26 AM

I'm not sure that this is always a win. In fact, this may depend on the type of the loaded value.

For floating point values (which can typically be re-used directly from the element 0 slot of the replicated VR), I agree that this approach looks useful. But for integer values (which will then need to be moved to a GR), there seem to be drawbacks.

Looking at the first example in your list, I see:

vlvgp          :                    9                    0       -9
vlgvf          :                    0                    9       +9
l              :                  166                  157       -9
vrepf          :                   11                    2       -9
vlrepf         :                    0                    9       +9

This appears to indicate that we have a 32-bit integer value that is being used both as scalar in a GPR, and as replicated vector in a VR. The original code would have been something like

L -> GPR -> use GPR
         -> VLVGP -> VR -> VREPF -> use VR

This is a total latency for the GPR use of 4 cycles, and a latency of 9 cycles for the VR use.

With the new approach, we have instead

VLREPF -> VR -> use VR
             -> VLGVF -> GPR -> use GPR

The total latency of the VR use goes down to 4 cycles, but the latency of the GPR use is now 8 cycles.

While this may be an overall win in many cases, it can also hurt in those cases where the overall performance is bounded more by the latency of the GPR use chain.
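The latency comparison above can be restated as a small calculation. The per-instruction values below are not taken from a processor manual; they are back-computed from the totals quoted in this comment (4 and 9 cycles on trunk, 8 and 4 cycles with the patch), and the constant and function names are made up for the illustration.

```cpp
#include <cassert>

// Latency, in cycles, until each kind of use can start.
struct Latencies {
  int GPRUse;
  int VRUse;
};

// Assumed values, derived only from the totals quoted in the review.
constexpr int LoadToGPR = 4;          // L
constexpr int InsertAndReplicate = 5; // VLVGP + VREPF combined (9 - 4)
constexpr int LoadReplicate = 4;      // VLREPF
constexpr int ExtractToGPR = 4;       // VLGVF (8 - 4)

// Trunk: L -> GPR, then VLVGP -> VREPF for the vector use.
constexpr Latencies trunkSequence() {
  return {LoadToGPR, LoadToGPR + InsertAndReplicate};
}

// Patched: VLREPF -> VR, then VLGVF for the scalar use.
constexpr Latencies patchedSequence() {
  return {LoadReplicate + ExtractToGPR, LoadReplicate};
}
```

Under these assumptions the vector use gets 5 cycles faster while the scalar use gets 4 cycles slower, which is why the outcome depends on which chain is critical.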

As an aside, for the case of a 64-bit integer value, I think the sequence LG -> VLVGP (which implicitly does the replicate already) seems to be preferable always.

Thanks for review!

After the initial feedback, I began to update the patch in the direction of never introducing any new VLGVs (extracts to GPRs). I added checks for cases where there was another scalar use of the load, such as an address register or a scalar operation. There were still some cases left, which were i16/i32 elements that got transferred from the new VLREP via a GPR (to another vector register) instead of via VPERM. This surprised me, and I suspect that there is an issue here to fix, given that a VPERM should be better than a pair of VLGVF + VLVGF even though it includes a load of the byte mask, shouldn't it? There were only a few more VLREPs produced in the i16/i32 cases, so I decided to simply avoid this.

There was then no difference on SPEC code generation between handling just i64 minus scalar uses, or handling no integer loads at all, so I simply removed the integer handling to only do this for floating point.

This seems now to generally be useful with some 200 more VLREPs on SPEC. There are also some (2?) files where the load seems to be used in multiple blocks and so there are some minor changes in CodeGen. In one case it is the Control Flow Optimizer which now can no longer merge identical code. I am guessing that this is generally beneficial still, since I have seen a case where a small loop got many more instructions just because of this problem (although that was with another patch applied).

For new opcode diffs per file, see:

opcounts_VLREP.txt (9 KB)

Summary for SPEC:

opcounts_vlrep_summary.txt (1 KB)

This seems reasonable in principle to me now, but see inline comments.

lib/Target/SystemZ/SystemZISelLowering.cpp
5387	What exactly are you trying to check here? As far as I can see, UseVT is always equal to LdVT, right? (It's the type of UI's "Val", which is the value of N ...) Why not simply check above that the result type of the load is a floating-point type, and then be done with it?
5388	UserResultVT never appears to be used.
5390	What exactly ensures that there cannot be a second REPLICATE use? You'd be causing an internal compiler error here if that actually does occur ... If this is just something that this optimization cannot handle, you should gracefully fail here instead of asserting.
5411	Again, I'm not sure what this type check is supposed to achieve. (Also, why isInteger?)

Updated per review.

lib/Target/SystemZ/SystemZISelLowering.cpp
5387	UseVT here is the type of the used SDValue, so I think this could be i1 for a chain. I think that chains should be ignored / preserved in this method: If there are only other memory accesses chained after this load, I think VLREP will still happen, even though there are other "users". Don't change the chain edge to the REPLICATE below.
5388	Ah, right - I can remove that now also since I removed all the i64 checks.
5390	I suppose I don't really need that check anymore. It was just making sure that I understood correctly that there really should just be one REPLICATE of the same load, since DAG.getNode() should return the same node for the second query. It never happened so far, so I think I could remove it, since it wouldn't be disastrous to REPLICATE again an already VLREP:ed load. Should I remove any other of the asserts as well?
5411	See above about preserving Chains. And for the isInteger(), I had also forgotten to remove that after removing the i64 checks, sorry. Is the IsDataUse variable ok, or would you rather just eliminate it?

uweigand added inline comments. Nov 12 2018, 8:03 AM

lib/Target/SystemZ/SystemZISelLowering.cpp
5387	I see. If you simply want to distinguish between the value and chain results of the load, wouldn't a check along the lines of UI.getResNo() == 0 be more straightforward?
5390	Just removing the assert doesn't feel correct, you now don't even add the use to the OtherUses list. If that's OK, you should do that (or otherwise, just fail gracefully). The other asserts can probably be just removed.
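The distinction between the value and chain results suggested above can be pictured with a toy model. This is a sketch only: `Use`, `UserId` and `valueUsers` are hypothetical names, and the convention that result 0 is the loaded value and result 1 is the chain is assumed here for a plain load.

```cpp
#include <cassert>
#include <vector>

// Toy model of a use edge: which user consumes which result of the load.
struct Use {
  int UserId;     // hypothetical identifier of the using node
  unsigned ResNo; // 0 = loaded value, 1 = chain (assumed convention)
};

// Keep only the users of the loaded value; chain users are left alone.
static std::vector<int> valueUsers(const std::vector<Use> &Uses) {
  std::vector<int> Out;
  for (const Use &U : Uses)
    if (U.ResNo == 0)
      Out.push_back(U.UserId);
  return Out;
}
```

Filtering on the result number avoids comparing value types and leaves the chain edges untouched.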

NFC update per review.

lib/Target/SystemZ/SystemZISelLowering.cpp
5387	right.
5390	I think this should not ever happen, so I just added a check to fail if it ever did. I still think the DAG would be legal with two REPLICATEs, where one of them takes on the scalar users, but maybe this is safest if something weird like that happens...

LGTM, thanks.

This revision is now accepted and ready to land. Nov 12 2018, 8:49 AM

Thanks for review. r346746

Revision Contents

Path                                          Size
lib/Target/SystemZ/SystemZISelLowering.h      1 line
lib/Target/SystemZ/SystemZISelLowering.cpp    42 lines
test/CodeGen/SystemZ/vec-move-21.ll           56 lines
test/CodeGen/SystemZ/vec-move-22.ll           15 lines

Diff 173686

lib/Target/SystemZ/SystemZISelLowering.h

@@ ... @@ SDValue combineExtract(const SDLoc &DL, EVT ElemVT, EVT VecVT, SDValue OrigOp,
                          unsigned Index, DAGCombinerInfo &DCI,
                          bool Force) const;
   SDValue combineTruncateExtract(const SDLoc &DL, EVT TruncVT, SDValue Op,
                                  DAGCombinerInfo &DCI) const;
   SDValue combineZERO_EXTEND(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue combineSIGN_EXTEND(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue combineSIGN_EXTEND_INREG(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue combineMERGE(SDNode *N, DAGCombinerInfo &DCI) const;
+  SDValue combineLOAD(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue combineSTORE(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue combineEXTRACT_VECTOR_ELT(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue combineJOIN_DWORDS(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue combineFP_ROUND(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue combineBSWAP(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue combineBR_CCMASK(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue combineSELECT_CCMASK(SDNode *N, DAGCombinerInfo &DCI) const;
   SDValue combineGET_CCMASK(SDNode *N, DAGCombinerInfo &DCI) const;

lib/Target/SystemZ/SystemZISelLowering.cpp

@@ ... @@ SystemZTargetLowering::SystemZTargetLowering(const TargetMachine &TM,
   setOperationAction(ISD::VASTART, MVT::Other, Custom);
   setOperationAction(ISD::VACOPY, MVT::Other, Custom);
   setOperationAction(ISD::VAEND, MVT::Other, Expand);

   // Codes for which we want to perform some z-specific combinations.
   setTargetDAGCombine(ISD::ZERO_EXTEND);
   setTargetDAGCombine(ISD::SIGN_EXTEND);
   setTargetDAGCombine(ISD::SIGN_EXTEND_INREG);
+  setTargetDAGCombine(ISD::LOAD);
   setTargetDAGCombine(ISD::STORE);
   setTargetDAGCombine(ISD::EXTRACT_VECTOR_ELT);
   setTargetDAGCombine(ISD::FP_ROUND);
   setTargetDAGCombine(ISD::BSWAP);
   setTargetDAGCombine(ISD::SDIV);
   setTargetDAGCombine(ISD::UDIV);
   setTargetDAGCombine(ISD::SREM);
   setTargetDAGCombine(ISD::UREM);
@@ ... @@ if (ElemBytes <= 4) {
       SDValue Op = DAG.getNode(Opcode, SDLoc(N), OutVT, Op1);
       DCI.AddToWorklist(Op.getNode());
       return DAG.getNode(ISD::BITCAST, SDLoc(N), VT, Op);
     }
   }
   return SDValue();
 }

+SDValue SystemZTargetLowering::combineLOAD(
+    SDNode *N, DAGCombinerInfo &DCI) const {
+  SelectionDAG &DAG = DCI.DAG;
+  EVT LdVT = N->getValueType(0);
+  if (LdVT.isVector() || LdVT.isInteger())
+    return SDValue();
+  // Transform a scalar load that is REPLICATEd as well as having other
+  // use(s) to the form where the other use(s) use the first element of the
+  // REPLICATE instead of the load. Otherwise instruction selection will not
+  // produce a VLREP. Avoid extracting to a GPR, so only do this for floating
+  // point loads.
+
+  SDValue Replicate;
+  SmallVector<SDNode*, 8> OtherUses;
+  for (SDNode::use_iterator UI = N->use_begin(), UE = N->use_end();
+       UI != UE; ++UI) {
+    if (UI->getOpcode() == SystemZISD::REPLICATE) {
+      if (Replicate)
+        return SDValue(); // Should never happen
+      Replicate = SDValue(*UI,0);
+    }
+    else if (UI.getUse().getResNo() == 0)
+      OtherUses.push_back(*UI);
+  }
+  if (!Replicate || OtherUses.empty())
+    return SDValue();
+
+  SDLoc DL(N);
+  SDValue Extract0 = DAG.getNode(ISD::EXTRACT_VECTOR_ELT, DL, LdVT,
+                                 Replicate, DAG.getConstant(0, DL, MVT::i32));
+  // Update uses of the loaded Value while preserving old chains.
+  for (SDNode *U : OtherUses) {
+    SmallVector<SDValue, 8> Ops;
+    for (SDValue Op : U->ops())
+      Ops.push_back((Op.getNode() == N && Op.getResNo() == 0) ? Extract0 : Op);
+    DAG.UpdateNodeOperands(U, Ops);
+  }
+  return SDValue(N, 0);
+}
+
 SDValue SystemZTargetLowering::combineSTORE(
     SDNode *N, DAGCombinerInfo &DCI) const {
   SelectionDAG &DAG = DCI.DAG;
   auto *SN = cast<StoreSDNode>(N);
   auto &Op1 = N->getOperand(1);
   EVT MemVT = SN->getMemoryVT();
   // If we have (truncstoreiN (extract_vector_elt X, Y), Z) then it is better
   // for the extraction to be done on a vMiN value, so that we can use VSTE.
@@ ... @@ SDValue SystemZTargetLowering::PerformDAGCombine(SDNode *N,
                                                  DAGCombinerInfo &DCI) const {
   switch(N->getOpcode()) {
   default: break;
   case ISD::ZERO_EXTEND: return combineZERO_EXTEND(N, DCI);
   case ISD::SIGN_EXTEND: return combineSIGN_EXTEND(N, DCI);
   case ISD::SIGN_EXTEND_INREG: return combineSIGN_EXTEND_INREG(N, DCI);
   case SystemZISD::MERGE_HIGH:
   case SystemZISD::MERGE_LOW: return combineMERGE(N, DCI);
+  case ISD::LOAD: return combineLOAD(N, DCI);
   case ISD::STORE: return combineSTORE(N, DCI);
   case ISD::EXTRACT_VECTOR_ELT: return combineEXTRACT_VECTOR_ELT(N, DCI);
   case SystemZISD::JOIN_DWORDS: return combineJOIN_DWORDS(N, DCI);
   case ISD::FP_ROUND: return combineFP_ROUND(N, DCI);
   case ISD::BSWAP: return combineBSWAP(N, DCI);
   case SystemZISD::BR_CCMASK: return combineBR_CCMASK(N, DCI);
   case SystemZISD::SELECT_CCMASK: return combineSELECT_CCMASK(N, DCI);
   case SystemZISD::GET_CCMASK: return combineGET_CCMASK(N, DCI);

test/CodeGen/SystemZ/vec-move-21.ll

This file was added.

				; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z13 | FileCheck %s

				; Test that a replicate of a load gets folded to vlrep also in cases where
				; the load has multiple users.

				; CHECK-NOT: vrep


				define double @fun(double* %Vsrc, <2 x double> %T) {
				entry:
				%Vgep1 = getelementptr double, double* %Vsrc, i64 0
				%Vld1 = load double, double* %Vgep1
				%Vgep2 = getelementptr double, double* %Vsrc, i64 1
				%Vld2 = load double, double* %Vgep2
				%Vgep3 = getelementptr double, double* %Vsrc, i64 2
				%Vld3 = load double, double* %Vgep3
				%Vgep4 = getelementptr double, double* %Vsrc, i64 3
				%Vld4 = load double, double* %Vgep4
				%Vgep5 = getelementptr double, double* %Vsrc, i64 4
				%Vld5 = load double, double* %Vgep5
				%Vgep6 = getelementptr double, double* %Vsrc, i64 5
				%Vld6 = load double, double* %Vgep6

				%V19 = insertelement <2 x double> undef, double %Vld1, i32 0
				%V20 = shufflevector <2 x double> %V19, <2 x double> undef, <2 x i32> zeroinitializer
				%V21 = insertelement <2 x double> undef, double %Vld4, i32 0
				%V22 = insertelement <2 x double> %V21, double %Vld5, i32 1
				%V23 = fmul <2 x double> %V20, %V22
				%V24 = fadd <2 x double> %T, %V23
				%V25 = insertelement <2 x double> %V19, double %Vld2, i32 1
				%V26 = insertelement <2 x double> undef, double %Vld6, i32 0
				%V27 = insertelement <2 x double> %V26, double %Vld6, i32 1
				%V28 = fmul <2 x double> %V25, %V27
				%V29 = fadd <2 x double> %T, %V28
				%V30 = insertelement <2 x double> undef, double %Vld2, i32 0
				%V31 = shufflevector <2 x double> %V30, <2 x double> undef, <2 x i32> zeroinitializer
				%V32 = insertelement <2 x double> undef, double %Vld5, i32 0
				%V33 = insertelement <2 x double> %V32, double %Vld6, i32 1
				%V34 = fmul <2 x double> %V31, %V33
				%V35 = fadd <2 x double> %T, %V34
				%V36 = insertelement <2 x double> undef, double %Vld3, i32 0
				%V37 = shufflevector <2 x double> %V36, <2 x double> undef, <2 x i32> zeroinitializer
				%V38 = fmul <2 x double> %V37, %V22
				%V39 = fadd <2 x double> %T, %V38
				%Vmul37 = fmul double %Vld3, %Vld6
				%Vadd38 = fadd double %Vmul37, %Vmul37

				%VA0 = fadd <2 x double> %V24, %V29
				%VA1 = fadd <2 x double> %VA0, %V35
				%VA2 = fadd <2 x double> %VA1, %V39

				%VE0 = extractelement <2 x double> %VA2, i32 0
				%VS1 = fadd double %VE0, %Vadd38

				ret double %VS1
				}

test/CodeGen/SystemZ/vec-move-22.ll

This file was added.

				; RUN: llc < %s -mtriple=s390x-linux-gnu -mcpu=z13 | FileCheck %s

				; Test that a loaded value which is used both in a vector and scalar context
				; is not transformed to a vlrep + vlgvg.

				; CHECK-NOT: vlrep

				define void @fun(i64 %arg, i64** %Addr, <2 x i64*>* %Dst) {
				%tmp10 = load i64*, i64** %Addr
				store i64 %arg, i64* %tmp10
				%tmp12 = insertelement <2 x i64*> undef, i64* %tmp10, i32 0
				%tmp13 = insertelement <2 x i64*> %tmp12, i64* %tmp10, i32 1
				store <2 x i64*> %tmp13, <2 x i64*>* %Dst
				ret void
				}