This is an archive of the discontinued LLVM Phabricator instance.


Differential D104782

[AArch64] Custom lower <4 x i8> loads
ClosedPublic

Authored by SjoerdMeijer on Jun 23 2021, 6:33 AM.


Details

Reviewers

dmgreen
efriedma
fhahn
zatrazz
t.p.northover

Commits

rG51e434fc2590: [AArch64] Custom lower <4 x i8> loads

Summary

This custom lowers <4 x i8> vector loads using a 32-bit load, followed by 2 SSHLL instructions to extend it to a <4 x i32> vector. Before, it was really inefficient and expensive to construct a <4 x i32> for this as 4 byte loads and 4 moves were used. With this improvement SLP vectorisation might for example become profitable, see D103629.
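The lowering described in the summary can be modelled outside LLVM. The following is a minimal sketch, not the patch itself: the function names `sextLoadBytewise`, `sextLoadWide`, and `sext8` are hypothetical, and the two widening loops only stand in for the two SSHLL instructions. It checks that one 32-bit load followed by two widening steps yields the same lanes as four sign-extending byte loads.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Portable sign-extension of the low 8 bits of v (hypothetical helper).
inline int32_t sext8(uint32_t v) {
  return static_cast<int32_t>((v & 0xffu) ^ 0x80u) - 0x80;
}

// Old shape: four independent sign-extending byte loads (like 4x ldrsb),
// each result then moved into its own 32-bit lane.
std::array<int32_t, 4> sextLoadBytewise(const uint8_t *p) {
  std::array<int32_t, 4> out{};
  for (int i = 0; i < 4; ++i)
    out[i] = sext8(p[i]);
  return out;
}

// New shape: a single 32-bit load, then two widening steps.
std::array<int32_t, 4> sextLoadWide(const uint8_t *p) {
  uint32_t word;
  std::memcpy(&word, p, sizeof word);              // like: ldr s0, [x0]

  std::array<uint8_t, 4> lanes8{};
  std::memcpy(lanes8.data(), &word, sizeof word);  // view the word as byte lanes

  std::array<int16_t, 4> lanes16{};
  for (int i = 0; i < 4; ++i)                      // like: sshll v0.8h, v0.8b, #0
    lanes16[i] = static_cast<int16_t>(sext8(lanes8[i]));

  std::array<int32_t, 4> lanes32{};
  for (int i = 0; i < 4; ++i)                      // like: sshll v0.4s, v0.4h, #0
    lanes32[i] = lanes16[i];
  return lanes32;
}
```

Calling both functions on the same four bytes should give identical arrays; the model deliberately ignores big-endian lane ordering, which the review discusses separately.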

Diff Detail

Event Timeline

SjoerdMeijer created this revision. Jun 23 2021, 6:33 AM

Herald added subscribers: danielkiss, hiraditya, kristof.beyls. Jun 23 2021, 6:33 AM

SjoerdMeijer requested review of this revision. Jun 23 2021, 6:33 AM

Herald added a project: Restricted Project. Jun 23 2021, 6:33 AM

Harbormaster completed remote builds in B110614: Diff 353947. Jun 23 2021, 6:33 AM

SjoerdMeijer added inline comments. Jun 23 2021, 6:35 AM

llvm/test/CodeGen/AArch64/neon-extload.ll
38	I am trying to remember how big-endian works in LLVM, but since I noticed these reverse here, this looked okay'ish to me, but I haven't tested BE. Any opinions on this welcome (while I look a bit more at this)....

fhahn added a reviewer: t.p.northover. Jun 23 2021, 7:40 AM

dmgreen added inline comments. Jun 23 2021, 7:50 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1129	What about v4i16 as well? And EXTLOAD (which is probably fine to treat as a ZEXTLOAD).
4485	It may be worth checking or asserting that the type is v4i32/v4i16. Also DL is more common.
4506	SIGN_EXTEND/ZERO_EXTEND do not need a second VT argument, I don't believe.
4510	ISD::SIGN_EXTEND > ExtType?

Missing testcases for load+ext to <4 x i16>.
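As background for the suggestion to treat EXTLOAD like ZEXTLOAD: an any-extending load leaves the high bits unspecified, so zero-filling them is one valid choice, and a later in-register sign extension (the `shl`/`sshr` pair visible in the anyext tests of this revision) only depends on the low 8 bits. The sketch below is a scalar model under that assumption; `signExtendLow8`, `addThenSext`, `zextAddThenSextInReg`, and `pathsAgreeForAllInputs` are hypothetical names, not LLVM APIs.

```cpp
#include <cassert>
#include <cstdint>

// Portable sign-extension of the low 8 bits of v (hypothetical helper).
// This is the value a `shl #8` followed by `sshr #8` leaves in a 16-bit lane.
inline int signExtendLow8(unsigned v) {
  return static_cast<int>((v & 0xffu) ^ 0x80u) - 0x80;
}

// Path A: IR semantics of the test: add in i8, then sext to i16.
int16_t addThenSext(uint8_t a, uint8_t b) {
  unsigned sum8 = (static_cast<unsigned>(a) + b) & 0xffu;  // i8 add wraps
  return static_cast<int16_t>(signExtendLow8(sum8));
}

// Path B: model of the generated code: zero-extend both inputs (ushll),
// add in 16-bit lanes, then sign-extend the low 8 bits in-register.
int16_t zextAddThenSextInReg(uint8_t a, uint8_t b) {
  uint16_t wa = a;                                    // ushll
  uint16_t wb = b;                                    // ushll
  uint16_t sum16 = static_cast<uint16_t>(wa + wb);    // add v0.4h
  return static_cast<int16_t>(signExtendLow8(sum16)); // shl #8 ; sshr #8
}

// Exhaustively compare both paths over every pair of byte inputs.
bool pathsAgreeForAllInputs() {
  for (unsigned a = 0; a < 256; ++a)
    for (unsigned b = 0; b < 256; ++b)
      if (addThenSext(static_cast<uint8_t>(a), static_cast<uint8_t>(b)) !=
          zextAddThenSextInReg(static_cast<uint8_t>(a), static_cast<uint8_t>(b)))
        return false;
  return true;
}
```

Because the final step discards everything above bit 7, the carry that the 16-bit add produces into bit 8 does not affect the result, which is why the choice of extension for the load does not matter here.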

llvm/test/CodeGen/AArch64/neon-extload.ll
38	Looks fine to me. The rev32 comes out of lowering the ISD::BITCAST.
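One way to picture why a `rev32` is needed on big-endian targets is the simplified model below. It assumes that a 32-bit scalar load on a big-endian target places the byte at the lowest address in the most significant position, and that byte lane 0 of a register is its least significant byte; both are modelling assumptions for illustration, and `loadWordBigEndian`, `byteLanes`, and `rev32Bytes` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Model of `ldr s0, [x0]` on a big-endian target: the byte at the lowest
// address ends up in the most significant position of the 32-bit value.
uint32_t loadWordBigEndian(const uint8_t *p) {
  return (static_cast<uint32_t>(p[0]) << 24) |
         (static_cast<uint32_t>(p[1]) << 16) |
         (static_cast<uint32_t>(p[2]) << 8) |
         static_cast<uint32_t>(p[3]);
}

// View a 32-bit register value as byte lanes; lane 0 is the low byte.
std::array<uint8_t, 4> byteLanes(uint32_t reg) {
  std::array<uint8_t, 4> lanes{};
  for (int i = 0; i < 4; ++i)
    lanes[i] = static_cast<uint8_t>(reg >> (8 * i));
  return lanes;
}

// Model of `rev32 v0.8b, v0.8b` restricted to one 32-bit element:
// reverse the order of the four bytes.
uint32_t rev32Bytes(uint32_t reg) {
  return (reg >> 24) | ((reg >> 8) & 0x0000ff00u) |
         ((reg << 8) & 0x00ff0000u) | (reg << 24);
}
```

In this model the lanes come out in reverse memory order right after the load, and applying `rev32Bytes` puts the byte from the lowest address back into lane 0, which matches the instruction sequence in the BE check lines.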

SjoerdMeijer added inline comments. Jun 24 2021, 1:00 AM

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
1129	I thought the v4i16 were already handled, but will double check and precommit some tests for this (the new test in this patch, extended with v4i16 cases) if we don't have them already. Yeah, I thought about EXTLOAD, but wasn't sure how to trigger this, but will look into this.
llvm/test/CodeGen/AArch64/neon-extload.ll
38	Thanks for confirming!

SjoerdMeijer mentioned this in rGc74aea466343: [AArch64] Precommit extending load tests for D104782. NFC. Jun 24 2021, 8:00 AM

Matt added a subscriber: Matt. Jun 24 2021, 10:56 AM

Address comments.

SjoerdMeijer added inline comments. Jun 24 2021, 11:41 AM

llvm/test/CodeGen/AArch64/neon-extload.ll
52	Ahh, just spotted that this is a regression. Will look into this.

LGTM

llvm/test/CodeGen/AArch64/neon-extload.ll
52	I'm not really concerned; IR-level optimizations should catch this.

This revision is now accepted and ready to land. Jun 24 2021, 11:56 AM

Fixed that regression by checking whether there is exactly one use and that it is a vector_extract_elt. But I can remove it if you think this is not necessary.

I don't really care either way...

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
4497	`Op.hasOneUse()` is very different from `Op->hasOneUse()`.
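The distinction matters for loads because a load node produces two results, the loaded value and the chain. In LLVM, `SDValue::hasOneUse()` counts uses of one specific result, while `Op->hasOneUse()` goes through `operator->` to the node and counts uses of the node as a whole. The toy types below are hypothetical stand-ins (not the real `SDNode`/`SDValue` classes) that sketch this difference.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for an SDNode: stores, for every use of the node,
// which result number that use refers to (0 = loaded value, 1 = chain).
struct ToyNode {
  std::vector<unsigned> useResNos;

  // Node-level query: exactly one use of the node, whatever the result.
  bool hasOneUse() const { return useResNos.size() == 1; }

  // Exactly n uses of the given result number.
  bool hasNUsesOfValue(unsigned n, unsigned resNo) const {
    unsigned count = 0;
    for (unsigned r : useResNos)
      if (r == resNo)
        ++count;
    return count == n;
  }
};

// Hypothetical stand-in for an SDValue: a (node, result number) pair.
struct ToyValue {
  const ToyNode *node;
  unsigned resNo;

  // Mirrors how `Op->...` reaches the underlying node.
  const ToyNode *operator->() const { return node; }

  // Value-level query: exactly one use of this particular result.
  bool hasOneUse() const { return node->hasNUsesOfValue(1, resNo); }
};
```

For a load whose value and chain are each used once, the value-level query on result 0 is true while the node-level query is false, so the two spellings can guard very different cases.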

Harbormaster completed remote builds in B110881: Diff 354327. Jun 24 2021, 12:58 PM

In D104782#2839334, @SjoerdMeijer wrote:

Fixed that regression by checking whether there is exactly one use and that it is a vector_extract_elt. But I can remove it if you think this is not necessary.

Yeah I wouldn't worry. There may be other ways to fix it if we need, to do with demanded elements of a load. But this code will likely not come up in practice.

llvm/test/CodeGen/AArch64/neon-extload.ll
49–50	I would perhaps remove this dot, as dots in function names are a little unusual.

Okay, will remove this before committing (and change that function name in that test).

Thanks for the suggestions and help with this work @efriedma and @dmgreen !

This revision was landed with ongoing or failed builds. Jun 25 2021, 1:54 AM

Closed by commit rG51e434fc2590: [AArch64] Custom lower <4 x i8> loads (authored by SjoerdMeijer).

This revision was automatically updated to reflect the committed changes.

SjoerdMeijer added a commit: rG51e434fc2590: [AArch64] Custom lower <4 x i8> loads.

@SjoerdMeijer Since this commit, test-suite::GCC-C-execute-pr60960.test has been failing on our AArch64 bots:
https://lab.llvm.org/buildbot/#/builders/185/builds/40

(we moved them around recently so I think we missed building this commit when it first landed)

https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Regression/C/gcc-c-torture/execute/pr60960.c

In D104782#2844177, @DavidSpickett wrote:

@SjoerdMeijer Since this commit, test-suite::GCC-C-execute-pr60960.test has been failing on our AArch64 bots:
https://lab.llvm.org/buildbot/#/builders/185/builds/40

(we moved them around recently so I think we missed building this commit when it first landed)

https://github.com/llvm/llvm-test-suite/blob/main/SingleSource/Regression/C/gcc-c-torture/execute/pr60960.c

Owwww....... thanks for reporting. I am looking into this. I will do a first finger on the pulse before I revert this since I haven't heard any other complaints, but let me know if you prefer to first revert it.

There is something going on that I will have to look at tomorrow, so will revert this.

SjoerdMeijer added a reverting change: rG3a7cea2858ff: Revert "[AArch64] Custom lower <4 x i8> loads". Jun 28 2021, 9:45 AM

SjoerdMeijer mentioned this in D105110: [AArch64] Fix for custom lowering <4 x i8> loads. Jun 29 2021, 6:13 AM

SjoerdMeijer mentioned this in rGb062fff87adc: Recommit "[AArch64] Custom lower <4 x i8> loads". Jun 30 2021, 1:19 AM

Revision Contents

Path

Size

llvm/

lib/

Target/

AArch64/

AArch64ISelLowering.h

1 line

AArch64ISelLowering.cpp

44 lines

test/

CodeGen/

AArch64/

neon-extload.ll

168 lines

Diff 354320

llvm/lib/Target/AArch64/AArch64ISelLowering.h

[...]	private:

SDValue LowerCallResult(SDValue Chain, SDValue InFlag,		SDValue LowerCallResult(SDValue Chain, SDValue InFlag,
CallingConv::ID CallConv, bool isVarArg,		CallingConv::ID CallConv, bool isVarArg,
const SmallVectorImpl<ISD::InputArg> &Ins,		const SmallVectorImpl<ISD::InputArg> &Ins,
const SDLoc &DL, SelectionDAG &DAG,		const SDLoc &DL, SelectionDAG &DAG,
SmallVectorImpl<SDValue> &InVals, bool isThisReturn,		SmallVectorImpl<SDValue> &InVals, bool isThisReturn,
SDValue ThisVal) const;		SDValue ThisVal) const;

		SDValue LowerLOAD(SDValue Op, SelectionDAG &DAG) const;
		Lint: Pre-merge checks: clang-tidy: warning: invalid case style for function 'LowerLOAD' [readability-identifier-naming]
SDValue LowerSTORE(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerSTORE(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerABS(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerABS(SDValue Op, SelectionDAG &DAG) const;

SDValue LowerMGATHER(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerMGATHER(SDValue Op, SelectionDAG &DAG) const;
SDValue LowerMSCATTER(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerMSCATTER(SDValue Op, SelectionDAG &DAG) const;

SDValue LowerINTRINSIC_WO_CHAIN(SDValue Op, SelectionDAG &DAG) const;		SDValue LowerINTRINSIC_WO_CHAIN(SDValue Op, SelectionDAG &DAG) const;

[...]

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp


[...]	if (Subtarget->hasFullFP16()) {
setOperationAction(ISD::FROUNDEVEN, Ty, Legal);		setOperationAction(ISD::FROUNDEVEN, Ty, Legal);
}		}
}		}

if (Subtarget->hasSVE())		if (Subtarget->hasSVE())
setOperationAction(ISD::VSCALE, MVT::i32, Custom);		setOperationAction(ISD::VSCALE, MVT::i32, Custom);

setTruncStoreAction(MVT::v4i16, MVT::v4i8, Custom);		setTruncStoreAction(MVT::v4i16, MVT::v4i8, Custom);

		setLoadExtAction(ISD::EXTLOAD, MVT::v4i16, MVT::v4i8, Custom);
		Lint: Pre-merge checks: clang-format: please reformat the code
		dmgreen: What about v4i16 as well? And EXTLOAD (which is probably fine to treat as a ZEXTLOAD).
		SjoerdMeijer: I thought the v4i16 were already handled, but will double check and precommit some tests for this (the new test in this patch, extended with v4i16 cases) if we don't have them already. Yeah, I thought about EXTLOAD, but wasn't sure how to trigger this, but will look into this.
		setLoadExtAction(ISD::SEXTLOAD, MVT::v4i16, MVT::v4i8, Custom);
		setLoadExtAction(ISD::ZEXTLOAD, MVT::v4i16, MVT::v4i8, Custom);
		setLoadExtAction(ISD::EXTLOAD, MVT::v4i32, MVT::v4i8, Custom);
		Lint: Pre-merge checks: clang-format: please reformat the code
		setLoadExtAction(ISD::SEXTLOAD, MVT::v4i32, MVT::v4i8, Custom);
		setLoadExtAction(ISD::ZEXTLOAD, MVT::v4i32, MVT::v4i8, Custom);
}		}

if (Subtarget->hasSVE()) {		if (Subtarget->hasSVE()) {
for (auto VT : {MVT::nxv16i8, MVT::nxv8i16, MVT::nxv4i32, MVT::nxv2i64}) {		for (auto VT : {MVT::nxv16i8, MVT::nxv8i16, MVT::nxv4i32, MVT::nxv2i64}) {
setOperationAction(ISD::BITREVERSE, VT, Custom);		setOperationAction(ISD::BITREVERSE, VT, Custom);
setOperationAction(ISD::BSWAP, VT, Custom);		setOperationAction(ISD::BSWAP, VT, Custom);
setOperationAction(ISD::CTLZ, VT, Custom);		setOperationAction(ISD::CTLZ, VT, Custom);
setOperationAction(ISD::CTPOP, VT, Custom);		setOperationAction(ISD::CTPOP, VT, Custom);
[...]	SDValue Result = DAG.getMemIntrinsicNode(
{StoreNode->getChain(), Lo, Hi, StoreNode->getBasePtr()},		{StoreNode->getChain(), Lo, Hi, StoreNode->getBasePtr()},
StoreNode->getMemoryVT(), StoreNode->getMemOperand());		StoreNode->getMemoryVT(), StoreNode->getMemOperand());
return Result;		return Result;
}		}

return SDValue();		return SDValue();
}		}

		// Custom lowering for extending v4i8 vector loads.
		SDValue AArch64TargetLowering::LowerLOAD(SDValue Op,
		Lint: Pre-merge checks: clang-format: please reformat the code
		SelectionDAG &DAG) const {
		SDLoc DL(Op);
		dmgreen: It may be worth checking or asserting that the type is v4i32/v4i16. Also DL is more common.
		LoadSDNode *LoadNode = cast<LoadSDNode>(Op);
		assert(LoadNode && "Expected custom lowering of a load node");
		EVT VT = Op->getValueType(0);
		assert((VT == MVT::v4i16 || VT == MVT::v4i32) && "Expected v4i16 or v4i32");

		if (LoadNode->getMemoryVT() != MVT::v4i8)
		return SDValue();

		unsigned ExtType;
		if (LoadNode->getExtensionType() == ISD::SEXTLOAD)
		ExtType = ISD::SIGN_EXTEND;
		else if (LoadNode->getExtensionType() == ISD::ZEXTLOAD ||
		efriedma: `Op.hasOneUse()` is very different from `Op->hasOneUse()`.
		LoadNode->getExtensionType() == ISD::EXTLOAD)
		ExtType = ISD::ZERO_EXTEND;
		else
		return SDValue();

		SDValue Load = DAG.getLoad(MVT::f32, DL, DAG.getEntryNode(),
		LoadNode->getBasePtr(),
		Lint: Pre-merge checks: clang-format: please reformat the code
		MachinePointerInfo());
		SDValue Chain = Load.getValue(1);
		dmgreen: SIGN_EXTEND/ZERO_EXTEND do not need a second VT argument, I don't believe.
		SDValue Vec = DAG.getNode(ISD::SCALAR_TO_VECTOR, DL, MVT::v2f32, Load);
		SDValue BC = DAG.getNode(ISD::BITCAST, DL, MVT::v8i8, Vec);
		SDValue Ext = DAG.getNode(ExtType, DL, MVT::v8i16, BC);
		Ext = DAG.getNode(ISD::EXTRACT_SUBVECTOR, DL, MVT::v4i16, Ext,
		dmgreen: ISD::SIGN_EXTEND > ExtType?
		DAG.getConstant(0, DL, MVT::i64));
		Lint: Pre-merge checks: clang-format: please reformat the code
		if (VT == MVT::v4i32)
		Ext = DAG.getNode(ExtType, DL, MVT::v4i32, Ext);
		return DAG.getMergeValues({Ext, Chain}, DL);
		}

// Generate SUBS and CSEL for integer abs.		// Generate SUBS and CSEL for integer abs.
SDValue AArch64TargetLowering::LowerABS(SDValue Op, SelectionDAG &DAG) const {		SDValue AArch64TargetLowering::LowerABS(SDValue Op, SelectionDAG &DAG) const {
MVT VT = Op.getSimpleValueType();		MVT VT = Op.getSimpleValueType();

if (VT.isVector())		if (VT.isVector())
return LowerToPredicatedOp(Op, DAG, AArch64ISD::ABS_MERGE_PASSTHRU);		return LowerToPredicatedOp(Op, DAG, AArch64ISD::ABS_MERGE_PASSTHRU);

SDLoc DL(Op);		SDLoc DL(Op);
[...]	SDValue AArch64TargetLowering::LowerOperation(SDValue Op,
}		}
case ISD::TRUNCATE:		case ISD::TRUNCATE:
return LowerTRUNCATE(Op, DAG);		return LowerTRUNCATE(Op, DAG);
case ISD::MLOAD:		case ISD::MLOAD:
return LowerFixedLengthVectorMLoadToSVE(Op, DAG);		return LowerFixedLengthVectorMLoadToSVE(Op, DAG);
case ISD::LOAD:		case ISD::LOAD:
if (useSVEForFixedLengthVectorVT(Op.getValueType()))		if (useSVEForFixedLengthVectorVT(Op.getValueType()))
return LowerFixedLengthVectorLoadToSVE(Op, DAG);		return LowerFixedLengthVectorLoadToSVE(Op, DAG);
llvm_unreachable("Unexpected request to lower ISD::LOAD");		return LowerLOAD(Op, DAG);
case ISD::ADD:		case ISD::ADD:
return LowerToPredicatedOp(Op, DAG, AArch64ISD::ADD_PRED);		return LowerToPredicatedOp(Op, DAG, AArch64ISD::ADD_PRED);
case ISD::AND:		case ISD::AND:
return LowerToScalableOp(Op, DAG);		return LowerToScalableOp(Op, DAG);
case ISD::SUB:		case ISD::SUB:
return LowerToPredicatedOp(Op, DAG, AArch64ISD::SUB_PRED);		return LowerToPredicatedOp(Op, DAG, AArch64ISD::SUB_PRED);
case ISD::FMAXIMUM:		case ISD::FMAXIMUM:
return LowerToPredicatedOp(Op, DAG, AArch64ISD::FMAX_PRED);		return LowerToPredicatedOp(Op, DAG, AArch64ISD::FMAX_PRED);
[...]

llvm/test/CodeGen/AArch64/neon-extload.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc < %s -verify-machineinstrs -mtriple=aarch64-none-linux-gnu -mattr=+neon | FileCheck %s --check-prefix=LE			; RUN: llc < %s -verify-machineinstrs -mtriple=aarch64-none-linux-gnu -mattr=+neon | FileCheck %s --check-prefix=LE
	; RUN: llc < %s -verify-machineinstrs -mtriple=aarch64_be-none-linux-gnu -mattr=+neon | FileCheck %s --check-prefix=BE			; RUN: llc < %s -verify-machineinstrs -mtriple=aarch64_be-none-linux-gnu -mattr=+neon | FileCheck %s --check-prefix=BE

	define <4 x i32> @fsext_v4i32(<4 x i8>* %a) {			define <4 x i32> @fsext_v4i32(<4 x i8>* %a) {
	; LE-LABEL: fsext_v4i32:			; LE-LABEL: fsext_v4i32:
	; LE: // %bb.0:			; LE: // %bb.0:
	; LE-NEXT: ldrsb w8, [x0]			; LE-NEXT: ldr s0, [x0]
	; LE-NEXT: ldrsb w9, [x0, #1]			; LE-NEXT: sshll v0.8h, v0.8b, #0
	; LE-NEXT: ldrsb w10, [x0, #2]			; LE-NEXT: sshll v0.4s, v0.4h, #0
	; LE-NEXT: ldrsb w11, [x0, #3]
	; LE-NEXT: fmov s0, w8
	; LE-NEXT: mov v0.s[1], w9
	; LE-NEXT: mov v0.s[2], w10
	; LE-NEXT: mov v0.s[3], w11
	; LE-NEXT: ret			; LE-NEXT: ret
	;			;
	; BE-LABEL: fsext_v4i32:			; BE-LABEL: fsext_v4i32:
	; BE: // %bb.0:			; BE: // %bb.0:
	; BE-NEXT: ldrsb w8, [x0]			; BE-NEXT: ldr s0, [x0]
	; BE-NEXT: ldrsb w9, [x0, #1]			; BE-NEXT: rev32 v0.8b, v0.8b
	; BE-NEXT: ldrsb w10, [x0, #2]			; BE-NEXT: sshll v0.8h, v0.8b, #0
	; BE-NEXT: ldrsb w11, [x0, #3]			; BE-NEXT: sshll v0.4s, v0.4h, #0
	; BE-NEXT: fmov s0, w8
	; BE-NEXT: mov v0.s[1], w9
	; BE-NEXT: mov v0.s[2], w10
	; BE-NEXT: mov v0.s[3], w11
	; BE-NEXT: rev64 v0.4s, v0.4s			; BE-NEXT: rev64 v0.4s, v0.4s
	; BE-NEXT: ext v0.16b, v0.16b, v0.16b, #8			; BE-NEXT: ext v0.16b, v0.16b, v0.16b, #8
	; BE-NEXT: ret			; BE-NEXT: ret
	%x = load <4 x i8>, <4 x i8>* %a			%x = load <4 x i8>, <4 x i8>* %a
	%y = sext <4 x i8> %x to <4 x i32>			%y = sext <4 x i8> %x to <4 x i32>
	ret <4 x i32> %y			ret <4 x i32> %y
	}			}

	define <4 x i32> @fzext_v4i32(<4 x i8>* %a) {			define <4 x i32> @fzext_v4i32(<4 x i8>* %a) {
	; LE-LABEL: fzext_v4i32:			; LE-LABEL: fzext_v4i32:
	; LE: // %bb.0:			; LE: // %bb.0:
	; LE-NEXT: ldrb w8, [x0]			; LE-NEXT: ldr s0, [x0]
	; LE-NEXT: ldrb w9, [x0, #1]			; LE-NEXT: ushll v0.8h, v0.8b, #0
	; LE-NEXT: ldrb w10, [x0, #2]			; LE-NEXT: ushll v0.4s, v0.4h, #0
	; LE-NEXT: ldrb w11, [x0, #3]
	; LE-NEXT: fmov s0, w8
	; LE-NEXT: mov v0.s[1], w9
	; LE-NEXT: mov v0.s[2], w10
	; LE-NEXT: mov v0.s[3], w11
	; LE-NEXT: ret			; LE-NEXT: ret
	;			;
	; BE-LABEL: fzext_v4i32:			; BE-LABEL: fzext_v4i32:
	; BE: // %bb.0:			; BE: // %bb.0:
	; BE-NEXT: ldrb w8, [x0]			; BE-NEXT: ldr s0, [x0]
	; BE-NEXT: ldrb w9, [x0, #1]			; BE-NEXT: rev32 v0.8b, v0.8b
				SjoerdMeijer: I am trying to remember how big-endian works in LLVM, but since I noticed these reverse here, this looked okay'ish to me, but I haven't tested BE. Any opinions on this welcome (while I look a bit more at this)....
				efriedma: Looks fine to me. The rev32 comes out of lowering the ISD::BITCAST.
				SjoerdMeijer: Thanks for confirming!
	; BE-NEXT: ldrb w10, [x0, #2]			; BE-NEXT: ushll v0.8h, v0.8b, #0
	; BE-NEXT: ldrb w11, [x0, #3]			; BE-NEXT: ushll v0.4s, v0.4h, #0
	; BE-NEXT: fmov s0, w8
	; BE-NEXT: mov v0.s[1], w9
	; BE-NEXT: mov v0.s[2], w10
	; BE-NEXT: mov v0.s[3], w11
	; BE-NEXT: rev64 v0.4s, v0.4s			; BE-NEXT: rev64 v0.4s, v0.4s
	; BE-NEXT: ext v0.16b, v0.16b, v0.16b, #8			; BE-NEXT: ext v0.16b, v0.16b, v0.16b, #8
	; BE-NEXT: ret			; BE-NEXT: ret
	%x = load <4 x i8>, <4 x i8>* %a			%x = load <4 x i8>, <4 x i8>* %a
	%y = zext <4 x i8> %x to <4 x i32>			%y = zext <4 x i8> %x to <4 x i32>
	ret <4 x i32> %y			ret <4 x i32> %y
	}			}

	define i32 @loadExt.i32(<4 x i8>* %ref) {			define i32 @loadExt.i32(<4 x i8>* %ref) {
	; CHECK-LABEL: loadExt.i32:
	; CHECK: ldrb
	; LE-LABEL: loadExt.i32:			; LE-LABEL: loadExt.i32:
				dmgreen: I would perhaps remove this dot, as dots in function names are a little unusual.
	; LE: // %bb.0:			; LE: // %bb.0:
	; LE-NEXT: ldrb w0, [x0]			; LE-NEXT: ldr s0, [x0]
				SjoerdMeijer: Ahh, just spotted that this is a regression. Will look into this.
				efriedma: I'm not really concerned; IR-level optimizations should catch this.
				; LE-NEXT: ushll v0.8h, v0.8b, #0
				; LE-NEXT: umov w8, v0.h[0]
				; LE-NEXT: and w0, w8, #0xff
	; LE-NEXT: ret			; LE-NEXT: ret
	;			;
	; BE-LABEL: loadExt.i32:			; BE-LABEL: loadExt.i32:
	; BE: // %bb.0:			; BE: // %bb.0:
	; BE-NEXT: ldrb w0, [x0]			; BE-NEXT: ldr s0, [x0]
				; BE-NEXT: rev32 v0.8b, v0.8b
				; BE-NEXT: ushll v0.8h, v0.8b, #0
				; BE-NEXT: umov w8, v0.h[0]
				; BE-NEXT: and w0, w8, #0xff
	; BE-NEXT: ret			; BE-NEXT: ret
	%a = load <4 x i8>, <4 x i8>* %ref			%a = load <4 x i8>, <4 x i8>* %ref
	%vecext = extractelement <4 x i8> %a, i32 0			%vecext = extractelement <4 x i8> %a, i32 0
	%conv = zext i8 %vecext to i32			%conv = zext i8 %vecext to i32
	ret i32 %conv			ret i32 %conv
	}			}

	define <4 x i16> @fsext_v4i16(<4 x i8>* %a) {			define <4 x i16> @fsext_v4i16(<4 x i8>* %a) {
	; LE-LABEL: fsext_v4i16:			; LE-LABEL: fsext_v4i16:
	; LE: // %bb.0:			; LE: // %bb.0:
	; LE-NEXT: ldrsb w8, [x0]			; LE-NEXT: ldr s0, [x0]
	; LE-NEXT: ldrsb w9, [x0, #1]			; LE-NEXT: sshll v0.8h, v0.8b, #0
	; LE-NEXT: ldrsb w10, [x0, #2]
	; LE-NEXT: ldrsb w11, [x0, #3]
	; LE-NEXT: fmov s0, w8
	; LE-NEXT: mov v0.h[1], w9
	; LE-NEXT: mov v0.h[2], w10
	; LE-NEXT: mov v0.h[3], w11
	; LE-NEXT: // kill: def $d0 killed $d0 killed $q0			; LE-NEXT: // kill: def $d0 killed $d0 killed $q0
	; LE-NEXT: ret			; LE-NEXT: ret
	;			;
	; BE-LABEL: fsext_v4i16:			; BE-LABEL: fsext_v4i16:
	; BE: // %bb.0:			; BE: // %bb.0:
	; BE-NEXT: ldrsb w8, [x0]			; BE-NEXT: ldr s0, [x0]
	; BE-NEXT: ldrsb w9, [x0, #1]			; BE-NEXT: rev32 v0.8b, v0.8b
	; BE-NEXT: ldrsb w10, [x0, #2]			; BE-NEXT: sshll v0.8h, v0.8b, #0
	; BE-NEXT: ldrsb w11, [x0, #3]
	; BE-NEXT: fmov s0, w8
	; BE-NEXT: mov v0.h[1], w9
	; BE-NEXT: mov v0.h[2], w10
	; BE-NEXT: mov v0.h[3], w11
	; BE-NEXT: rev64 v0.4h, v0.4h			; BE-NEXT: rev64 v0.4h, v0.4h
	; BE-NEXT: ret			; BE-NEXT: ret
	%x = load <4 x i8>, <4 x i8>* %a			%x = load <4 x i8>, <4 x i8>* %a
	%y = sext <4 x i8> %x to <4 x i16>			%y = sext <4 x i8> %x to <4 x i16>
	ret <4 x i16> %y			ret <4 x i16> %y
	}			}

	define <4 x i16> @fzext_v4i16(<4 x i8>* %a) {			define <4 x i16> @fzext_v4i16(<4 x i8>* %a) {
	; LE-LABEL: fzext_v4i16:			; LE-LABEL: fzext_v4i16:
	; LE: // %bb.0:			; LE: // %bb.0:
	; LE-NEXT: ldrb w8, [x0]			; LE-NEXT: ldr s0, [x0]
	; LE-NEXT: ldrb w9, [x0, #1]			; LE-NEXT: ushll v0.8h, v0.8b, #0
	; LE-NEXT: ldrb w10, [x0, #2]
	; LE-NEXT: ldrb w11, [x0, #3]
	; LE-NEXT: fmov s0, w8
	; LE-NEXT: mov v0.h[1], w9
	; LE-NEXT: mov v0.h[2], w10
	; LE-NEXT: mov v0.h[3], w11
	; LE-NEXT: // kill: def $d0 killed $d0 killed $q0			; LE-NEXT: // kill: def $d0 killed $d0 killed $q0
	; LE-NEXT: ret			; LE-NEXT: ret
	;			;
	; BE-LABEL: fzext_v4i16:			; BE-LABEL: fzext_v4i16:
	; BE: // %bb.0:			; BE: // %bb.0:
	; BE-NEXT: ldrb w8, [x0]			; BE-NEXT: ldr s0, [x0]
	; BE-NEXT: ldrb w9, [x0, #1]			; BE-NEXT: rev32 v0.8b, v0.8b
	; BE-NEXT: ldrb w10, [x0, #2]			; BE-NEXT: ushll v0.8h, v0.8b, #0
	; BE-NEXT: ldrb w11, [x0, #3]
	; BE-NEXT: fmov s0, w8
	; BE-NEXT: mov v0.h[1], w9
	; BE-NEXT: mov v0.h[2], w10
	; BE-NEXT: mov v0.h[3], w11
	; BE-NEXT: rev64 v0.4h, v0.4h			; BE-NEXT: rev64 v0.4h, v0.4h
	; BE-NEXT: ret			; BE-NEXT: ret
	%x = load <4 x i8>, <4 x i8>* %a			%x = load <4 x i8>, <4 x i8>* %a
	%y = zext <4 x i8> %x to <4 x i16>			%y = zext <4 x i8> %x to <4 x i16>
	ret <4 x i16> %y			ret <4 x i16> %y
	}			}

				define <4 x i16> @anyext_v4i16(<4 x i8>* %a, <4 x i8>* %b) {
				; LE-LABEL: anyext_v4i16:
				; LE: // %bb.0:
				; LE-NEXT: ldr s0, [x0]
				; LE-NEXT: ldr s1, [x1]
				; LE-NEXT: ushll v0.8h, v0.8b, #0
				; LE-NEXT: ushll v1.8h, v1.8b, #0
				; LE-NEXT: add v0.4h, v0.4h, v1.4h
				; LE-NEXT: shl v0.4h, v0.4h, #8
				; LE-NEXT: sshr v0.4h, v0.4h, #8
				; LE-NEXT: ret
				;
				; BE-LABEL: anyext_v4i16:
				; BE: // %bb.0:
				; BE-NEXT: ldr s0, [x0]
				; BE-NEXT: ldr s1, [x1]
				; BE-NEXT: rev32 v0.8b, v0.8b
				; BE-NEXT: rev32 v1.8b, v1.8b
				; BE-NEXT: ushll v0.8h, v0.8b, #0
				; BE-NEXT: ushll v1.8h, v1.8b, #0
				; BE-NEXT: add v0.4h, v0.4h, v1.4h
				; BE-NEXT: shl v0.4h, v0.4h, #8
				; BE-NEXT: sshr v0.4h, v0.4h, #8
				; BE-NEXT: rev64 v0.4h, v0.4h
				; BE-NEXT: ret
				%x = load <4 x i8>, <4 x i8>* %a, align 4
				%y = load <4 x i8>, <4 x i8>* %b, align 4
				%z = add <4 x i8> %x, %y
				%s = sext <4 x i8> %z to <4 x i16>
				ret <4 x i16> %s
				}

				define <4 x i32> @anyext_v4i32(<4 x i8>* %a, <4 x i8>* %b) {
				; LE-LABEL: anyext_v4i32:
				; LE: // %bb.0:
				; LE-NEXT: ldr s0, [x0]
				; LE-NEXT: ldr s1, [x1]
				; LE-NEXT: ushll v0.8h, v0.8b, #0
				; LE-NEXT: ushll v1.8h, v1.8b, #0
				; LE-NEXT: add v0.4h, v0.4h, v1.4h
				; LE-NEXT: ushll v0.4s, v0.4h, #0
				; LE-NEXT: shl v0.4s, v0.4s, #24
				; LE-NEXT: sshr v0.4s, v0.4s, #24
				; LE-NEXT: ret
				;
				; BE-LABEL: anyext_v4i32:
				; BE: // %bb.0:
				; BE-NEXT: ldr s0, [x0]
				; BE-NEXT: ldr s1, [x1]
				; BE-NEXT: rev32 v0.8b, v0.8b
				; BE-NEXT: rev32 v1.8b, v1.8b
				; BE-NEXT: ushll v0.8h, v0.8b, #0
				; BE-NEXT: ushll v1.8h, v1.8b, #0
				; BE-NEXT: add v0.4h, v0.4h, v1.4h
				; BE-NEXT: ushll v0.4s, v0.4h, #0
				; BE-NEXT: shl v0.4s, v0.4s, #24
				; BE-NEXT: sshr v0.4s, v0.4s, #24
				; BE-NEXT: rev64 v0.4s, v0.4s
				; BE-NEXT: ext v0.16b, v0.16b, v0.16b, #8
				; BE-NEXT: ret
				%x = load <4 x i8>, <4 x i8>* %a, align 4
				%y = load <4 x i8>, <4 x i8>* %b, align 4
				%z = add <4 x i8> %x, %y
				%s = sext <4 x i8> %z to <4 x i32>
				ret <4 x i32> %s
				}