This is an archive of the discontinued LLVM Phabricator instance.


Differential D21571

[AArch64] Avoid generating indexed vector instructions for Exynos
Closed, Public

Authored by az on Jun 21 2016, 3:06 PM.


Details

Reviewers

sebpop
rengolin
t.p.northover
evandro

Commits

rGeb65d72d9cf0: [AArch64] Avoid generating indexed vector instructions for Exynos
rL283663: [AArch64] Avoid generating indexed vector instructions for Exynos

Summary

Avoid generating indexed vector instructions for Exynos. This is needed for fmla/fmls/fmul/fmulx.

For example, on Exynos the instruction fmla v0.4s, v1.4s, v2.s[1] is less efficient than the two-instruction sequence dup v2.4s, v2.s[1]; fmla v0.4s, v1.4s, v2.4s.
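The equivalence behind this rewrite can be illustrated with a small scalar model. The sketch below is plain C++, not LLVM code; the type V4S and the function names are hypothetical. It models the 4 x single-precision case: the indexed FMLA multiplies every lane of one source by a single selected lane of the other, while the rewritten sequence first broadcasts that lane into a temporary (the DUP) and then performs an ordinary lane-wise FMLA. std::fma stands in for the fused multiply-add.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical model of a 128-bit vector holding four single-precision lanes.
using V4S = std::array<float, 4>;

// Models "fmla acc.4s, a.4s, b.s[lane]": every lane of a is multiplied by the
// single element b[lane] and accumulated into acc with a fused multiply-add.
V4S fmlaIndexed(V4S acc, const V4S &a, const V4S &b, std::size_t lane) {
  assert(lane < b.size() && "lane index out of range");
  for (std::size_t i = 0; i < acc.size(); ++i)
    acc[i] = std::fma(a[i], b[lane], acc[i]);
  return acc;
}

// Models "dup tmp.4s, b.s[lane]": broadcast one element to all four lanes.
V4S dupLane(const V4S &b, std::size_t lane) {
  assert(lane < b.size() && "lane index out of range");
  V4S tmp;
  tmp.fill(b[lane]);
  return tmp;
}

// Models "fmla acc.4s, a.4s, b.4s": plain lane-wise fused multiply-add.
V4S fmlaVector(V4S acc, const V4S &a, const V4S &b) {
  for (std::size_t i = 0; i < acc.size(); ++i)
    acc[i] = std::fma(a[i], b[i], acc[i]);
  return acc;
}

// The rewritten two-instruction sequence: dup into a temporary, then fmla.
V4S fmlaViaDup(const V4S &acc, const V4S &a, const V4S &b, std::size_t lane) {
  return fmlaVector(acc, a, dupLane(b, lane));
}
```

Because both paths issue the same fused multiply-adds on the same operands, their results are bit-identical; the choice between them is purely a matter of latency on the target core.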

Diff Detail

Repository: rL LLVM

Event Timeline

az updated this revision to Diff 61447. Jun 21 2016, 3:06 PM

az retitled this revision to Avoid generating indexed vector instructions for Exynos.

az updated this object.

az added a reviewer: evandro.

az added a subscriber: llvm-commits.

flyingforyou added a subscriber: flyingforyou. Jun 21 2016, 3:18 PM

flyingforyou added inline comments.

llvm/test/CodeGen/AArch64/arm64-neon-2velem.ll
Line 451 (on Diff #61447): Is this line necessary?

evandro retitled this revision from Avoid generating indexed vector instructions for Exynos to [AArch64] Avoid generating indexed vector instructions for Exynos. Jun 22 2016, 12:57 PM

evandro added reviewers: rengolin, t.p.northover.

Hi Evandro,

Unless Exynos chips can't handle indexed lanes at all, this looks like a case for the cost model, not CPU flags.

cheers,
--renato

evandro added inline comments. Jun 22 2016, 1:13 PM

llvm/lib/Target/AArch64/AArch64.td
Line 96 (on Diff #61447): The ISA manual refers to such operations as "vector by element", so I'd prefer something like 's/VectorIndexing/VectorByElement/'.
llvm/lib/Target/AArch64/AArch64InstrInfo.td
Line 311 (on Diff #61447): s/Indexing/ByElement/
llvm/lib/Target/AArch64/AArch64Subtarget.h
Line 82 (on Diff #61447): s/Indexing/ByElement/
Line 189 (on Diff #61447): s/Indexing/ByElement/
llvm/test/CodeGen/AArch64/arm64-neon-2velem.ll
Line 2 (on Diff #61447): Refrain from setting the CPU in tests. Instead, use the feature that you added with "-mattr=no-vector-instruction-indexing", or rather, "-mattr=no-vector-by-element".

Redid this work as an optimization instead of modifying TableGen. The optimization is driven by instruction latencies, and currently only the Exynos latencies trigger it. It has mainly been tested on code that uses intrinsics.
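The latency rule can be sketched in isolation as follows. This is a simplified model under stated assumptions, not the pass itself: the Opcode enumerators and the latency numbers used with it are hypothetical placeholders for the per-instruction latencies a subtarget's scheduling model would report. Replacement is recommended only when the indexed instruction is strictly slower than the DUP plus the vector instruction, a missing latency means "do not replace", and each decision is memoized per indexed opcode.

```cpp
#include <cassert>
#include <map>

// Hypothetical opcodes standing in for e.g. FMLAv4i32_indexed, DUPv4i32lane,
// and FMLAv4f32.
enum Opcode { FMLA_INDEXED, DUP_LANE, FMLA_VECTOR };

// Hypothetical per-subtarget latency table (in cycles).
using LatencyTable = std::map<Opcode, unsigned>;

// Returns true when replacing Indexed by the pair (Dup, Vector) is expected to
// reduce latency. Decisions are cached per indexed opcode in Cache.
bool shouldReplace(const LatencyTable &Lat, Opcode Indexed, Opcode Dup,
                   Opcode Vector, std::map<Opcode, bool> &Cache) {
  auto Cached = Cache.find(Indexed);
  if (Cached != Cache.end())
    return Cached->second;

  // A missing entry models a subtarget with no scheduling data for an opcode.
  auto Lookup = [&Lat](Opcode Op, unsigned &Out) {
    auto It = Lat.find(Op);
    if (It == Lat.end())
      return false;
    Out = It->second;
    return true;
  };
  unsigned LI = 0, LD = 0, LV = 0;
  bool Known = Lookup(Indexed, LI) && Lookup(Dup, LD) && Lookup(Vector, LV);
  bool Replace = Known && LI > LD + LV;
  Cache[Indexed] = Replace;
  return Replace;
}
```

Like the per-opcode cache in the patch, this assumes an indexed opcode is always paired with the same two replacement instructions; otherwise the cache key would have to include them.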

Herald added subscribers: rengolin, aemerson. Aug 1 2016, 4:26 PM

az added inline comments. Aug 1 2016, 4:35 PM

llvm/test/CodeGen/AArch64/arm64-neon-2velem.ll
Line 2 (on Diff #66395): It is not a feature anymore. It is now an optimization that is currently triggered for Exynos only.

This looks better now, taking the scheduling costs into account. Thanks!

Though, I'm dubious as to whether AArch64InstrInfo.cpp is really the right place to land this. This looks like a job for a new pass...

What is posted here is a sub-pass of the backend peephole optimizer; the code implementing it is in AArch64/AArch64InstrInfo.cpp. This is very similar to optimizeCondBranch() and optimizeCompareInstr(), which are both peephole sub-passes whose main implementation is also in AArch64/AArch64InstrInfo.cpp. I can leave it mostly as is, with minor modifications such as moving the code to another file or even a new file. I am also perfectly fine with creating a new pass. This optimization does not deserve a new pass in my opinion, but it may be a good idea to create one so that we can include future similar optimizations there. Let me know what you think. Thanks for looking at this.

This version moves the previous patch's functionality and code into a new standalone AArch64 optimization pass.

@az wrote:

This optimization does not deserve a new pass in my opinion, but it may be a good idea to create one so that we can include future similar optimizations there. Let me know what you think. Thanks for looking at this.

I'm not a big fan of adding optimisations to AArch64InstrInfo because that's a place for more generic codegen. I agree with you, this is a bit heavy for such a small gain on a single core, but it's also a big piece of unrelated code to live inside InstrInfo.

It's possible that we could fuse all those passes into one, so we don't have to iterate over everything multiple times. But maybe this is not a job for this commit.

As it stands, I'm happy either way, as long as we do this more sensibly in due time. @t.p.northover, any comments on that?

However, my inline comments are still pertinent no matter the choice, since that code will land either in a new file or in AArch64InstrInfo anyway.

cheers,
--renato

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
Line 83 (on Diff #67039): Nit: "replaceInstruction" sounds like it actually replaces the instruction. This should be named "shouldReplaceInstruction".
Line 257 (on Diff #67039): You only use Src2IsKill to compute getKillRegState(Src2IsKill); maybe you should cache the result of getKillRegState(Src2IsKill) instead? Same for the others.
Line 263 (on Diff #67039): Why are you setting DupDest in reuseDUP, and re-setting it here? Same for the 4 ops below.
Line 267 (on Diff #67039): Coding style: no newline between "}" and "else". All other cases, too.
Line 284 (on Diff #67039): This looks odd and may be prone to future errors. Please change to: } else { return false; }
Line 315 (on Diff #67039): This seems like an odd pattern. Can't you just add the MIs to a list when they change and delete all of them at the end?
llvm/test/CodeGen/AArch64/arm64-neon-2velem.ll
Line 387 (on Diff #67039): What about the dups?
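The deferred-deletion pattern suggested in the line 315 comment can be sketched on a plain std::list. Everything here is a hypothetical stand-in (Instr, the opcode strings, rewriteBlock), not the MachineInstr API: replaced instructions are only recorded during the walk and are erased after it finishes, so the loop never invalidates the iterator it is advancing.

```cpp
#include <cassert>
#include <list>
#include <string>
#include <vector>

// Hypothetical stand-in for a machine instruction: just an opcode name.
struct Instr {
  std::string Opcode;
};

// Rewrites every "fmla_indexed" into "dup" followed by "fmla_vector".
// Returns true if anything changed.
bool rewriteBlock(std::list<Instr> &Block) {
  std::vector<std::list<Instr>::iterator> ToErase;
  for (auto It = Block.begin(); It != Block.end(); ++It) {
    if (It->Opcode != "fmla_indexed")
      continue;
    // std::list::insert places the new node before It and leaves It valid.
    Block.insert(It, Instr{"dup"});
    Block.insert(It, Instr{"fmla_vector"});
    // Only record the old instruction here; erasing it now would invalidate It.
    ToErase.push_back(It);
  }
  for (auto It : ToErase)
    Block.erase(It);
  return !ToErase.empty();
}
```

std::list is used because inserting before a node leaves existing iterators valid, which loosely mirrors how new instructions are built before MI in a MachineBasicBlock.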

Addressed all the comments (sorry for the delay; I was away).

az marked 5 inline comments as done. Aug 31 2016, 4:46 PM

az added inline comments.

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
Line 264 (on Diff #69921): If DupDest is set in reuseDUP (i.e., we found a DUP instruction that we can reuse), then we do not re-set it here, because reuseDUP returned true. Otherwise, DupDest is set here. I added more comments in this revision, but I can rewrite this function to make things clearer if needed.
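The reuse check described here amounts to a backward scan of the basic block. The following sketch uses a hypothetical, simplified instruction record (SimpleInstr, with a made-up DUP opcode number) rather than MachineInstr: starting just before the current instruction, it walks toward the start of the block and returns the destination of the first DUP with the same source register and lane. Like reuseDUP, it does not check whether the source register is redefined in between, which is only safe under the assumption that registers are single-definition virtual registers (SSA form).

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical, simplified instruction record used only for this sketch.
struct SimpleInstr {
  unsigned Opcode;
  unsigned Dest;
  unsigned Src;
  unsigned Lane; // only meaningful for DUP
};

constexpr unsigned DUP = 1; // hypothetical opcode number

// Scans backwards from the instruction at index Pos (exclusive) to the start
// of the block. Returns the destination register of a matching DUP, if any.
std::optional<unsigned> findReusableDup(const std::vector<SimpleInstr> &Block,
                                        std::size_t Pos, unsigned Src,
                                        unsigned Lane) {
  assert(Pos <= Block.size() && "position past the end of the block");
  for (std::size_t I = Pos; I-- > 0;) {
    const SimpleInstr &MI = Block[I];
    if (MI.Opcode == DUP && MI.Src == Src && MI.Lane == Lane)
      return MI.Dest;
  }
  return std::nullopt;
}
```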

ping

@rengolin wrote:

As it stands, I'm happy either way, as long as we do this more sensibly in due time. @t.p.northover, any comments on that?

It seems like @rengolin is ok with this change.
@t.p.northover could you please review as well? Thanks!

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
Line 285 (on Diff #69921): @az please address this comment as well.

Addressed a couple of overlooked comments.

Herald added subscribers: mgorny, beanz. Sep 19 2016, 4:29 PM

Ping again! We need this patch.

Fixed some formatting issues.

Perhaps I missed this, but how is the pass being enabled for *only* Exynos? AFAICT, it's enabled for all AArch64 targets.

sebpop added inline comments. Sep 28 2016, 9:11 AM

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
Line 12 (on Diff #72841): Let's add a bit more description of what this pass does, and an example from below...
Line 131 (on Diff #72841): Remove the else, as there is a return statement in the then clause.
Line 167 (on Diff #72841): ... this example; the comment at the top of the file can contain some of the text above.
Line 177 (on Diff #72841): Remove this empty statement.
llvm/test/CodeGen/AArch64/arm64-neon-2velem.ll
Line 2 (on Diff #72841): Let's also add a comment here saying that we need the instruction costs of Exynos-M1 to trigger the transform.

In D21571#555504, @mcrosier wrote:

Perhaps I missed this, but how is the pass being enabled for *only* Exynos? AFAICT, it's enabled for all AArch64 targets.

I think Sebastian answered my question indirectly; the instructions cost of Exynos-M1 triggers the transform.

I would prefer we also have a target feature (as was done in the first version of the patch) that early exits runOnMachineFunction for non-Exynos-M1 subtargets. Otherwise, we're doing a lot of unnecessary work (i.e., switching over every instruction in the function) for non-Exynos-M1 subtargets.

In D21571#555555, @mcrosier wrote:

I would prefer we also have a target feature (as was done in the first version of the patch) that early exits runOnMachineFunction for non-Exynos-M1 subtargets. Otherwise, we're doing a lot of unnecessary work (i.e., switching over every instruction in the function) for non-Exynos-M1 subtargets.

Thanks, Chad, for catching this.
There seems to be another compile-time improvement that we could make: see the inline comment.

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
Line 106 (on Diff #72841): Let's move this init() call together with ST and TII outside the for(BB) for(Insn) loops: we can call it in runOnFunction().

Added a simple check to exit this pass early, so that no analysis is done for targets that do not need it.
Since there was pushback against adding a target feature when this patch was first introduced, the check mainly compares the latency of an indexed fmla with that of its replacement.

rengolin added inline comments. Sep 28 2016, 5:25 PM

llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
Line 463 (on Diff #72920): Unnecessary whitespace change

flyingforyou added inline comments. Sep 28 2016, 5:49 PM

llvm/test/CodeGen/AArch64/arm64-neon-2velem.ll
Line 2 (on Diff #72920): I think a more appropriate check-prefix is `EXYNOSM1`, in case future Exynos cores behave differently from this one.

sebpop added inline comments. Sep 28 2016, 9:49 PM

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
Line 127 (on Diff #72920): Please move the above 10 lines of code out of the loops to...
Line 341 (on Diff #72920): ... move the code here.
llvm/test/CodeGen/AArch64/arm64-neon-2velem.ll
Line 2 (on Diff #72920): As EXYNOS is only a marker for FileCheck, I think it is not important to change it to EXYNOSM1: it is clear from the -mcpu flag that these checks are for Exynos-M1.
Line 3 (on Diff #72920): s/triggers/trigger/

sebpop added inline comments. Sep 28 2016, 10:06 PM

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
Line 171 (on Diff #72920): This only checks 1 of the current 12 pairs of instructions that may be replaced. I think we should put all 12 pairs in a vector and iterate through all of them.
Line 224 (on Diff #72920): In that case, the 12 cases currently handled can be refactored like this: (DupMCID, MulMCID) = find(MI.getOpcode()). New transform patterns would then be added by adding them to the vector.
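The table-driven refactoring suggested here might look like the following sketch. The enumerators mirror the AArch64 opcode names used in the pass, but they are hypothetical stand-ins in a local enum, and only three of the twelve pairs are listed. Each row maps an indexed opcode to its (DUP, vector) replacement, so supporting a new pattern means adding a row; a real version would also need a column for the register class (FPR64 vs. FPR128).

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <optional>
#include <utility>

// Hypothetical opcode numbers; in LLVM these would be AArch64::* enumerators.
enum Opc : unsigned {
  FMLAv4i32_indexed, FMULv4i32_indexed, FMLAv2i64_indexed,
  DUPv4i32lane, DUPv2i64lane,
  FMLAv4f32, FMULv4f32, FMLAv2f64,
};

struct Rewrite {
  Opc Indexed; // vector-by-element form to be replaced
  Opc Dup;     // DUP that broadcasts the selected lane
  Opc Vector;  // equivalent vector-by-vector form
};

// One row per transform; adding a pattern means adding a row.
constexpr Rewrite RewriteTable[] = {
    {FMLAv4i32_indexed, DUPv4i32lane, FMLAv4f32},
    {FMULv4i32_indexed, DUPv4i32lane, FMULv4f32},
    {FMLAv2i64_indexed, DUPv2i64lane, FMLAv2f64},
};

// Returns the (Dup, Vector) pair for Op, or nothing if Op is not handled.
std::optional<std::pair<Opc, Opc>> findRewrite(Opc Op) {
  auto It = std::find_if(std::begin(RewriteTable), std::end(RewriteTable),
                         [Op](const Rewrite &R) { return R.Indexed == Op; });
  if (It == std::end(RewriteTable))
    return std::nullopt;
  return std::make_pair(It->Dup, It->Vector);
}
```

A table like this would also let an early-exit check iterate over every row instead of probing a single representative pair, at some extra compile-time cost.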

Moved some code outside a loop for better compile time.

az marked 5 inline comments as done. Sep 30 2016, 9:39 AM

az added inline comments.

llvm/lib/Target/AArch64/AArch64VectorByElementOpt.cpp
Line 171 (on Diff #72920): Since all the instructions of concern so far are closely related and are executed by the same hardware unit, they most likely behave the same way. No need to increase compile time for now.

LGTM.
Before commit, let's wait for @t.p.northover or another maintainer to accept as well.
@evandro maybe you can approve this patch as you are the maintainer of Exynos-M1.

This revision is now accepted and ready to land. Sep 30 2016, 3:02 PM

rengolin added inline comments. Sep 30 2016, 6:57 PM

llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
Line 127 (on Diff #73069): Quick question: what's the point of this flag? It's hidden, enabled by default, and we're not testing it...

Removed the EnableVectorByElement flag as it is not that useful for now.

az marked an inline comment as done. Oct 4 2016, 9:53 AM

az added inline comments.

llvm/lib/Target/AArch64/AArch64TargetMachine.cpp
Line 127 (on Diff #73069): Removed it.

In D21571#555555, @mcrosier wrote:

I would prefer we also have a target feature (as was done in the first version of the patch) that early exits runOnMachineFunction for non-Exynos-M1 subtargets. Otherwise, we're doing a lot of unnecessary work (i.e., switching over every instruction in the function) for non-Exynos-M1 subtargets.

Just to make sure your question is answered: earlyExitVectElement() does that job in a generic way. :)

evandro accepted this revision. Oct 7 2016, 1:55 PM

evandro edited edge metadata.

Closed by commit rL283663: [AArch64] Avoid generating indexed vector instructions for Exynos (authored by spop). Oct 8 2016, 7:18 AM

This revision was automatically updated to reflect the committed changes.

kristof.beyls mentioned this in D38196: [AArch64] Avoid interleaved SIMD store instructions for Exynos. Oct 26 2017, 2:30 PM

Revision Contents

- llvm/trunk/lib/Target/AArch64/AArch64.h: 2 lines
- llvm/trunk/lib/Target/AArch64/AArch64TargetMachine.cpp: 2 lines
- llvm/trunk/lib/Target/AArch64/AArch64VectorByElementOpt.cpp: 371 lines
- llvm/trunk/lib/Target/AArch64/CMakeLists.txt: 1 line
- llvm/trunk/test/CodeGen/AArch64/arm64-neon-2velem.ll: 241 lines

Diff 74035

llvm/trunk/lib/Target/AArch64/AArch64.h

(29 unchanged lines omitted)
 FunctionPass *createAArch64RedundantCopyEliminationPass();
 FunctionPass *createAArch64ConditionalCompares();
 FunctionPass *createAArch64AdvSIMDScalar();
 FunctionPass *createAArch64ISelDag(AArch64TargetMachine &TM,
                                    CodeGenOpt::Level OptLevel);
 FunctionPass *createAArch64StorePairSuppressPass();
 FunctionPass *createAArch64ExpandPseudoPass();
 FunctionPass *createAArch64LoadStoreOptimizationPass();
+FunctionPass *createAArch64VectorByElementOptPass();
 ModulePass *createAArch64PromoteConstantPass();
 FunctionPass *createAArch64ConditionOptimizerPass();
 FunctionPass *createAArch64AddressTypePromotionPass();
 FunctionPass *createAArch64A57FPLoadBalancing();
 FunctionPass *createAArch64A53Fix835769();

 FunctionPass *createAArch64CleanupLocalDynamicTLSPass();

 FunctionPass *createAArch64CollectLOHPass();

 void initializeAArch64A53Fix835769Pass(PassRegistry&);
 void initializeAArch64A57FPLoadBalancingPass(PassRegistry&);
 void initializeAArch64AddressTypePromotionPass(PassRegistry&);
 void initializeAArch64AdvSIMDScalarPass(PassRegistry&);
 void initializeAArch64CollectLOHPass(PassRegistry&);
 void initializeAArch64ConditionalComparesPass(PassRegistry&);
 void initializeAArch64ConditionOptimizerPass(PassRegistry&);
 void initializeAArch64DeadRegisterDefinitionsPass(PassRegistry&);
 void initializeAArch64ExpandPseudoPass(PassRegistry&);
 void initializeAArch64LoadStoreOptPass(PassRegistry&);
+void initializeAArch64VectorByElementOptPass(PassRegistry&);
 void initializeAArch64PromoteConstantPass(PassRegistry&);
 void initializeAArch64RedundantCopyEliminationPass(PassRegistry&);
 void initializeAArch64StorePairSuppressPass(PassRegistry&);
 void initializeLDTLSCleanupPass(PassRegistry&);
 } // end namespace llvm

 #endif

llvm/trunk/lib/Target/AArch64/AArch64TargetMachine.cpp

(135 unchanged lines omitted; enclosing function: extern "C" void LLVMInitializeAArch64Target())
   initializeAArch64AddressTypePromotionPass(*PR);
   initializeAArch64AdvSIMDScalarPass(*PR);
   initializeAArch64CollectLOHPass(*PR);
   initializeAArch64ConditionalComparesPass(*PR);
   initializeAArch64ConditionOptimizerPass(*PR);
   initializeAArch64DeadRegisterDefinitionsPass(*PR);
   initializeAArch64ExpandPseudoPass(*PR);
   initializeAArch64LoadStoreOptPass(*PR);
+  initializeAArch64VectorByElementOptPass(*PR);
   initializeAArch64PromoteConstantPass(*PR);
   initializeAArch64RedundantCopyEliminationPass(*PR);
   initializeAArch64StorePairSuppressPass(*PR);
   initializeLDTLSCleanupPass(*PR);
 }

 //===----------------------------------------------------------------------===//
 // AArch64 Lowering public interface.
(265 unchanged lines omitted; enclosing function: bool AArch64PassConfig::addILPOpts())
   if (EnableCCMP)
     addPass(createAArch64ConditionalCompares());
   if (EnableMCR)
     addPass(&MachineCombinerID);
   if (EnableEarlyIfConversion)
     addPass(&EarlyIfConverterID);
   if (EnableStPairSuppress)
     addPass(createAArch64StorePairSuppressPass());
+  addPass(createAArch64VectorByElementOptPass());
   return true;
 }

 void AArch64PassConfig::addPreRegAlloc() {
   // Use AdvSIMD scalar instructions whenever profitable.
   if (TM->getOptLevel() != CodeGenOpt::None && EnableAdvSIMDScalar) {
     addPass(createAArch64AdvSIMDScalar());
     // The AdvSIMD pass may produce copies that can be rewritten to
(38 unchanged lines omitted)

llvm/trunk/lib/Target/AArch64/AArch64VectorByElementOpt.cpp

//=- AArch64VectorByElementOpt.cpp - AArch64 vector by element inst opt pass =//
//
//                     The LLVM Compiler Infrastructure
//
// This file is distributed under the University of Illinois Open Source
// License. See LICENSE.TXT for details.
//
//===----------------------------------------------------------------------===//
//
// This file contains a pass that performs optimization for vector by element
// SIMD instructions.
//
// Certain SIMD instructions with a vector element operand are not efficient.
// Rewrite them into SIMD instructions with vector operands. This rewrite
// is driven by the latency of the instructions.
//
// Example:
//    fmla v0.4s, v1.4s, v2.s[1]
//    is rewritten into
//    dup v3.4s, v2.s[1]
//    fmla v0.4s, v1.4s, v3.4s
//===----------------------------------------------------------------------===//

#include "AArch64InstrInfo.h"
#include "llvm/ADT/Statistic.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineRegisterInfo.h"
#include "llvm/CodeGen/TargetSchedule.h"

using namespace llvm;

#define DEBUG_TYPE "aarch64-vectorbyelement-opt"

STATISTIC(NumModifiedInstr,
          "Number of vector by element instructions modified");

#define AARCH64_VECTOR_BY_ELEMENT_OPT_NAME                                     \
  "AArch64 vector by element instruction optimization pass"

namespace {

struct AArch64VectorByElementOpt : public MachineFunctionPass {
  static char ID;
  AArch64VectorByElementOpt() : MachineFunctionPass(ID) {
    initializeAArch64VectorByElementOptPass(*PassRegistry::getPassRegistry());
  }

  const TargetInstrInfo *TII;
  MachineRegisterInfo *MRI;
  TargetSchedModel SchedModel;

  /// Based only on latency of instructions, determine if it is cost efficient
  /// to replace the instruction InstDesc by the two instructions InstDescRep1
  /// and InstDescRep2.
  /// Return true if replacement is recommended.
  bool
  shouldReplaceInstruction(MachineFunction *MF, const MCInstrDesc *InstDesc,
                           const MCInstrDesc *InstDescRep1,
                           const MCInstrDesc *InstDescRep2,
                           std::map<unsigned, bool> &VecInstElemTable) const;

  /// Determine if we need to exit the vector by element instruction
  /// optimization pass early. This makes sure that Targets with no need
  /// for this optimization do not spend any compile time on this pass.
  /// This check is done by comparing the latency of an indexed FMLA
  /// instruction to the latency of the DUP + the latency of a vector
  /// FMLA instruction. We do not check on other related instructions such
  /// as FMLS as we assume that if the situation shows up for one
  /// instruction, then it is likely to show up for the related ones.
  /// Return true if early exit of the pass is recommended.
  bool earlyExitVectElement(MachineFunction *MF);

  /// Check whether an equivalent DUP instruction has already been
  /// created or not.
  /// Return true when the dup instruction already exists. In this case,
  /// DestReg will point to the destination of the already created DUP.
  bool reuseDUP(MachineInstr &MI, unsigned DupOpcode, unsigned SrcReg,
                unsigned LaneNumber, unsigned *DestReg) const;

  /// Certain SIMD instructions with a vector element operand are not efficient.
  /// Rewrite them into SIMD instructions with vector operands. This rewrite
  /// is driven by the latency of the instructions.
  /// Return true if the SIMD instruction is modified.
  bool optimizeVectElement(MachineInstr &MI,
                           std::map<unsigned, bool> *VecInstElemTable) const;

  bool runOnMachineFunction(MachineFunction &Fn) override;

  StringRef getPassName() const override {
    return AARCH64_VECTOR_BY_ELEMENT_OPT_NAME;
  }
};
char AArch64VectorByElementOpt::ID = 0;
} // namespace

INITIALIZE_PASS(AArch64VectorByElementOpt, "aarch64-vectorbyelement-opt",
                AARCH64_VECTOR_BY_ELEMENT_OPT_NAME, false, false)

/// Based only on latency of instructions, determine if it is cost efficient
/// to replace the instruction InstDesc by the two instructions InstDescRep1
/// and InstDescRep2. Note that it is assumed in this function that an
/// instruction of type InstDesc is always replaced by the same two
/// instructions as results are cached here.
/// Return true if replacement is recommended.
bool AArch64VectorByElementOpt::shouldReplaceInstruction(
    MachineFunction *MF, const MCInstrDesc *InstDesc,
    const MCInstrDesc *InstDescRep1, const MCInstrDesc *InstDescRep2,
    std::map<unsigned, bool> &VecInstElemTable) const {
  // Check if the replacement decision is already available in the cached
  // table. If so, return it.
  if (!VecInstElemTable.empty() &&
      VecInstElemTable.find(InstDesc->getOpcode()) != VecInstElemTable.end())
    return VecInstElemTable[InstDesc->getOpcode()];

  unsigned SCIdx = InstDesc->getSchedClass();
  unsigned SCIdxRep1 = InstDescRep1->getSchedClass();
  unsigned SCIdxRep2 = InstDescRep2->getSchedClass();
  const MCSchedClassDesc *SCDesc =
      SchedModel.getMCSchedModel()->getSchedClassDesc(SCIdx);
  const MCSchedClassDesc *SCDescRep1 =
      SchedModel.getMCSchedModel()->getSchedClassDesc(SCIdxRep1);
  const MCSchedClassDesc *SCDescRep2 =
      SchedModel.getMCSchedModel()->getSchedClassDesc(SCIdxRep2);

  // If a subtarget does not define resources for any of the instructions
  // of interest, then return false for no replacement.
  if (!SCDesc->isValid() || SCDesc->isVariant() || !SCDescRep1->isValid() ||
      SCDescRep1->isVariant() || !SCDescRep2->isValid() ||
      SCDescRep2->isVariant()) {
    VecInstElemTable[InstDesc->getOpcode()] = false;
    return false;
  }

  if (SchedModel.computeInstrLatency(InstDesc->getOpcode()) >
      SchedModel.computeInstrLatency(InstDescRep1->getOpcode()) +
          SchedModel.computeInstrLatency(InstDescRep2->getOpcode())) {
    VecInstElemTable[InstDesc->getOpcode()] = true;
    return true;
  }
  VecInstElemTable[InstDesc->getOpcode()] = false;
  return false;
}

/// Determine if we need to exit the vector by element instruction
/// optimization pass early. This makes sure that Targets with no need
/// for this optimization do not spend any compile time on this pass.
/// This check is done by comparing the latency of an indexed FMLA
/// instruction to the latency of the DUP + the latency of a vector
/// FMLA instruction. We do not check on other related instructions such
/// as FMLS as we assume that if the situation shows up for one
/// instruction, then it is likely to show up for the related ones.
/// Return true if early exit of the pass is recommended.
bool AArch64VectorByElementOpt::earlyExitVectElement(MachineFunction *MF) {
  std::map<unsigned, bool> VecInstElemTable;
  const MCInstrDesc *IndexMulMCID = &TII->get(AArch64::FMLAv4i32_indexed);
  const MCInstrDesc *DupMCID = &TII->get(AArch64::DUPv4i32lane);
  const MCInstrDesc *MulMCID = &TII->get(AArch64::FMULv4f32);

  if (!shouldReplaceInstruction(MF, IndexMulMCID, DupMCID, MulMCID,
                                VecInstElemTable))
    return true;
  return false;
}

/// Check whether an equivalent DUP instruction has already been
/// created or not.
/// Return true when the dup instruction already exists. In this case,
/// DestReg will point to the destination of the already created DUP.
bool AArch64VectorByElementOpt::reuseDUP(MachineInstr &MI, unsigned DupOpcode,
                                         unsigned SrcReg, unsigned LaneNumber,
                                         unsigned *DestReg) const {
  for (MachineBasicBlock::iterator MII = MI, MIE = MI.getParent()->begin();
       MII != MIE;) {
    MII--;
    MachineInstr *CurrentMI = &*MII;

    if (CurrentMI->getOpcode() == DupOpcode &&
        CurrentMI->getNumOperands() == 3 &&
        CurrentMI->getOperand(1).getReg() == SrcReg &&
        CurrentMI->getOperand(2).getImm() == LaneNumber) {
      *DestReg = CurrentMI->getOperand(0).getReg();
      return true;
    }
  }

  return false;
}

				/// Certain SIMD instructions with vector element operand are not efficient.
				/// Rewrite them into SIMD instructions with vector operands. This rewrite
				/// is driven by the latency of the instructions.
				/// The instruction of concerns are for the time being fmla, fmls, fmul,
				/// and fmulx and hence they are hardcoded.
				///
				/// Example:
				/// fmla v0.4s, v1.4s, v2.s[1]
				/// is rewritten into
				/// dup v3.4s, v2.s[1] // dup not necessary if redundant
				/// fmla v0.4s, v1.4s, v3.4s
				/// Return true if the SIMD instruction is modified.
				bool AArch64VectorByElementOpt::optimizeVectElement(
				MachineInstr &MI, std::map<unsigned, bool> *VecInstElemTable) const {
				const MCInstrDesc MulMCID, DupMCID;
				const TargetRegisterClass *RC = &AArch64::FPR128RegClass;

				switch (MI.getOpcode()) {
				default:
				return false;

				// 4X32 instructions
				case AArch64::FMLAv4i32_indexed:
				DupMCID = &TII->get(AArch64::DUPv4i32lane);
				MulMCID = &TII->get(AArch64::FMLAv4f32);
				break;
				case AArch64::FMLSv4i32_indexed:
				DupMCID = &TII->get(AArch64::DUPv4i32lane);
				MulMCID = &TII->get(AArch64::FMLSv4f32);
				break;
				case AArch64::FMULXv4i32_indexed:
				DupMCID = &TII->get(AArch64::DUPv4i32lane);
				MulMCID = &TII->get(AArch64::FMULXv4f32);
				break;
				case AArch64::FMULv4i32_indexed:
				DupMCID = &TII->get(AArch64::DUPv4i32lane);
				MulMCID = &TII->get(AArch64::FMULv4f32);
				break;

				// 2X64 instructions
				case AArch64::FMLAv2i64_indexed:
				DupMCID = &TII->get(AArch64::DUPv2i64lane);
				MulMCID = &TII->get(AArch64::FMLAv2f64);
				break;
				case AArch64::FMLSv2i64_indexed:
				DupMCID = &TII->get(AArch64::DUPv2i64lane);
				MulMCID = &TII->get(AArch64::FMLSv2f64);
				break;
				case AArch64::FMULXv2i64_indexed:
				DupMCID = &TII->get(AArch64::DUPv2i64lane);
				MulMCID = &TII->get(AArch64::FMULXv2f64);
				break;
				case AArch64::FMULv2i64_indexed:
				DupMCID = &TII->get(AArch64::DUPv2i64lane);
				MulMCID = &TII->get(AArch64::FMULv2f64);
				break;

				// 2X32 instructions
				case AArch64::FMLAv2i32_indexed:
				RC = &AArch64::FPR64RegClass;
				DupMCID = &TII->get(AArch64::DUPv2i32lane);
				MulMCID = &TII->get(AArch64::FMLAv2f32);
				break;
				case AArch64::FMLSv2i32_indexed:
				RC = &AArch64::FPR64RegClass;
				DupMCID = &TII->get(AArch64::DUPv2i32lane);
				MulMCID = &TII->get(AArch64::FMLSv2f32);
				break;
				case AArch64::FMULXv2i32_indexed:
				RC = &AArch64::FPR64RegClass;
				DupMCID = &TII->get(AArch64::DUPv2i32lane);
				MulMCID = &TII->get(AArch64::FMULXv2f32);
				break;
  case AArch64::FMULv2i32_indexed:
    RC = &AArch64::FPR64RegClass;
    DupMCID = &TII->get(AArch64::DUPv2i32lane);
    MulMCID = &TII->get(AArch64::FMULv2f32);
    break;
  }

  if (!shouldReplaceInstruction(MI.getParent()->getParent(),
                                &TII->get(MI.getOpcode()), DupMCID, MulMCID,
                                *VecInstElemTable))
    return false;

  const DebugLoc &DL = MI.getDebugLoc();
  MachineBasicBlock &MBB = *MI.getParent();
  MachineRegisterInfo &MRI = MBB.getParent()->getRegInfo();

  // Get the operands of the current SIMD arithmetic instruction.
  unsigned MulDest = MI.getOperand(0).getReg();
  unsigned SrcReg0 = MI.getOperand(1).getReg();
  unsigned Src0IsKill = getKillRegState(MI.getOperand(1).isKill());
  unsigned SrcReg1 = MI.getOperand(2).getReg();
  unsigned Src1IsKill = getKillRegState(MI.getOperand(2).isKill());
  unsigned DupDest;

  // Instructions of interest have either 4 or 5 operands.
  if (MI.getNumOperands() == 5) {
    unsigned SrcReg2 = MI.getOperand(3).getReg();
    unsigned Src2IsKill = getKillRegState(MI.getOperand(3).isKill());
    unsigned LaneNumber = MI.getOperand(4).getImm();

    // Create a new DUP instruction. Note that if an equivalent DUP instruction
    // has already been created before, then use that one instead of creating
    // a new one.
    if (!reuseDUP(MI, DupMCID->getOpcode(), SrcReg2, LaneNumber, &DupDest)) {
      DupDest = MRI.createVirtualRegister(RC);
      BuildMI(MBB, MI, DL, *DupMCID, DupDest)
          .addReg(SrcReg2, Src2IsKill)
          .addImm(LaneNumber);
    }
    BuildMI(MBB, MI, DL, *MulMCID, MulDest)
        .addReg(SrcReg0, Src0IsKill)
        .addReg(SrcReg1, Src1IsKill)
        .addReg(DupDest, Src2IsKill);
  } else if (MI.getNumOperands() == 4) {
    unsigned LaneNumber = MI.getOperand(3).getImm();
    if (!reuseDUP(MI, DupMCID->getOpcode(), SrcReg1, LaneNumber, &DupDest)) {
      DupDest = MRI.createVirtualRegister(RC);
      BuildMI(MBB, MI, DL, *DupMCID, DupDest)
          .addReg(SrcReg1, Src1IsKill)
          .addImm(LaneNumber);
    }
    BuildMI(MBB, MI, DL, *MulMCID, MulDest)
        .addReg(SrcReg0, Src0IsKill)
        .addReg(DupDest, Src1IsKill);
  } else {
    return false;
  }

  ++NumModifiedInstr;
  return true;
}
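For readers less familiar with the MachineInstr API, the rewrite above can be modeled on a toy instruction list. This is a minimal sketch, not the pass itself: `ToyInstr`, `findEarlierDup`, and `expandBlock` are hypothetical names, registers are plain integers, and the code assumes SSA form (each register defined once, as with virtual registers before register allocation), which is what makes reusing an earlier DUP of the same source register and lane safe. The backward scan is an assumption about how `reuseDUP`, whose body is not shown in this excerpt, finds an equivalent DUP.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Toy stand-in for a MachineInstr (hypothetical). Registers are integers and,
// as with virtual registers before register allocation, each is defined once.
struct ToyInstr {
  std::string Opcode;     // e.g. "FMLA_indexed", "FMUL_indexed", "DUP", "FMLA"
  int Dest = -1;          // defined register
  std::vector<int> Srcs;  // source registers; for indexed ops, the last one
                          // is the vector whose lane is read
  int Lane = -1;          // lane number for indexed ops and DUPs, else -1
};

// Scans backward from Block[Pos - 1] for a DUP of (SrcReg, Lane) and returns
// its destination register if one exists.
std::optional<int> findEarlierDup(const std::vector<ToyInstr> &Block,
                                  std::size_t Pos, int SrcReg, int Lane) {
  for (std::size_t I = Pos; I > 0; --I) {
    const ToyInstr &Cand = Block[I - 1];
    if (Cand.Opcode == "DUP" && Cand.Srcs.size() == 1 &&
        Cand.Srcs[0] == SrcReg && Cand.Lane == Lane)
      return Cand.Dest;
  }
  return std::nullopt;
}

// Rewrites every indexed instruction into an optional DUP plus the plain
// vector form, mirroring the 4- and 5-operand paths above: only the indexed
// source is redirected to the DUP result; the other sources are kept as-is.
std::vector<ToyInstr> expandBlock(const std::vector<ToyInstr> &In,
                                  int &NextVReg) {
  std::vector<ToyInstr> Out;
  for (const ToyInstr &MI : In) {
    if (MI.Lane < 0 || MI.Opcode == "DUP" || MI.Srcs.empty()) {
      Out.push_back(MI);  // not an indexed instruction: keep unchanged
      continue;
    }
    int IndexedReg = MI.Srcs.back();
    int DupDest;
    if (std::optional<int> Prev =
            findEarlierDup(Out, Out.size(), IndexedReg, MI.Lane)) {
      DupDest = *Prev;  // reuse an equivalent DUP created earlier
    } else {
      DupDest = NextVReg++;  // fresh virtual register for a new DUP
      Out.push_back({"DUP", DupDest, {IndexedReg}, MI.Lane});
    }
    ToyInstr Vec = MI;
    Vec.Opcode = MI.Opcode.substr(0, MI.Opcode.find("_indexed"));
    Vec.Srcs.back() = DupDest;
    Vec.Lane = -1;
    Out.push_back(Vec);
  }
  return Out;
}
```

For example, an `FMLA_indexed` followed by an `FMUL_indexed` that read the same lane of the same register expand into a single `DUP` shared by a plain `FMLA` and a plain `FMUL`.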

bool AArch64VectorByElementOpt::runOnMachineFunction(MachineFunction &MF) {
  if (skipFunction(*MF.getFunction()))
    return false;

  TII = MF.getSubtarget().getInstrInfo();
  MRI = &MF.getRegInfo();
  const TargetSubtargetInfo &ST = MF.getSubtarget();
  const AArch64InstrInfo *AAII =
      static_cast<const AArch64InstrInfo *>(ST.getInstrInfo());
  if (!AAII)
    return false;
  SchedModel.init(ST.getSchedModel(), &ST, AAII);
  if (!SchedModel.hasInstrSchedModel())
    return false;

  // A simple check to exit this pass early for targets that do not need it.
  if (earlyExitVectElement(&MF))
    return false;

  bool Changed = false;
  std::map<unsigned, bool> VecInstElemTable;
  SmallVector<MachineInstr *, 8> RemoveMIs;

  for (MachineBasicBlock &MBB : MF) {
    for (MachineBasicBlock::iterator MII = MBB.begin(), MIE = MBB.end();
         MII != MIE;) {
      MachineInstr &MI = *MII;
      if (optimizeVectElement(MI, &VecInstElemTable)) {
        // Add MI to the list of instructions to be removed given that it has
        // been replaced.
        RemoveMIs.push_back(&MI);
        Changed = true;
      }
      ++MII;
    }
  }

  for (MachineInstr *MI : RemoveMIs)
    MI->eraseFromParent();

  return Changed;
}
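`VecInstElemTable` is created once per function and passed down, which suggests the replace-or-keep decision is cached per opcode, and the driver defers erasing replaced instructions until the walk is over. The sketch below models both ideas under stated assumptions: the criterion (replace when the indexed form's latency exceeds the DUP latency plus the plain vector op latency) is an assumption about `shouldReplaceInstruction`, whose body is not shown here, and the latency numbers in the usage are made up rather than taken from the Exynos-M1 scheduling model. `shouldReplace`, `expandAll`, and `LatencyTable` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <map>
#include <string>
#include <vector>

// Hypothetical latency table keyed by opcode name.
using LatencyTable = std::map<std::string, unsigned>;

// Assumed criterion: prefer DUP + plain vector op only when the indexed form
// is slower than the two replacement instructions combined. The decision is
// memoized per indexed opcode, in the spirit of VecInstElemTable.
bool shouldReplace(const LatencyTable &Lat, const std::string &Indexed,
                   const std::string &Dup, const std::string &Vector,
                   std::map<std::string, bool> &Cache) {
  auto Hit = Cache.find(Indexed);
  if (Hit != Cache.end())
    return Hit->second;
  bool Replace = Lat.at(Indexed) > Lat.at(Dup) + Lat.at(Vector);
  Cache[Indexed] = Replace;
  return Replace;
}

// Walks a block of opcode names. Replacements are inserted before the
// original instruction, and the originals are erased only after the walk,
// so the iterator driving the loop is never invalidated.
std::size_t expandAll(std::list<std::string> &Block, const LatencyTable &Lat,
                      std::map<std::string, bool> &Cache) {
  const std::string Suffix = "_indexed";
  std::vector<std::list<std::string>::iterator> ToErase;
  for (auto It = Block.begin(); It != Block.end(); ++It) {
    std::size_t Pos = It->find(Suffix);
    if (Pos == std::string::npos)
      continue;  // not an indexed instruction
    std::string Vector = It->substr(0, Pos);
    if (!shouldReplace(Lat, *It, "DUP", Vector, Cache))
      continue;
    Block.emplace(It, "DUP");  // inserted before It; It stays valid
    Block.insert(It, Vector);
    ToErase.push_back(It);
  }
  for (auto It : ToErase)
    Block.erase(It);
  return ToErase.size();
}
```

With a table in which the indexed `FMLA` is slower than `DUP` plus plain `FMLA`, but the indexed `FMUL` is not, only the `FMLA` instructions are expanded, and the second `FMLA_indexed` is decided from the cache.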

/// createAArch64VectorByElementOptPass - returns an instance of the
/// vector by element optimization pass.
FunctionPass *llvm::createAArch64VectorByElementOptPass() {
  return new AArch64VectorByElementOpt();
}

llvm/trunk/lib/Target/AArch64/CMakeLists.txt

add_llvm_target(AArch64CodeGen
...
AArch64PBQPRegAlloc.cpp		AArch64PBQPRegAlloc.cpp
AArch64RegisterInfo.cpp		AArch64RegisterInfo.cpp
AArch64SelectionDAGInfo.cpp		AArch64SelectionDAGInfo.cpp
AArch64StorePairSuppress.cpp		AArch64StorePairSuppress.cpp
AArch64Subtarget.cpp		AArch64Subtarget.cpp
AArch64TargetMachine.cpp		AArch64TargetMachine.cpp
AArch64TargetObjectFile.cpp		AArch64TargetObjectFile.cpp
AArch64TargetTransformInfo.cpp		AArch64TargetTransformInfo.cpp
		AArch64VectorByElementOpt.cpp
${GLOBAL_ISEL_BUILD_FILES}		${GLOBAL_ISEL_BUILD_FILES}
)		)

add_dependencies(LLVMAArch64CodeGen intrinsics_gen)		add_dependencies(LLVMAArch64CodeGen intrinsics_gen)

add_subdirectory(TargetInfo)		add_subdirectory(TargetInfo)
add_subdirectory(AsmParser)		add_subdirectory(AsmParser)
add_subdirectory(Disassembler)		add_subdirectory(Disassembler)
add_subdirectory(InstPrinter)		add_subdirectory(InstPrinter)
add_subdirectory(MCTargetDesc)		add_subdirectory(MCTargetDesc)
add_subdirectory(Utils)		add_subdirectory(Utils)

llvm/trunk/test/CodeGen/AArch64/arm64-neon-2velem.ll

; RUN: llc < %s -verify-machineinstrs -mtriple=arm64-none-linux-gnu -mattr=+neon -fp-contract=fast \| FileCheck %s		; RUN: llc < %s -verify-machineinstrs -mtriple=arm64-none-linux-gnu -mattr=+neon -fp-contract=fast \| FileCheck %s
		; RUN: llc < %s -verify-machineinstrs -mtriple=arm64-none-linux-gnu -mattr=+neon -fp-contract=fast -mcpu=exynos-m1 \| FileCheck --check-prefix=EXYNOS %s
		; The instruction latencies of Exynos-M1 trigger the transform we see under the Exynos check.

declare <2 x double> @llvm.aarch64.neon.fmulx.v2f64(<2 x double>, <2 x double>)		declare <2 x double> @llvm.aarch64.neon.fmulx.v2f64(<2 x double>, <2 x double>)

declare <4 x float> @llvm.aarch64.neon.fmulx.v4f32(<4 x float>, <4 x float>)		declare <4 x float> @llvm.aarch64.neon.fmulx.v4f32(<4 x float>, <4 x float>)

declare <2 x float> @llvm.aarch64.neon.fmulx.v2f32(<2 x float>, <2 x float>)		declare <2 x float> @llvm.aarch64.neon.fmulx.v2f32(<2 x float>, <2 x float>)

declare <4 x i32> @llvm.aarch64.neon.sqrdmulh.v4i32(<4 x i32>, <4 x i32>)		declare <4 x i32> @llvm.aarch64.neon.sqrdmulh.v4i32(<4 x i32>, <4 x i32>)
...
%mul = mul <4 x i32> %shuffle, %a		%mul = mul <4 x i32> %shuffle, %a
ret <4 x i32> %mul		ret <4 x i32> %mul
}		}

define <2 x float> @test_vfma_lane_f32(<2 x float> %a, <2 x float> %b, <2 x float> %v) {		define <2 x float> @test_vfma_lane_f32(<2 x float> %a, <2 x float> %b, <2 x float> %v) {
; CHECK-LABEL: test_vfma_lane_f32:		; CHECK-LABEL: test_vfma_lane_f32:
; CHECK: fmla {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[1]		; CHECK: fmla {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfma_lane_f32:
		; EXYNOS: dup [[x:v[0-9]+]].2s, {{v[0-9]+}}.s[1]
		; EXYNOS: fmla {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, [[x]].2s
		; EXYNOS-NEXT: ret
entry:		entry:
%lane = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> <i32 1, i32 1>		%lane = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> <i32 1, i32 1>
%0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)		%0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
ret <2 x float> %0		ret <2 x float> %0
}		}

declare <2 x float> @llvm.fma.v2f32(<2 x float>, <2 x float>, <2 x float>)		declare <2 x float> @llvm.fma.v2f32(<2 x float>, <2 x float>, <2 x float>)

define <4 x float> @test_vfmaq_lane_f32(<4 x float> %a, <4 x float> %b, <2 x float> %v) {		define <4 x float> @test_vfmaq_lane_f32(<4 x float> %a, <4 x float> %b, <2 x float> %v) {
; CHECK-LABEL: test_vfmaq_lane_f32:		; CHECK-LABEL: test_vfmaq_lane_f32:
; CHECK: fmla {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[1]		; CHECK: fmla {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmaq_lane_f32:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[1]
		; EXYNOS: fmla {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
entry:		entry:
%lane = shufflevector <2 x float> %v, <2 x float> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>		%lane = shufflevector <2 x float> %v, <2 x float> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane, <4 x float> %b, <4 x float> %a)		%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane, <4 x float> %b, <4 x float> %a)
ret <4 x float> %0		ret <4 x float> %0
}		}

declare <4 x float> @llvm.fma.v4f32(<4 x float>, <4 x float>, <4 x float>)		declare <4 x float> @llvm.fma.v4f32(<4 x float>, <4 x float>, <4 x float>)

define <2 x float> @test_vfma_laneq_f32(<2 x float> %a, <2 x float> %b, <4 x float> %v) {		define <2 x float> @test_vfma_laneq_f32(<2 x float> %a, <2 x float> %b, <4 x float> %v) {
; CHECK-LABEL: test_vfma_laneq_f32:		; CHECK-LABEL: test_vfma_laneq_f32:
; CHECK: fmla {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[3]		; CHECK: fmla {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[3]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfma_laneq_f32:
		; EXYNOS: dup [[x:v[0-9]+]].2s, {{v[0-9]+}}.s[3]
		; EXYNOS: fmla {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, [[x]].2s
		; EXYNOS-NEXT: ret
entry:		entry:
%lane = shufflevector <4 x float> %v, <4 x float> undef, <2 x i32> <i32 3, i32 3>		%lane = shufflevector <4 x float> %v, <4 x float> undef, <2 x i32> <i32 3, i32 3>
%0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)		%0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
ret <2 x float> %0		ret <2 x float> %0
}		}

define <4 x float> @test_vfmaq_laneq_f32(<4 x float> %a, <4 x float> %b, <4 x float> %v) {		define <4 x float> @test_vfmaq_laneq_f32(<4 x float> %a, <4 x float> %b, <4 x float> %v) {
; CHECK-LABEL: test_vfmaq_laneq_f32:		; CHECK-LABEL: test_vfmaq_laneq_f32:
; CHECK: fmla {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[3]		; CHECK: fmla {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[3]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmaq_laneq_f32:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[3]
		; EXYNOS: fmla {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
entry:		entry:
%lane = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>		%lane = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>
%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane, <4 x float> %b, <4 x float> %a)		%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane, <4 x float> %b, <4 x float> %a)
ret <4 x float> %0		ret <4 x float> %0
}		}

define <2 x float> @test_vfms_lane_f32(<2 x float> %a, <2 x float> %b, <2 x float> %v) {		define <2 x float> @test_vfms_lane_f32(<2 x float> %a, <2 x float> %b, <2 x float> %v) {
; CHECK-LABEL: test_vfms_lane_f32:		; CHECK-LABEL: test_vfms_lane_f32:
; CHECK: fmls {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[1]		; CHECK: fmls {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfms_lane_f32:
		; EXYNOS: dup [[x:v[0-9]+]].2s, {{v[0-9]+}}.s[1]
		; EXYNOS: fmls {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, [[x]].2s
		; EXYNOS-NEXT: ret
entry:		entry:
%sub = fsub <2 x float> <float -0.000000e+00, float -0.000000e+00>, %v		%sub = fsub <2 x float> <float -0.000000e+00, float -0.000000e+00>, %v
%lane = shufflevector <2 x float> %sub, <2 x float> undef, <2 x i32> <i32 1, i32 1>		%lane = shufflevector <2 x float> %sub, <2 x float> undef, <2 x i32> <i32 1, i32 1>
%0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)		%0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
ret <2 x float> %0		ret <2 x float> %0
}		}

define <4 x float> @test_vfmsq_lane_f32(<4 x float> %a, <4 x float> %b, <2 x float> %v) {		define <4 x float> @test_vfmsq_lane_f32(<4 x float> %a, <4 x float> %b, <2 x float> %v) {
; CHECK-LABEL: test_vfmsq_lane_f32:		; CHECK-LABEL: test_vfmsq_lane_f32:
; CHECK: fmls {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[1]		; CHECK: fmls {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmsq_lane_f32:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[1]
		; EXYNOS: fmls {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
entry:		entry:
%sub = fsub <2 x float> <float -0.000000e+00, float -0.000000e+00>, %v		%sub = fsub <2 x float> <float -0.000000e+00, float -0.000000e+00>, %v
%lane = shufflevector <2 x float> %sub, <2 x float> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>		%lane = shufflevector <2 x float> %sub, <2 x float> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane, <4 x float> %b, <4 x float> %a)		%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane, <4 x float> %b, <4 x float> %a)
ret <4 x float> %0		ret <4 x float> %0
}		}

define <2 x float> @test_vfms_laneq_f32(<2 x float> %a, <2 x float> %b, <4 x float> %v) {		define <2 x float> @test_vfms_laneq_f32(<2 x float> %a, <2 x float> %b, <4 x float> %v) {
; CHECK-LABEL: test_vfms_laneq_f32:		; CHECK-LABEL: test_vfms_laneq_f32:
; CHECK: fmls {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[3]		; CHECK: fmls {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[3]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfms_laneq_f32:
		; EXYNOS: dup [[x:v[0-9]+]].2s, {{v[0-9]+}}.s[3]
		; EXYNOS: fmls {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, [[x]].2s
		; EXYNOS-NEXT: ret
entry:		entry:
%sub = fsub <4 x float> <float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00>, %v		%sub = fsub <4 x float> <float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00>, %v
%lane = shufflevector <4 x float> %sub, <4 x float> undef, <2 x i32> <i32 3, i32 3>		%lane = shufflevector <4 x float> %sub, <4 x float> undef, <2 x i32> <i32 3, i32 3>
%0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)		%0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
ret <2 x float> %0		ret <2 x float> %0
}		}

define <4 x float> @test_vfmsq_laneq_f32(<4 x float> %a, <4 x float> %b, <4 x float> %v) {		define <4 x float> @test_vfmsq_laneq_f32(<4 x float> %a, <4 x float> %b, <4 x float> %v) {
; CHECK-LABEL: test_vfmsq_laneq_f32:		; CHECK-LABEL: test_vfmsq_laneq_f32:
; CHECK: fmls {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[3]		; CHECK: fmls {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[3]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmsq_laneq_f32:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[3]
		; EXYNOS: fmls {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
entry:		entry:
%sub = fsub <4 x float> <float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00>, %v		%sub = fsub <4 x float> <float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00>, %v
%lane = shufflevector <4 x float> %sub, <4 x float> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>		%lane = shufflevector <4 x float> %sub, <4 x float> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>
%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane, <4 x float> %b, <4 x float> %a)		%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane, <4 x float> %b, <4 x float> %a)
ret <4 x float> %0		ret <4 x float> %0
}		}

define <2 x double> @test_vfmaq_lane_f64(<2 x double> %a, <2 x double> %b, <1 x double> %v) {		define <2 x double> @test_vfmaq_lane_f64(<2 x double> %a, <2 x double> %b, <1 x double> %v) {
; CHECK-LABEL: test_vfmaq_lane_f64:		; CHECK-LABEL: test_vfmaq_lane_f64:
; CHECK: fmla {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]		; CHECK: fmla {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmaq_lane_f64:
		; EXYNOS: dup [[x:v[0-9]+]].2d, {{v[0-9]+}}.d[0]
		; EXYNOS: fmla {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, [[x]].2d
		; EXYNOS-NEXT: ret
entry:		entry:
%lane = shufflevector <1 x double> %v, <1 x double> undef, <2 x i32> zeroinitializer		%lane = shufflevector <1 x double> %v, <1 x double> undef, <2 x i32> zeroinitializer
%0 = tail call <2 x double> @llvm.fma.v2f64(<2 x double> %lane, <2 x double> %b, <2 x double> %a)		%0 = tail call <2 x double> @llvm.fma.v2f64(<2 x double> %lane, <2 x double> %b, <2 x double> %a)
ret <2 x double> %0		ret <2 x double> %0
}		}

declare <2 x double> @llvm.fma.v2f64(<2 x double>, <2 x double>, <2 x double>)		declare <2 x double> @llvm.fma.v2f64(<2 x double>, <2 x double>, <2 x double>)

define <2 x double> @test_vfmaq_laneq_f64(<2 x double> %a, <2 x double> %b, <2 x double> %v) {		define <2 x double> @test_vfmaq_laneq_f64(<2 x double> %a, <2 x double> %b, <2 x double> %v) {
; CHECK-LABEL: test_vfmaq_laneq_f64:		; CHECK-LABEL: test_vfmaq_laneq_f64:
; CHECK: fmla {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[1]		; CHECK: fmla {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmaq_laneq_f64:
		; EXYNOS: dup [[x:v[0-9]+]].2d, {{v[0-9]+}}.d[1]
		; EXYNOS: fmla {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, [[x]].2d
		; EXYNOS-NEXT: ret
entry:		entry:
%lane = shufflevector <2 x double> %v, <2 x double> undef, <2 x i32> <i32 1, i32 1>		%lane = shufflevector <2 x double> %v, <2 x double> undef, <2 x i32> <i32 1, i32 1>
%0 = tail call <2 x double> @llvm.fma.v2f64(<2 x double> %lane, <2 x double> %b, <2 x double> %a)		%0 = tail call <2 x double> @llvm.fma.v2f64(<2 x double> %lane, <2 x double> %b, <2 x double> %a)
ret <2 x double> %0		ret <2 x double> %0
}		}

define <2 x double> @test_vfmsq_lane_f64(<2 x double> %a, <2 x double> %b, <1 x double> %v) {		define <2 x double> @test_vfmsq_lane_f64(<2 x double> %a, <2 x double> %b, <1 x double> %v) {
; CHECK-LABEL: test_vfmsq_lane_f64:		; CHECK-LABEL: test_vfmsq_lane_f64:
; CHECK: fmls {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]		; CHECK: fmls {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmsq_lane_f64:
		; EXYNOS: dup [[x:v[0-9]+]].2d, {{v[0-9]+}}.d[0]
		; EXYNOS: fmls {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, [[x]].2d
		; EXYNOS-NEXT: ret
entry:		entry:
%sub = fsub <1 x double> <double -0.000000e+00>, %v		%sub = fsub <1 x double> <double -0.000000e+00>, %v
%lane = shufflevector <1 x double> %sub, <1 x double> undef, <2 x i32> zeroinitializer		%lane = shufflevector <1 x double> %sub, <1 x double> undef, <2 x i32> zeroinitializer
%0 = tail call <2 x double> @llvm.fma.v2f64(<2 x double> %lane, <2 x double> %b, <2 x double> %a)		%0 = tail call <2 x double> @llvm.fma.v2f64(<2 x double> %lane, <2 x double> %b, <2 x double> %a)
ret <2 x double> %0		ret <2 x double> %0
}		}

define <2 x double> @test_vfmsq_laneq_f64(<2 x double> %a, <2 x double> %b, <2 x double> %v) {		define <2 x double> @test_vfmsq_laneq_f64(<2 x double> %a, <2 x double> %b, <2 x double> %v) {
; CHECK-LABEL: test_vfmsq_laneq_f64:		; CHECK-LABEL: test_vfmsq_laneq_f64:
; CHECK: fmls {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[1]		; CHECK: fmls {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmsq_laneq_f64:
		; EXYNOS: dup [[x:v[0-9]+]].2d, {{v[0-9]+}}.d[1]
		; EXYNOS: fmls {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, [[x]].2d
		; EXYNOS-NEXT: ret
entry:		entry:
%sub = fsub <2 x double> <double -0.000000e+00, double -0.000000e+00>, %v		%sub = fsub <2 x double> <double -0.000000e+00, double -0.000000e+00>, %v
%lane = shufflevector <2 x double> %sub, <2 x double> undef, <2 x i32> <i32 1, i32 1>		%lane = shufflevector <2 x double> %sub, <2 x double> undef, <2 x i32> <i32 1, i32 1>
%0 = tail call <2 x double> @llvm.fma.v2f64(<2 x double> %lane, <2 x double> %b, <2 x double> %a)		%0 = tail call <2 x double> @llvm.fma.v2f64(<2 x double> %lane, <2 x double> %b, <2 x double> %a)
ret <2 x double> %0		ret <2 x double> %0
}		}

define float @test_vfmas_laneq_f32(float %a, float %b, <4 x float> %v) {		define float @test_vfmas_laneq_f32(float %a, float %b, <4 x float> %v) {
; CHECK-LABEL: test_vfmas_laneq_f32		; CHECK-LABEL: test_vfmas_laneq_f32
; CHECK: fmla {{s[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}.s[3]		; CHECK: fmla {{s[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}.s[3]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmas_laneq_f32
		; EXYNOS: fmla {{s[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}.s[3]
		; EXYNOS-NEXT: ret
entry:		entry:
%extract = extractelement <4 x float> %v, i32 3		%extract = extractelement <4 x float> %v, i32 3
%0 = tail call float @llvm.fma.f32(float %b, float %extract, float %a)		%0 = tail call float @llvm.fma.f32(float %b, float %extract, float %a)
ret float %0		ret float %0
}		}

declare float @llvm.fma.f32(float, float, float)		declare float @llvm.fma.f32(float, float, float)

...
}		}

declare double @llvm.fma.f64(double, double, double)		declare double @llvm.fma.f64(double, double, double)

define float @test_vfmss_lane_f32(float %a, float %b, <2 x float> %v) {		define float @test_vfmss_lane_f32(float %a, float %b, <2 x float> %v) {
; CHECK-LABEL: test_vfmss_lane_f32		; CHECK-LABEL: test_vfmss_lane_f32
; CHECK: fmls {{s[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}.s[1]		; CHECK: fmls {{s[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}.s[1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmss_lane_f32
		; EXYNOS: fmls {{s[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}.s[1]
		; EXYNOS-NEXT: ret
entry:		entry:
%extract.rhs = extractelement <2 x float> %v, i32 1		%extract.rhs = extractelement <2 x float> %v, i32 1
%extract = fsub float -0.000000e+00, %extract.rhs		%extract = fsub float -0.000000e+00, %extract.rhs
%0 = tail call float @llvm.fma.f32(float %b, float %extract, float %a)		%0 = tail call float @llvm.fma.f32(float %b, float %extract, float %a)
ret float %0		ret float %0
}		}

define float @test_vfmss_laneq_f32(float %a, float %b, <4 x float> %v) {		define float @test_vfmss_laneq_f32(float %a, float %b, <4 x float> %v) {
; CHECK-LABEL: test_vfmss_laneq_f32		; CHECK-LABEL: test_vfmss_laneq_f32
; CHECK: fmls {{s[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}.s[3]		; CHECK: fmls {{s[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}.s[3]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
entry:		entry:
%extract.rhs = extractelement <4 x float> %v, i32 3		%extract.rhs = extractelement <4 x float> %v, i32 3
%extract = fsub float -0.000000e+00, %extract.rhs		%extract = fsub float -0.000000e+00, %extract.rhs
%0 = tail call float @llvm.fma.f32(float %b, float %extract, float %a)		%0 = tail call float @llvm.fma.f32(float %b, float %extract, float %a)
ret float %0		ret float %0
}		}

define double @test_vfmsd_laneq_f64(double %a, double %b, <2 x double> %v) {		define double @test_vfmsd_laneq_f64(double %a, double %b, <2 x double> %v) {
; CHECK-LABEL: test_vfmsd_laneq_f64		; CHECK-LABEL: test_vfmsd_laneq_f64
; CHECK: fmls {{d[0-9]+}}, {{d[0-9]+}}, {{v[0-9]+}}.d[1]		; CHECK: fmls {{d[0-9]+}}, {{d[0-9]+}}, {{v[0-9]+}}.d[1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmsd_laneq_f64
		; EXYNOS: fmls {{d[0-9]+}}, {{d[0-9]+}}, {{v[0-9]+}}.d[1]
		; EXYNOS-NEXT: ret
entry:		entry:
%extract.rhs = extractelement <2 x double> %v, i32 1		%extract.rhs = extractelement <2 x double> %v, i32 1
%extract = fsub double -0.000000e+00, %extract.rhs		%extract = fsub double -0.000000e+00, %extract.rhs
%0 = tail call double @llvm.fma.f64(double %b, double %extract, double %a)		%0 = tail call double @llvm.fma.f64(double %b, double %extract, double %a)
ret double %0		ret double %0
}		}

define double @test_vfmsd_lane_f64_0(double %a, double %b, <1 x double> %v) {		define double @test_vfmsd_lane_f64_0(double %a, double %b, <1 x double> %v) {
; CHCK-LABEL: test_vfmsd_lane_f64_0		; CHCK-LABEL: test_vfmsd_lane_f64_0
; CHCK: fmsub {{d[0-9]+}}, {{d[0-9]+}}, {{d[0-9]+}}, {{d[0-9]+}}		; CHCK: fmsub {{d[0-9]+}}, {{d[0-9]+}}, {{d[0-9]+}}, {{d[0-9]+}}
; CHCK-NEXT: ret		; CHCK-NEXT: ret
entry:		entry:
%tmp0 = fsub <1 x double> <double -0.000000e+00>, %v		%tmp0 = fsub <1 x double> <double -0.000000e+00>, %v
%tmp1 = extractelement <1 x double> %tmp0, i32 0		%tmp1 = extractelement <1 x double> %tmp0, i32 0
%0 = tail call double @llvm.fma.f64(double %b, double %tmp1, double %a)		%0 = tail call double @llvm.fma.f64(double %b, double %tmp1, double %a)
ret double %0		ret double %0
}		}

define float @test_vfmss_lane_f32_0(float %a, float %b, <2 x float> %v) {		define float @test_vfmss_lane_f32_0(float %a, float %b, <2 x float> %v) {
; CHECK-LABEL: test_vfmss_lane_f32_0		; CHECK-LABEL: test_vfmss_lane_f32_0
; CHECK: fmls {{s[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}.s[1]		; CHECK: fmls {{s[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}.s[1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmss_lane_f32_0
		; EXYNOS: fmls {{s[0-9]+}}, {{s[0-9]+}}, {{v[0-9]+}}.s[1]
		; EXYNOS-NEXT: ret
entry:		entry:
%tmp0 = fsub <2 x float> <float -0.000000e+00, float -0.000000e+00>, %v		%tmp0 = fsub <2 x float> <float -0.000000e+00, float -0.000000e+00>, %v
%tmp1 = extractelement <2 x float> %tmp0, i32 1		%tmp1 = extractelement <2 x float> %tmp0, i32 1
%0 = tail call float @llvm.fma.f32(float %b, float %tmp1, float %a)		%0 = tail call float @llvm.fma.f32(float %b, float %tmp1, float %a)
ret float %0		ret float %0
}		}

define float @test_vfmss_laneq_f32_0(float %a, float %b, <4 x float> %v) {		define float @test_vfmss_laneq_f32_0(float %a, float %b, <4 x float> %v) {
...
%vqrdmulh2.i = tail call <4 x i32> @llvm.aarch64.neon.sqrdmulh.v4i32(<4 x i32> %a, <4 x i32> %shuffle)		%vqrdmulh2.i = tail call <4 x i32> @llvm.aarch64.neon.sqrdmulh.v4i32(<4 x i32> %a, <4 x i32> %shuffle)
ret <4 x i32> %vqrdmulh2.i		ret <4 x i32> %vqrdmulh2.i
}		}

define <2 x float> @test_vmul_lane_f32(<2 x float> %a, <2 x float> %v) {		define <2 x float> @test_vmul_lane_f32(<2 x float> %a, <2 x float> %v) {
; CHECK-LABEL: test_vmul_lane_f32:		; CHECK-LABEL: test_vmul_lane_f32:
; CHECK: fmul {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[1]		; CHECK: fmul {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmul_lane_f32:
		; EXYNOS: dup [[x:v[0-9]+]].2s, {{v[0-9]+}}.s[1]
		; EXYNOS: fmul {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, [[x]].2s
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> <i32 1, i32 1>		%shuffle = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> <i32 1, i32 1>
%mul = fmul <2 x float> %shuffle, %a		%mul = fmul <2 x float> %shuffle, %a
ret <2 x float> %mul		ret <2 x float> %mul
}		}

define <1 x double> @test_vmul_lane_f64(<1 x double> %a, <1 x double> %v) {		define <1 x double> @test_vmul_lane_f64(<1 x double> %a, <1 x double> %v) {
; CHECK-LABEL: test_vmul_lane_f64:		; CHECK-LABEL: test_vmul_lane_f64:
; CHECK: fmul {{d[0-9]+}}, {{d[0-9]+}}, {{d[0-9]+}}		; CHECK: fmul {{d[0-9]+}}, {{d[0-9]+}}, {{d[0-9]+}}
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmul_lane_f64:
		; EXYNOS: fmul {{d[0-9]+}}, {{d[0-9]+}}, {{d[0-9]+}}
		; EXYNOS-NEXT: ret
entry:		entry:
%0 = bitcast <1 x double> %a to <8 x i8>		%0 = bitcast <1 x double> %a to <8 x i8>
%1 = bitcast <8 x i8> %0 to double		%1 = bitcast <8 x i8> %0 to double
%extract = extractelement <1 x double> %v, i32 0		%extract = extractelement <1 x double> %v, i32 0
%2 = fmul double %1, %extract		%2 = fmul double %1, %extract
%3 = insertelement <1 x double> undef, double %2, i32 0		%3 = insertelement <1 x double> undef, double %2, i32 0
ret <1 x double> %3		ret <1 x double> %3
}		}

define <4 x float> @test_vmulq_lane_f32(<4 x float> %a, <2 x float> %v) {		define <4 x float> @test_vmulq_lane_f32(<4 x float> %a, <2 x float> %v) {
; CHECK-LABEL: test_vmulq_lane_f32:		; CHECK-LABEL: test_vmulq_lane_f32:
; CHECK: fmul {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[1]		; CHECK: fmul {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulq_lane_f32:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[1]
		; EXYNOS: fmul {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <2 x float> %v, <2 x float> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>		%shuffle = shufflevector <2 x float> %v, <2 x float> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
%mul = fmul <4 x float> %shuffle, %a		%mul = fmul <4 x float> %shuffle, %a
ret <4 x float> %mul		ret <4 x float> %mul
}		}

define <2 x double> @test_vmulq_lane_f64(<2 x double> %a, <1 x double> %v) {		define <2 x double> @test_vmulq_lane_f64(<2 x double> %a, <1 x double> %v) {
; CHECK-LABEL: test_vmulq_lane_f64:		; CHECK-LABEL: test_vmulq_lane_f64:
; CHECK: fmul {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]		; CHECK: fmul {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulq_lane_f64:
		; EXYNOS: dup [[x:v[0-9]+]].2d, {{v[0-9]+}}.d[0]
		; EXYNOS: fmul {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, [[x]].2d
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <1 x double> %v, <1 x double> undef, <2 x i32> zeroinitializer		%shuffle = shufflevector <1 x double> %v, <1 x double> undef, <2 x i32> zeroinitializer
%mul = fmul <2 x double> %shuffle, %a		%mul = fmul <2 x double> %shuffle, %a
ret <2 x double> %mul		ret <2 x double> %mul
}		}

define <2 x float> @test_vmul_laneq_f32(<2 x float> %a, <4 x float> %v) {		define <2 x float> @test_vmul_laneq_f32(<2 x float> %a, <4 x float> %v) {
; CHECK-LABEL: test_vmul_laneq_f32:		; CHECK-LABEL: test_vmul_laneq_f32:
; CHECK: fmul {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[3]		; CHECK: fmul {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[3]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmul_laneq_f32:
		; EXYNOS: dup [[x:v[0-9]+]].2s, {{v[0-9]+}}.s[3]
		; EXYNOS: fmul {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, [[x]].2s
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <4 x float> %v, <4 x float> undef, <2 x i32> <i32 3, i32 3>		%shuffle = shufflevector <4 x float> %v, <4 x float> undef, <2 x i32> <i32 3, i32 3>
%mul = fmul <2 x float> %shuffle, %a		%mul = fmul <2 x float> %shuffle, %a
ret <2 x float> %mul		ret <2 x float> %mul
}		}

define <1 x double> @test_vmul_laneq_f64(<1 x double> %a, <2 x double> %v) {		define <1 x double> @test_vmul_laneq_f64(<1 x double> %a, <2 x double> %v) {
; CHECK-LABEL: test_vmul_laneq_f64:		; CHECK-LABEL: test_vmul_laneq_f64:
; CHECK: fmul {{d[0-9]+}}, {{d[0-9]+}}, {{v[0-9]+}}.d[1]		; CHECK: fmul {{d[0-9]+}}, {{d[0-9]+}}, {{v[0-9]+}}.d[1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmul_laneq_f64:
		; EXYNOS: fmul {{d[0-9]+}}, {{d[0-9]+}}, {{v[0-9]+}}.d[1]
		; EXYNOS-NEXT: ret
entry:		entry:
%0 = bitcast <1 x double> %a to <8 x i8>		%0 = bitcast <1 x double> %a to <8 x i8>
%1 = bitcast <8 x i8> %0 to double		%1 = bitcast <8 x i8> %0 to double
%extract = extractelement <2 x double> %v, i32 1		%extract = extractelement <2 x double> %v, i32 1
%2 = fmul double %1, %extract		%2 = fmul double %1, %extract
%3 = insertelement <1 x double> undef, double %2, i32 0		%3 = insertelement <1 x double> undef, double %2, i32 0
ret <1 x double> %3		ret <1 x double> %3
}		}

define <4 x float> @test_vmulq_laneq_f32(<4 x float> %a, <4 x float> %v) {		define <4 x float> @test_vmulq_laneq_f32(<4 x float> %a, <4 x float> %v) {
; CHECK-LABEL: test_vmulq_laneq_f32:		; CHECK-LABEL: test_vmulq_laneq_f32:
; CHECK: fmul {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[3]		; CHECK: fmul {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[3]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulq_laneq_f32:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[3]
		; EXYNOS: fmul {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>		%shuffle = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>
%mul = fmul <4 x float> %shuffle, %a		%mul = fmul <4 x float> %shuffle, %a
ret <4 x float> %mul		ret <4 x float> %mul
}		}

define <2 x double> @test_vmulq_laneq_f64(<2 x double> %a, <2 x double> %v) {		define <2 x double> @test_vmulq_laneq_f64(<2 x double> %a, <2 x double> %v) {
; CHECK-LABEL: test_vmulq_laneq_f64:		; CHECK-LABEL: test_vmulq_laneq_f64:
; CHECK: fmul {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[1]		; CHECK: fmul {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulq_laneq_f64:
		; EXYNOS: dup [[x:v[0-9]+]].2d, {{v[0-9]+}}.d[1]
		; EXYNOS: fmul {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, [[x]].2d
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <2 x double> %v, <2 x double> undef, <2 x i32> <i32 1, i32 1>		%shuffle = shufflevector <2 x double> %v, <2 x double> undef, <2 x i32> <i32 1, i32 1>
%mul = fmul <2 x double> %shuffle, %a		%mul = fmul <2 x double> %shuffle, %a
ret <2 x double> %mul		ret <2 x double> %mul
}		}

define <2 x float> @test_vmulx_lane_f32(<2 x float> %a, <2 x float> %v) {		define <2 x float> @test_vmulx_lane_f32(<2 x float> %a, <2 x float> %v) {
; CHECK-LABEL: test_vmulx_lane_f32:		; CHECK-LABEL: test_vmulx_lane_f32:
; CHECK: mulx {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[1]		; CHECK: mulx {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulx_lane_f32:
		; EXYNOS: dup [[x:v[0-9]+]].2s, {{v[0-9]+}}.s[1]
		; EXYNOS: mulx {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, [[x]].2s
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> <i32 1, i32 1>		%shuffle = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> <i32 1, i32 1>
%vmulx2.i = tail call <2 x float> @llvm.aarch64.neon.fmulx.v2f32(<2 x float> %a, <2 x float> %shuffle)		%vmulx2.i = tail call <2 x float> @llvm.aarch64.neon.fmulx.v2f32(<2 x float> %a, <2 x float> %shuffle)
ret <2 x float> %vmulx2.i		ret <2 x float> %vmulx2.i
}		}

define <4 x float> @test_vmulxq_lane_f32(<4 x float> %a, <2 x float> %v) {		define <4 x float> @test_vmulxq_lane_f32(<4 x float> %a, <2 x float> %v) {
; CHECK-LABEL: test_vmulxq_lane_f32:		; CHECK-LABEL: test_vmulxq_lane_f32:
; CHECK: mulx {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[1]		; CHECK: mulx {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulxq_lane_f32:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[1]
		; EXYNOS: mulx {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <2 x float> %v, <2 x float> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>		%shuffle = shufflevector <2 x float> %v, <2 x float> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
%vmulx2.i = tail call <4 x float> @llvm.aarch64.neon.fmulx.v4f32(<4 x float> %a, <4 x float> %shuffle)		%vmulx2.i = tail call <4 x float> @llvm.aarch64.neon.fmulx.v4f32(<4 x float> %a, <4 x float> %shuffle)
ret <4 x float> %vmulx2.i		ret <4 x float> %vmulx2.i
}		}

define <2 x double> @test_vmulxq_lane_f64(<2 x double> %a, <1 x double> %v) {		define <2 x double> @test_vmulxq_lane_f64(<2 x double> %a, <1 x double> %v) {
; CHECK-LABEL: test_vmulxq_lane_f64:		; CHECK-LABEL: test_vmulxq_lane_f64:
; CHECK: mulx {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]		; CHECK: mulx {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulxq_lane_f64:
		; EXYNOS: dup [[x:v[0-9]+]].2d, {{v[0-9]+}}.d[0]
		; EXYNOS: mulx {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, [[x]].2d
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <1 x double> %v, <1 x double> undef, <2 x i32> zeroinitializer		%shuffle = shufflevector <1 x double> %v, <1 x double> undef, <2 x i32> zeroinitializer
%vmulx2.i = tail call <2 x double> @llvm.aarch64.neon.fmulx.v2f64(<2 x double> %a, <2 x double> %shuffle)		%vmulx2.i = tail call <2 x double> @llvm.aarch64.neon.fmulx.v2f64(<2 x double> %a, <2 x double> %shuffle)
ret <2 x double> %vmulx2.i		ret <2 x double> %vmulx2.i
}		}

define <2 x float> @test_vmulx_laneq_f32(<2 x float> %a, <4 x float> %v) {		define <2 x float> @test_vmulx_laneq_f32(<2 x float> %a, <4 x float> %v) {
; CHECK-LABEL: test_vmulx_laneq_f32:		; CHECK-LABEL: test_vmulx_laneq_f32:
; CHECK: mulx {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[3]		; CHECK: mulx {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[3]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulx_laneq_f32:
		; EXYNOS: dup [[x:v[0-9]+]].2s, {{v[0-9]+}}.s[3]
		; EXYNOS: mulx {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, [[x]].2s
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <4 x float> %v, <4 x float> undef, <2 x i32> <i32 3, i32 3>		%shuffle = shufflevector <4 x float> %v, <4 x float> undef, <2 x i32> <i32 3, i32 3>
%vmulx2.i = tail call <2 x float> @llvm.aarch64.neon.fmulx.v2f32(<2 x float> %a, <2 x float> %shuffle)		%vmulx2.i = tail call <2 x float> @llvm.aarch64.neon.fmulx.v2f32(<2 x float> %a, <2 x float> %shuffle)
ret <2 x float> %vmulx2.i		ret <2 x float> %vmulx2.i
}		}

define <4 x float> @test_vmulxq_laneq_f32(<4 x float> %a, <4 x float> %v) {		define <4 x float> @test_vmulxq_laneq_f32(<4 x float> %a, <4 x float> %v) {
; CHECK-LABEL: test_vmulxq_laneq_f32:		; CHECK-LABEL: test_vmulxq_laneq_f32:
; CHECK: mulx {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[3]		; CHECK: mulx {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[3]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulxq_laneq_f32:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[3]
		; EXYNOS: mulx {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>		%shuffle = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>
%vmulx2.i = tail call <4 x float> @llvm.aarch64.neon.fmulx.v4f32(<4 x float> %a, <4 x float> %shuffle)		%vmulx2.i = tail call <4 x float> @llvm.aarch64.neon.fmulx.v4f32(<4 x float> %a, <4 x float> %shuffle)
ret <4 x float> %vmulx2.i		ret <4 x float> %vmulx2.i
}		}

define <2 x double> @test_vmulxq_laneq_f64(<2 x double> %a, <2 x double> %v) {		define <2 x double> @test_vmulxq_laneq_f64(<2 x double> %a, <2 x double> %v) {
; CHECK-LABEL: test_vmulxq_laneq_f64:		; CHECK-LABEL: test_vmulxq_laneq_f64:
; CHECK: mulx {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[1]		; CHECK: mulx {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[1]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulxq_laneq_f64:
		; EXYNOS: dup [[x:v[0-9]+]].2d, {{v[0-9]+}}.d[1]
		; EXYNOS: mulx {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, [[x]].2d
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <2 x double> %v, <2 x double> undef, <2 x i32> <i32 1, i32 1>		%shuffle = shufflevector <2 x double> %v, <2 x double> undef, <2 x i32> <i32 1, i32 1>
%vmulx2.i = tail call <2 x double> @llvm.aarch64.neon.fmulx.v2f64(<2 x double> %a, <2 x double> %shuffle)		%vmulx2.i = tail call <2 x double> @llvm.aarch64.neon.fmulx.v2f64(<2 x double> %a, <2 x double> %shuffle)
ret <2 x double> %vmulx2.i		ret <2 x double> %vmulx2.i
}		}

define <4 x i16> @test_vmla_lane_s16_0(<4 x i16> %a, <4 x i16> %b, <4 x i16> %v) {		define <4 x i16> @test_vmla_lane_s16_0(<4 x i16> %a, <4 x i16> %b, <4 x i16> %v) {
; CHECK-LABEL: test_vmla_lane_s16_0:		; CHECK-LABEL: test_vmla_lane_s16_0:
(330 unchanged lines not shown)
entry:		entry:
%mul = mul <4 x i32> %shuffle, %a		%mul = mul <4 x i32> %shuffle, %a
ret <4 x i32> %mul		ret <4 x i32> %mul
}		}

define <2 x float> @test_vfma_lane_f32_0(<2 x float> %a, <2 x float> %b, <2 x float> %v) {		define <2 x float> @test_vfma_lane_f32_0(<2 x float> %a, <2 x float> %b, <2 x float> %v) {
; CHECK-LABEL: test_vfma_lane_f32_0:		; CHECK-LABEL: test_vfma_lane_f32_0:
; CHECK: fmla {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[0]		; CHECK: fmla {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfma_lane_f32_0:
		; EXYNOS: dup [[x:v[0-9]+]].2s, {{v[0-9]+}}.s[0]
		; EXYNOS: fmla {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, [[x]].2s
		; EXYNOS-NEXT: ret
entry:		entry:
%lane = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> zeroinitializer		%lane = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> zeroinitializer
%0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)		%0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
ret <2 x float> %0		ret <2 x float> %0
}		}

define <4 x float> @test_vfmaq_lane_f32_0(<4 x float> %a, <4 x float> %b, <2 x float> %v) {		define <4 x float> @test_vfmaq_lane_f32_0(<4 x float> %a, <4 x float> %b, <2 x float> %v) {
; CHECK-LABEL: test_vfmaq_lane_f32_0:		; CHECK-LABEL: test_vfmaq_lane_f32_0:
; CHECK: fmla {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[0]		; CHECK: fmla {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmaq_lane_f32_0:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[0]
		; EXYNOS: fmla {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
entry:		entry:
%lane = shufflevector <2 x float> %v, <2 x float> undef, <4 x i32> zeroinitializer		%lane = shufflevector <2 x float> %v, <2 x float> undef, <4 x i32> zeroinitializer
%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane, <4 x float> %b, <4 x float> %a)		%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane, <4 x float> %b, <4 x float> %a)
ret <4 x float> %0		ret <4 x float> %0
}		}

define <2 x float> @test_vfma_laneq_f32_0(<2 x float> %a, <2 x float> %b, <4 x float> %v) {		define <2 x float> @test_vfma_laneq_f32_0(<2 x float> %a, <2 x float> %b, <4 x float> %v) {
; CHECK-LABEL: test_vfma_laneq_f32_0:		; CHECK-LABEL: test_vfma_laneq_f32_0:
; CHECK: fmla {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[0]		; CHECK: fmla {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfma_laneq_f32_0:
		; EXYNOS: dup [[x:v[0-9]+]].2s, {{v[0-9]+}}.s[0]
		; EXYNOS: fmla {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, [[x]].2s
		; EXYNOS-NEXT: ret
entry:		entry:
%lane = shufflevector <4 x float> %v, <4 x float> undef, <2 x i32> zeroinitializer		%lane = shufflevector <4 x float> %v, <4 x float> undef, <2 x i32> zeroinitializer
%0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)		%0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
ret <2 x float> %0		ret <2 x float> %0
}		}

define <4 x float> @test_vfmaq_laneq_f32_0(<4 x float> %a, <4 x float> %b, <4 x float> %v) {		define <4 x float> @test_vfmaq_laneq_f32_0(<4 x float> %a, <4 x float> %b, <4 x float> %v) {
; CHECK-LABEL: test_vfmaq_laneq_f32_0:		; CHECK-LABEL: test_vfmaq_laneq_f32_0:
; CHECK: fmla {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[0]		; CHECK: fmla {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmaq_laneq_f32_0:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[0]
		; EXYNOS: fmla {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
entry:		entry:
%lane = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> zeroinitializer		%lane = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> zeroinitializer
%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane, <4 x float> %b, <4 x float> %a)		%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane, <4 x float> %b, <4 x float> %a)
ret <4 x float> %0		ret <4 x float> %0
}		}

define <2 x float> @test_vfms_lane_f32_0(<2 x float> %a, <2 x float> %b, <2 x float> %v) {		define <2 x float> @test_vfms_lane_f32_0(<2 x float> %a, <2 x float> %b, <2 x float> %v) {
; CHECK-LABEL: test_vfms_lane_f32_0:		; CHECK-LABEL: test_vfms_lane_f32_0:
; CHECK: fmls {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[0]		; CHECK: fmls {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfms_lane_f32_0:
		; EXYNOS: dup [[x:v[0-9]+]].2s, {{v[0-9]+}}.s[0]
		; EXYNOS: fmls {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, [[x]].2s
		; EXYNOS-NEXT: ret
entry:		entry:
%sub = fsub <2 x float> <float -0.000000e+00, float -0.000000e+00>, %v		%sub = fsub <2 x float> <float -0.000000e+00, float -0.000000e+00>, %v
%lane = shufflevector <2 x float> %sub, <2 x float> undef, <2 x i32> zeroinitializer		%lane = shufflevector <2 x float> %sub, <2 x float> undef, <2 x i32> zeroinitializer
%0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)		%0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
ret <2 x float> %0		ret <2 x float> %0
}		}

define <4 x float> @test_vfmsq_lane_f32_0(<4 x float> %a, <4 x float> %b, <2 x float> %v) {		define <4 x float> @test_vfmsq_lane_f32_0(<4 x float> %a, <4 x float> %b, <2 x float> %v) {
; CHECK-LABEL: test_vfmsq_lane_f32_0:		; CHECK-LABEL: test_vfmsq_lane_f32_0:
; CHECK: fmls {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[0]		; CHECK: fmls {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmsq_lane_f32_0:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[0]
		; EXYNOS: fmls {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
entry:		entry:
%sub = fsub <2 x float> <float -0.000000e+00, float -0.000000e+00>, %v		%sub = fsub <2 x float> <float -0.000000e+00, float -0.000000e+00>, %v
%lane = shufflevector <2 x float> %sub, <2 x float> undef, <4 x i32> zeroinitializer		%lane = shufflevector <2 x float> %sub, <2 x float> undef, <4 x i32> zeroinitializer
%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane, <4 x float> %b, <4 x float> %a)		%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane, <4 x float> %b, <4 x float> %a)
ret <4 x float> %0		ret <4 x float> %0
}		}

define <2 x float> @test_vfms_laneq_f32_0(<2 x float> %a, <2 x float> %b, <4 x float> %v) {		define <2 x float> @test_vfms_laneq_f32_0(<2 x float> %a, <2 x float> %b, <4 x float> %v) {
; CHECK-LABEL: test_vfms_laneq_f32_0:		; CHECK-LABEL: test_vfms_laneq_f32_0:
; CHECK: fmls {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[0]		; CHECK: fmls {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfms_laneq_f32_0:
		; EXYNOS: dup [[x:v[0-9]+]].2s, {{v[0-9]+}}.s[0]
		; EXYNOS: fmls {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, [[x]].2s
		; EXYNOS-NEXT: ret
entry:		entry:
%sub = fsub <4 x float> <float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00>, %v		%sub = fsub <4 x float> <float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00>, %v
%lane = shufflevector <4 x float> %sub, <4 x float> undef, <2 x i32> zeroinitializer		%lane = shufflevector <4 x float> %sub, <4 x float> undef, <2 x i32> zeroinitializer
%0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)		%0 = tail call <2 x float> @llvm.fma.v2f32(<2 x float> %lane, <2 x float> %b, <2 x float> %a)
ret <2 x float> %0		ret <2 x float> %0
}		}

define <4 x float> @test_vfmsq_laneq_f32_0(<4 x float> %a, <4 x float> %b, <4 x float> %v) {		define <4 x float> @test_vfmsq_laneq_f32_0(<4 x float> %a, <4 x float> %b, <4 x float> %v) {
; CHECK-LABEL: test_vfmsq_laneq_f32_0:		; CHECK-LABEL: test_vfmsq_laneq_f32_0:
; CHECK: fmls {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[0]		; CHECK: fmls {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmsq_laneq_f32_0:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[0]
		; EXYNOS: fmls {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
entry:		entry:
%sub = fsub <4 x float> <float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00>, %v		%sub = fsub <4 x float> <float -0.000000e+00, float -0.000000e+00, float -0.000000e+00, float -0.000000e+00>, %v
%lane = shufflevector <4 x float> %sub, <4 x float> undef, <4 x i32> zeroinitializer		%lane = shufflevector <4 x float> %sub, <4 x float> undef, <4 x i32> zeroinitializer
%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane, <4 x float> %b, <4 x float> %a)		%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane, <4 x float> %b, <4 x float> %a)
ret <4 x float> %0		ret <4 x float> %0
}		}

define <2 x double> @test_vfmaq_laneq_f64_0(<2 x double> %a, <2 x double> %b, <2 x double> %v) {		define <2 x double> @test_vfmaq_laneq_f64_0(<2 x double> %a, <2 x double> %b, <2 x double> %v) {
; CHECK-LABEL: test_vfmaq_laneq_f64_0:		; CHECK-LABEL: test_vfmaq_laneq_f64_0:
; CHECK: fmla {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]		; CHECK: fmla {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmaq_laneq_f64_0:
		; EXYNOS: dup [[x:v[0-9]+]].2d, {{v[0-9]+}}.d[0]
		; EXYNOS: fmla {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, [[x]].2d
		; EXYNOS-NEXT: ret
entry:		entry:
%lane = shufflevector <2 x double> %v, <2 x double> undef, <2 x i32> zeroinitializer		%lane = shufflevector <2 x double> %v, <2 x double> undef, <2 x i32> zeroinitializer
%0 = tail call <2 x double> @llvm.fma.v2f64(<2 x double> %lane, <2 x double> %b, <2 x double> %a)		%0 = tail call <2 x double> @llvm.fma.v2f64(<2 x double> %lane, <2 x double> %b, <2 x double> %a)
ret <2 x double> %0		ret <2 x double> %0
}		}

define <2 x double> @test_vfmsq_laneq_f64_0(<2 x double> %a, <2 x double> %b, <2 x double> %v) {		define <2 x double> @test_vfmsq_laneq_f64_0(<2 x double> %a, <2 x double> %b, <2 x double> %v) {
; CHECK-LABEL: test_vfmsq_laneq_f64_0:		; CHECK-LABEL: test_vfmsq_laneq_f64_0:
; CHECK: fmls {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]		; CHECK: fmls {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vfmsq_laneq_f64_0:
		; EXYNOS: dup [[x:v[0-9]+]].2d, {{v[0-9]+}}.d[0]
		; EXYNOS: fmls {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, [[x]].2d
		; EXYNOS-NEXT: ret
entry:		entry:
%sub = fsub <2 x double> <double -0.000000e+00, double -0.000000e+00>, %v		%sub = fsub <2 x double> <double -0.000000e+00, double -0.000000e+00>, %v
%lane = shufflevector <2 x double> %sub, <2 x double> undef, <2 x i32> zeroinitializer		%lane = shufflevector <2 x double> %sub, <2 x double> undef, <2 x i32> zeroinitializer
%0 = tail call <2 x double> @llvm.fma.v2f64(<2 x double> %lane, <2 x double> %b, <2 x double> %a)		%0 = tail call <2 x double> @llvm.fma.v2f64(<2 x double> %lane, <2 x double> %b, <2 x double> %a)
ret <2 x double> %0		ret <2 x double> %0
}		}
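None of the fms tests calls an fms intrinsic. Instead, they negate `%v` with `fsub -0.0` and pass the shuffled lane to `llvm.fma`, and both CHECK and EXYNOS expect `fmls`. In the Exynos split, the `dup` broadcasts the original, un-negated lane, because `fmls` itself computes `a - b * x`. The C++ sketch below models both shapes element-wise with `std::fma` to show that they agree. The helper names are illustrative only, and the model covers the arithmetic, not instruction selection.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

template <std::size_t N>
using Vec = std::array<float, N>;

// dup: broadcast lane `lane` of `v` into all N elements.
template <std::size_t N, std::size_t M>
Vec<N> dupLane(const Vec<M> &v, std::size_t lane) {
  Vec<N> r{};
  r.fill(v[lane]);
  return r;
}

// IR shape of the fms tests: fma(splat(-v[lane]), b, a), element-wise.
template <std::size_t N, std::size_t M>
Vec<N> fmsAsInIR(const Vec<N> &a, const Vec<N> &b, const Vec<M> &v, std::size_t lane) {
  Vec<N> r{};
  for (std::size_t i = 0; i < N; ++i)
    r[i] = std::fma(-v[lane], b[i], a[i]);
  return r;
}

// Exynos shape: dup the un-negated lane, then vector fmls, i.e. a - b * x fused.
template <std::size_t N, std::size_t M>
Vec<N> fmlsAfterDup(const Vec<N> &a, const Vec<N> &b, const Vec<M> &v, std::size_t lane) {
  const Vec<N> x = dupLane<N>(v, lane);
  Vec<N> r{};
  for (std::size_t i = 0; i < N; ++i)
    r[i] = std::fma(-b[i], x[i], a[i]);
  return r;
}
```

Negation is exact and the multiply is commutative, so both forms round once on the same exact value and give bit-identical results.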

define <4 x i32> @test_vmlal_lane_s16_0(<4 x i32> %a, <4 x i16> %b, <4 x i16> %v) {		define <4 x i32> @test_vmlal_lane_s16_0(<4 x i32> %a, <4 x i16> %b, <4 x i16> %v) {
(787 unchanged lines not shown)
entry:		entry:
%vqrdmulh2.i = tail call <4 x i32> @llvm.aarch64.neon.sqrdmulh.v4i32(<4 x i32> %a, <4 x i32> %shuffle)		%vqrdmulh2.i = tail call <4 x i32> @llvm.aarch64.neon.sqrdmulh.v4i32(<4 x i32> %a, <4 x i32> %shuffle)
ret <4 x i32> %vqrdmulh2.i		ret <4 x i32> %vqrdmulh2.i
}		}

define <2 x float> @test_vmul_lane_f32_0(<2 x float> %a, <2 x float> %v) {		define <2 x float> @test_vmul_lane_f32_0(<2 x float> %a, <2 x float> %v) {
; CHECK-LABEL: test_vmul_lane_f32_0:		; CHECK-LABEL: test_vmul_lane_f32_0:
; CHECK: fmul {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[0]		; CHECK: fmul {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmul_lane_f32_0:
		; EXYNOS: dup [[x:v[0-9]+]].2s, {{v[0-9]+}}.s[0]
		; EXYNOS: fmul {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, [[x]].2s
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> zeroinitializer		%shuffle = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> zeroinitializer
%mul = fmul <2 x float> %shuffle, %a		%mul = fmul <2 x float> %shuffle, %a
ret <2 x float> %mul		ret <2 x float> %mul
}		}

define <4 x float> @test_vmulq_lane_f32_0(<4 x float> %a, <2 x float> %v) {		define <4 x float> @test_vmulq_lane_f32_0(<4 x float> %a, <2 x float> %v) {
; CHECK-LABEL: test_vmulq_lane_f32_0:		; CHECK-LABEL: test_vmulq_lane_f32_0:
; CHECK: fmul {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[0]		; CHECK: fmul {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulq_lane_f32_0:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[0]
		; EXYNOS: fmul {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <2 x float> %v, <2 x float> undef, <4 x i32> zeroinitializer		%shuffle = shufflevector <2 x float> %v, <2 x float> undef, <4 x i32> zeroinitializer
%mul = fmul <4 x float> %shuffle, %a		%mul = fmul <4 x float> %shuffle, %a
ret <4 x float> %mul		ret <4 x float> %mul
}		}

define <2 x float> @test_vmul_laneq_f32_0(<2 x float> %a, <4 x float> %v) {		define <2 x float> @test_vmul_laneq_f32_0(<2 x float> %a, <4 x float> %v) {
; CHECK-LABEL: test_vmul_laneq_f32_0:		; CHECK-LABEL: test_vmul_laneq_f32_0:
; CHECK: fmul {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[0]		; CHECK: fmul {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmul_laneq_f32_0:
		; EXYNOS: dup [[x:v[0-9]+]].2s, {{v[0-9]+}}.s[0]
		; EXYNOS: fmul {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, [[x]].2s
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <4 x float> %v, <4 x float> undef, <2 x i32> zeroinitializer		%shuffle = shufflevector <4 x float> %v, <4 x float> undef, <2 x i32> zeroinitializer
%mul = fmul <2 x float> %shuffle, %a		%mul = fmul <2 x float> %shuffle, %a
ret <2 x float> %mul		ret <2 x float> %mul
}		}

define <1 x double> @test_vmul_laneq_f64_0(<1 x double> %a, <2 x double> %v) {		define <1 x double> @test_vmul_laneq_f64_0(<1 x double> %a, <2 x double> %v) {
; CHECK-LABEL: test_vmul_laneq_f64_0:		; CHECK-LABEL: test_vmul_laneq_f64_0:
; CHECK: fmul {{d[0-9]+}}, {{d[0-9]+}}, {{v[0-9]+}}.d[0]		; CHECK: fmul {{d[0-9]+}}, {{d[0-9]+}}, {{v[0-9]+}}.d[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmul_laneq_f64_0:
		; EXYNOS: fmul {{d[0-9]+}}, {{d[0-9]+}}, {{v[0-9]+}}.d[0]
		; EXYNOS-NEXT: ret
entry:		entry:
%0 = bitcast <1 x double> %a to <8 x i8>		%0 = bitcast <1 x double> %a to <8 x i8>
%1 = bitcast <8 x i8> %0 to double		%1 = bitcast <8 x i8> %0 to double
%extract = extractelement <2 x double> %v, i32 0		%extract = extractelement <2 x double> %v, i32 0
%2 = fmul double %1, %extract		%2 = fmul double %1, %extract
%3 = insertelement <1 x double> undef, double %2, i32 0		%3 = insertelement <1 x double> undef, double %2, i32 0
ret <1 x double> %3		ret <1 x double> %3
}		}
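`test_vmul_laneq_f64_0` is the only case in this hunk whose EXYNOS check keeps the by-element operand. Its result uses the scalar `d` form, which the checks leave unchanged. The filter below is a hypothetical, string-based sketch of which instructions these tests expect to be split. It assumes the mnemonic set named in the summary (fmla/fmls/fmul/fmulx) plus a vector destination. It is not the pass's actual opcode list.

```cpp
#include <set>
#include <string>

// Hypothetical filter: true if the tests expect this by-element
// instruction to be split into dup + vector form on Exynos.
bool isRewriteCandidate(const std::string &mnemonic, const std::string &dstOperand) {
  static const std::set<std::string> kOps = {"fmla", "fmls", "fmul", "fmulx"};
  if (kOps.count(mnemonic) == 0)
    return false;
  // Vector destinations carry an arrangement suffix ("v0.2d");
  // scalar by-element forms ("d0") are left alone.
  return dstOperand.find('.') != std::string::npos;
}
```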

define <4 x float> @test_vmulq_laneq_f32_0(<4 x float> %a, <4 x float> %v) {		define <4 x float> @test_vmulq_laneq_f32_0(<4 x float> %a, <4 x float> %v) {
; CHECK-LABEL: test_vmulq_laneq_f32_0:		; CHECK-LABEL: test_vmulq_laneq_f32_0:
; CHECK: fmul {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[0]		; CHECK: fmul {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulq_laneq_f32_0:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[0]
		; EXYNOS: fmul {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> zeroinitializer		%shuffle = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> zeroinitializer
%mul = fmul <4 x float> %shuffle, %a		%mul = fmul <4 x float> %shuffle, %a
ret <4 x float> %mul		ret <4 x float> %mul
}		}

define <2 x double> @test_vmulq_laneq_f64_0(<2 x double> %a, <2 x double> %v) {		define <2 x double> @test_vmulq_laneq_f64_0(<2 x double> %a, <2 x double> %v) {
; CHECK-LABEL: test_vmulq_laneq_f64_0:		; CHECK-LABEL: test_vmulq_laneq_f64_0:
; CHECK: fmul {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]		; CHECK: fmul {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulq_laneq_f64_0:
		; EXYNOS: dup [[x:v[0-9]+]].2d, {{v[0-9]+}}.d[0]
		; EXYNOS: fmul {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, [[x]].2d
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <2 x double> %v, <2 x double> undef, <2 x i32> zeroinitializer		%shuffle = shufflevector <2 x double> %v, <2 x double> undef, <2 x i32> zeroinitializer
%mul = fmul <2 x double> %shuffle, %a		%mul = fmul <2 x double> %shuffle, %a
ret <2 x double> %mul		ret <2 x double> %mul
}		}

define <2 x float> @test_vmulx_lane_f32_0(<2 x float> %a, <2 x float> %v) {		define <2 x float> @test_vmulx_lane_f32_0(<2 x float> %a, <2 x float> %v) {
; CHECK-LABEL: test_vmulx_lane_f32_0:		; CHECK-LABEL: test_vmulx_lane_f32_0:
; CHECK: mulx {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[0]		; CHECK: mulx {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulx_lane_f32_0:
		; EXYNOS: dup [[x:v[0-9]+]].2s, {{v[0-9]+}}.s[0]
		; EXYNOS: mulx {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, [[x]].2s
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> zeroinitializer		%shuffle = shufflevector <2 x float> %v, <2 x float> undef, <2 x i32> zeroinitializer
%vmulx2.i = tail call <2 x float> @llvm.aarch64.neon.fmulx.v2f32(<2 x float> %a, <2 x float> %shuffle)		%vmulx2.i = tail call <2 x float> @llvm.aarch64.neon.fmulx.v2f32(<2 x float> %a, <2 x float> %shuffle)
ret <2 x float> %vmulx2.i		ret <2 x float> %vmulx2.i
}		}

define <4 x float> @test_vmulxq_lane_f32_0(<4 x float> %a, <2 x float> %v) {		define <4 x float> @test_vmulxq_lane_f32_0(<4 x float> %a, <2 x float> %v) {
; CHECK-LABEL: test_vmulxq_lane_f32_0:		; CHECK-LABEL: test_vmulxq_lane_f32_0:
; CHECK: mulx {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[0]		; CHECK: mulx {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulxq_lane_f32_0:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[0]
		; EXYNOS: mulx {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <2 x float> %v, <2 x float> undef, <4 x i32> zeroinitializer		%shuffle = shufflevector <2 x float> %v, <2 x float> undef, <4 x i32> zeroinitializer
%vmulx2.i = tail call <4 x float> @llvm.aarch64.neon.fmulx.v4f32(<4 x float> %a, <4 x float> %shuffle)		%vmulx2.i = tail call <4 x float> @llvm.aarch64.neon.fmulx.v4f32(<4 x float> %a, <4 x float> %shuffle)
ret <4 x float> %vmulx2.i		ret <4 x float> %vmulx2.i
}		}

define <2 x double> @test_vmulxq_lane_f64_0(<2 x double> %a, <1 x double> %v) {		define <2 x double> @test_vmulxq_lane_f64_0(<2 x double> %a, <1 x double> %v) {
; CHECK-LABEL: test_vmulxq_lane_f64_0:		; CHECK-LABEL: test_vmulxq_lane_f64_0:
; CHECK: mulx {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]		; CHECK: mulx {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulxq_lane_f64_0:
		; EXYNOS: dup [[x:v[0-9]+]].2d, {{v[0-9]+}}.d[0]
		; EXYNOS: mulx {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, [[x]].2d
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <1 x double> %v, <1 x double> undef, <2 x i32> zeroinitializer		%shuffle = shufflevector <1 x double> %v, <1 x double> undef, <2 x i32> zeroinitializer
%vmulx2.i = tail call <2 x double> @llvm.aarch64.neon.fmulx.v2f64(<2 x double> %a, <2 x double> %shuffle)		%vmulx2.i = tail call <2 x double> @llvm.aarch64.neon.fmulx.v2f64(<2 x double> %a, <2 x double> %shuffle)
ret <2 x double> %vmulx2.i		ret <2 x double> %vmulx2.i
}		}

define <2 x float> @test_vmulx_laneq_f32_0(<2 x float> %a, <4 x float> %v) {		define <2 x float> @test_vmulx_laneq_f32_0(<2 x float> %a, <4 x float> %v) {
; CHECK-LABEL: test_vmulx_laneq_f32_0:		; CHECK-LABEL: test_vmulx_laneq_f32_0:
; CHECK: mulx {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[0]		; CHECK: mulx {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, {{v[0-9]+}}.s[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulx_laneq_f32_0:
		; EXYNOS: dup [[x:v[0-9]+]].2s, {{v[0-9]+}}.s[0]
		; EXYNOS: mulx {{v[0-9]+}}.2s, {{v[0-9]+}}.2s, [[x]].2s
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <4 x float> %v, <4 x float> undef, <2 x i32> zeroinitializer		%shuffle = shufflevector <4 x float> %v, <4 x float> undef, <2 x i32> zeroinitializer
%vmulx2.i = tail call <2 x float> @llvm.aarch64.neon.fmulx.v2f32(<2 x float> %a, <2 x float> %shuffle)		%vmulx2.i = tail call <2 x float> @llvm.aarch64.neon.fmulx.v2f32(<2 x float> %a, <2 x float> %shuffle)
ret <2 x float> %vmulx2.i		ret <2 x float> %vmulx2.i
}		}

define <4 x float> @test_vmulxq_laneq_f32_0(<4 x float> %a, <4 x float> %v) {		define <4 x float> @test_vmulxq_laneq_f32_0(<4 x float> %a, <4 x float> %v) {
; CHECK-LABEL: test_vmulxq_laneq_f32_0:		; CHECK-LABEL: test_vmulxq_laneq_f32_0:
; CHECK: mulx {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[0]		; CHECK: mulx {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulxq_laneq_f32_0:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[0]
		; EXYNOS: mulx {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> zeroinitializer		%shuffle = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> zeroinitializer
%vmulx2.i = tail call <4 x float> @llvm.aarch64.neon.fmulx.v4f32(<4 x float> %a, <4 x float> %shuffle)		%vmulx2.i = tail call <4 x float> @llvm.aarch64.neon.fmulx.v4f32(<4 x float> %a, <4 x float> %shuffle)
ret <4 x float> %vmulx2.i		ret <4 x float> %vmulx2.i
}		}

define <2 x double> @test_vmulxq_laneq_f64_0(<2 x double> %a, <2 x double> %v) {		define <2 x double> @test_vmulxq_laneq_f64_0(<2 x double> %a, <2 x double> %v) {
; CHECK-LABEL: test_vmulxq_laneq_f64_0:		; CHECK-LABEL: test_vmulxq_laneq_f64_0:
; CHECK: mulx {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]		; CHECK: mulx {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, {{v[0-9]+}}.d[0]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
		; EXYNOS-LABEL: test_vmulxq_laneq_f64_0:
		; EXYNOS: dup [[x:v[0-9]+]].2d, {{v[0-9]+}}.d[0]
		; EXYNOS: mulx {{v[0-9]+}}.2d, {{v[0-9]+}}.2d, [[x]].2d
		; EXYNOS-NEXT: ret
entry:		entry:
%shuffle = shufflevector <2 x double> %v, <2 x double> undef, <2 x i32> zeroinitializer		%shuffle = shufflevector <2 x double> %v, <2 x double> undef, <2 x i32> zeroinitializer
%vmulx2.i = tail call <2 x double> @llvm.aarch64.neon.fmulx.v2f64(<2 x double> %a, <2 x double> %shuffle)		%vmulx2.i = tail call <2 x double> @llvm.aarch64.neon.fmulx.v2f64(<2 x double> %a, <2 x double> %shuffle)
ret <2 x double> %vmulx2.i		ret <2 x double> %vmulx2.i
}		}

		define <4 x float> @optimize_dup(<4 x float> %a, <4 x float> %b, <4 x float> %c, <4 x float> %v) {
		; CHECK-LABEL: optimize_dup:
		; CHECK: fmla {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[3]
		; CHECK: fmls {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[3]
		; CHECK-NEXT: ret
		; EXYNOS-LABEL: optimize_dup:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[3]
		; EXYNOS: fmla {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS: fmls {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS-NEXT: ret
		entry:
		%lane1 = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>
		%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane1, <4 x float> %b, <4 x float> %a)
		%lane2 = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>
		%1 = fmul <4 x float> %lane2, %c
		%s = fsub <4 x float> %0, %1
		ret <4 x float> %s
		}
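`optimize_dup` expects a single `dup` to feed both the `fmla` and the `fmls`, so the two instructions share the cost of the broadcast. Whether splitting pays off can then be framed as an amortized comparison, in the spirit of the cost-model suggestion in the review. This sketch is hypothetical in three ways. First, `Latencies` and `shouldSplit` are illustrative names. Second, any numbers plugged into them are placeholders, not Exynos measurements. Third, summing latencies ignores overlap between independent instructions.

```cpp
// Placeholder per-instruction costs in cycles; NOT measured Exynos figures.
struct Latencies {
  int indexed;  // e.g. fmla v.4s, v.4s, v.s[i]
  int dup;      // dup v.4s, v.s[i]
  int vector;   // fmla v.4s, v.4s, v.4s
};

// Split `uses` by-element instructions that read the same lane into one
// shared dup plus `uses` vector instructions only if that is strictly cheaper.
bool shouldSplit(const Latencies &lat, int uses) {
  return lat.dup + uses * lat.vector < uses * lat.indexed;
}
```

With `{5, 2, 4}`, splitting loses for one or two uses and wins from three uses on. With `{7, 2, 4}`, it wins even for a single use.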

		define <4 x float> @no_optimize_dup(<4 x float> %a, <4 x float> %b, <4 x float> %c, <4 x float> %v) {
		; CHECK-LABEL: no_optimize_dup:
		; CHECK: fmla {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[3]
		; CHECK: fmls {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, {{v[0-9]+}}.s[1]
		; CHECK-NEXT: ret
		; EXYNOS-LABEL: no_optimize_dup:
		; EXYNOS: dup [[x:v[0-9]+]].4s, {{v[0-9]+}}.s[3]
		; EXYNOS: fmla {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[x]].4s
		; EXYNOS: dup [[y:v[0-9]+]].4s, {{v[0-9]+}}.s[1]
		; EXYNOS: fmls {{v[0-9]+}}.4s, {{v[0-9]+}}.4s, [[y]].4s
		; EXYNOS-NEXT: ret
		entry:
		%lane1 = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 3, i32 3, i32 3, i32 3>
		%0 = tail call <4 x float> @llvm.fma.v4f32(<4 x float> %lane1, <4 x float> %b, <4 x float> %a)
		%lane2 = shufflevector <4 x float> %v, <4 x float> undef, <4 x i32> <i32 1, i32 1, i32 1, i32 1>
		%1 = fmul <4 x float> %lane2, %c
		%s = fsub <4 x float> %0, %1
		ret <4 x float> %s
		}
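`optimize_dup` and `no_optimize_dup` differ only in whether both multiplies read the same lane. In the first case one `dup` is reused. In the second, each lane gets its own `dup`. Below is a hedged sketch of that reuse. It assumes a cache keyed by (source register, lane, arrangement) within straight-line code. The keying scheme, `DupCache`, and the scratch-register numbering are illustrative assumptions, not the pass's actual mechanism. The sketch also ignores the case where the source register is redefined between uses.

```cpp
#include <map>
#include <string>
#include <tuple>
#include <vector>

// A broadcast is identified by source register, lane and destination arrangement.
using DupKey = std::tuple<std::string, int, std::string>;

class DupCache {
 public:
  // Return the register holding srcReg[lane] broadcast to `arrangement`,
  // appending a new dup to `out` only the first time a key is seen.
  std::string get(const std::string &srcReg, int lane,
                  const std::string &arrangement,
                  std::vector<std::string> &out) {
    const DupKey key{srcReg, lane, arrangement};
    auto it = cache_.find(key);
    if (it != cache_.end())
      return it->second;  // same lane already broadcast: reuse it
    const std::string tmp = "v" + std::to_string(nextTmp_++);
    const std::string elem = arrangement.substr(arrangement.size() - 1);
    out.push_back("dup " + tmp + "." + arrangement + ", " + srcReg + "." + elem +
                  "[" + std::to_string(lane) + "]");
    cache_.emplace(key, tmp);
    return tmp;
  }

 private:
  std::map<DupKey, std::string> cache_;
  int nextTmp_ = 16;  // hypothetical scratch register numbering
};
```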


Revision Contents

Diff 74035

llvm/trunk/lib/Target/AArch64/AArch64.h

llvm/trunk/lib/Target/AArch64/AArch64TargetMachine.cpp

llvm/trunk/lib/Target/AArch64/AArch64VectorByElementOpt.cpp

llvm/trunk/lib/Target/AArch64/CMakeLists.txt

llvm/trunk/test/CodeGen/AArch64/arm64-neon-2velem.ll