This is an archive of the discontinued LLVM Phabricator instance.

Differential D99723

[ARM] Transforming memcpy to Tail predicated Loop
ClosedPublic

Authored by malharJ on Apr 1 2021, 6:02 AM.

Details

Reviewers

dmgreen
SjoerdMeijer

Commits

rG9ff38e2d9dd7: [ARM] Transforming memcpy to Tail predicated Loop
rGb856f4a232cb: [ARM] Transforming memcpy to Tail predicated Loop

Summary

This patch converts the llvm.memcpy intrinsic into a tail-predicated
hardware loop for targets that support the Arm M-profile
Vector Extension (MVE).

From an implementation point of view, the patch

adds an ARM-specific SDAG node, to which the llvm.memcpy intrinsic is lowered during the first phase of ISel;
adds a corresponding TableGen entry that generates a pseudo instruction, with a custom inserter, on matching the above node;
adds a custom inserter function that expands the pseudo instruction into MIR suitable for later passes to convert into a WLSTP loop.
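As a rough illustration of what a tail-predicated copy loop computes, the sketch below models it in scalar C++. It is not the code from the patch: the function names, the 16-byte lane count (one 128-bit MVE vector of bytes), and the ceil(n / 16) iteration count are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of a tail-predicated byte copy (not the patch's code).
// Assumption: one vector holds 16 byte lanes.
constexpr std::size_t kLanes = 16;

// Iterations needed to cover n bytes: ceil(n / 16).
std::size_t iterationCount(std::size_t n) { return (n + kLanes - 1) / kLanes; }

// Copies n bytes and returns the number of loop iterations executed.
std::size_t tailPredicatedCopy(std::uint8_t *dst, const std::uint8_t *src,
                               std::size_t n) {
  std::size_t remaining = n;
  std::size_t offset = 0;
  std::size_t iterations = 0;
  // Models a while-loop start: the body is skipped entirely when n == 0.
  for (std::size_t i = iterationCount(n); i != 0; --i) {
    // Models the predicate: only the first min(remaining, 16) lanes are active,
    // so the final iteration never touches bytes past the end of the buffers.
    std::size_t active = remaining < kLanes ? remaining : kLanes;
    for (std::size_t lane = 0; lane < active; ++lane)
      dst[offset + lane] = src[offset + lane];
    offset += active;
    remaining -= active;
    ++iterations;
  }
  assert(remaining == 0 && "every byte should have been copied");
  return iterations;
}
```

With n = 37 this model runs three iterations, the last one with only five active lanes, so no separate scalar tail loop is needed.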

Diff Detail

Repository: rG LLVM Github Monorepo

Event Timeline

malharJ created this revision. Apr 1 2021, 6:02 AM

Herald added subscribers: hiraditya, kristof.beyls. Apr 1 2021, 6:02 AM

malharJ requested review of this revision. Apr 1 2021, 6:02 AM

Herald added a project: Restricted Project. Apr 1 2021, 6:02 AM

Herald added a subscriber: llvm-commits.

lebedev.ri retitled this revision from Transforming memcpy to Tail predicated Loop to [ARM] Transforming memcpy to Tail predicated Loop. Apr 1 2021, 6:07 AM

Herald added a subscriber: danielkiss. Apr 1 2021, 6:07 AM

I know you've worked on this for a while and investigated different strategies, but I think we also need to argue here why we would like to emit a memcpy loop instead of e.g. having optimised versions in the clib. In other words, is this the best we can do for all different alignments, sizes, etc.?

Harbormaster completed remote builds in B96699: Diff 334666. Apr 1 2021, 7:14 AM

Added some comments to better illustrate transform.
Also renamed some variables to maintain consistency.

dmgreen added inline comments. Apr 1 2021, 8:18 AM

llvm/lib/Target/ARM/ARMISelLowering.cpp
11103	Can you look into all these clang-tidy errors. LLVM usually uses CamelCase for variable names, and the MBB_ looks a little odd. They should start with a capital and I would drop the "t2", that's not adding much.
llvm/lib/Target/ARM/ARMInstrMVE.td
6874	Can you improve this formatting. If you clang-formatted it, it doesn't do well with .td files and manually making it look more like the others will do better.
llvm/test/CodeGen/Thumb2/LowOverheadLoops/memcall.ll
21	Why does this not use r3 directly?
llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
35 ↗	(On Diff #334684)	What is this generating r3 for? I thought those should be removed.
55 ↗	(On Diff #334684)	Why is this using printf? It looks like an execution test, not a unit test. Is it testing anything specifically? If so it can probably use any call, not a variadic version of printf.
63 ↗	(On Diff #334684)	Remove hidden and local_unnamed_addr #0

Harbormaster completed remote builds in B96713: Diff 334684. Apr 1 2021, 8:24 AM

Addressed comments (review comments + clang-tidy and clang-format fixes)

Updated transform to not generate preHeader block due to issues with phi-node-elimination pass placing copy/movs in the generated preHeader.

Details provided in comment below.

malharJ added inline comments. Apr 5 2021, 1:40 AM

llvm/lib/Target/ARM/ARMInstrMVE.td
6874	I had clang-formatted it, but yeah, it doesn't look good. Updated now.
llvm/test/CodeGen/Thumb2/LowOverheadLoops/memcall.ll
21	This seems to be an issue with generating a preHeader during the transform. The phi-node-elimination pass lowers the PHI instructions (in the TP loopBody) to COPY operations placed in the preHeader. In this instance, the copy/mov can be seen below on line 32: `mov r7, r3`. I've fixed this issue for now by not generating an extra preHeader during the transform, so the mov ends up above the t2WhileLoopStartLR and overall it seems to work. Please see my comment about the latest changes for more details on this.
llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
35 ↗	(On Diff #334684)	Good spot, I think this is because DCE was not happening for the instructions calculating iterationCount. I had a quick look at ARMLowOverheadLoops::IterationCountDCE(), and it seems that the expectation from the generated MIR is:
	// $lr = big-itercount-expression
	// ..
	// $lr = t2DoLoopStart/t2WhileLoopStartLR renamable $lr
	// vector.body:
	I've updated the final expression (in the generated MIR) calculating the iteration count to now return the result in LR (earlier it was returning it in one of the rGPRs), and these instructions are now getting removed.
55 ↗	(On Diff #334684)	So the intent of the test was just to check whether code surrounding memcpy call site is properly transformed. I simply used printf to prevent the code from getting optimized away (seems like a poor way now that I think about it). I've removed this test now since the transformation involving nested loops (in memcall.ll) is already testing the mentioned intent.

So I've updated transform to not generate a preHeader block as there seems to be an issue
when generating a preHeader during the transform:

The issue:

The phi-node-elimination pass introduces COPY operations (for each PHI instruction in the TP loopBody) into the preHeader.

While most of them get removed by the simple-register-coalescing pass, one copy in particular is not
getting removed. This is the one involving the memcpy transfer size/vector element count. Regarding
why the register coalescer is unable to get rid of this particular copy/mov, I had a look at the
llc --debug output and it seems that it can't remove the mov/copy because the live range of the
element count register intersects with the live range of the target of the copy/mov.

An example of the generated (incorrect) assembly is shown below:

Relevant MIR:

TP Entry
	...
	lr = t2WhileLoopStartLR r4 (r4 may be holding something other than element count)

TP preHeader
	...
	mov r4, r2 (assume r2 holds element count)
	...
TP body
	...
	VCTP r4
	...

Existing logic:

So this value (r4 above) feeds into the loopBody PHI nodes and then the VCTP receives it (which is fine).
But when the ARMLowOverheadLoops pass tries to use the element count operand of the VCTP to feed back to t2WhileLoopStartLR,
it provides r4 (which is incorrect because the mov happens after the t2WhileLoopStartLR).

So I tried to see if I could fix this by looking into LowOverheadLoop::ValidateTailPredicate(),
as it defines the "TPNumElements" variable. There is some logic there that handles the case for
local redefinitions of the elementCount physical register, by moving it forward/backward using ReachingDefAnalysis.
But in this instance, we have a redefinition (the mov) in a different BasicBlock so that code doesn't seem to fix this.

I'm not entirely certain if it's acceptable to not generate the preHeader, but unless there is a reasonably
simple fix for the above issue, I can't see another way.
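The ordering problem described above can be pictured with a tiny register model. This is only an illustrative sketch: the `Regs` struct and both function names are hypothetical, and the real passes operate on MIR rather than on plain integers.

```cpp
#include <cstdint>

// Hypothetical two-register model of the situation described above.
// r2 holds the real element count; r4 holds an unrelated stale value
// until "mov r4, r2" executes.
struct Regs {
  std::uint32_t r2;
  std::uint32_t r4;
};

// Loop start placed in the entry block, copy placed later in the preheader:
// the loop start reads r4 before it has been written.
std::uint32_t loopStartBeforeCopy(Regs regs) {
  std::uint32_t seen = regs.r4; // lr = t2WhileLoopStartLR r4
  regs.r4 = regs.r2;            // mov r4, r2 (too late for the loop start)
  return seen;
}

// No separate preheader: the copy ends up above the loop start.
std::uint32_t loopStartAfterCopy(Regs regs) {
  regs.r4 = regs.r2; // mov r4, r2
  return regs.r4;    // lr = t2WhileLoopStartLR r4
}
```

With r2 = 37 and a stale r4 = 1000, the first ordering hands 1000 to the loop start while the second hands it the intended 37.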

Harbormaster completed remote builds in B97102: Diff 335218. Apr 5 2021, 2:09 AM

Fixed some more clang-format errors.

Harbormaster completed remote builds in B97106: Diff 335222. Apr 5 2021, 3:28 AM

malharJ edited the summary of this revision. Apr 7 2021, 6:40 PM

I'm a little worried that WLSTP is going to cause problems, with it not used anywhere else. Let's at least add an option for disabling it if needed.

llvm/lib/Target/ARM/ARMBaseInstrInfo.h
370 ↗	(On Diff #335222)	Please split this out into a separate review. I think it makes sense (I'm pretty sure I remember writing it, it must have been lost in a refactoring).
llvm/lib/Target/ARM/ARMISelLowering.cpp
11307	When will this happen?
11326	-> "for a more natural layout"? I think there may be benefits from getting the order roughly correct at this stage, if we are relying on WLS branches. They can be fixed up later, but if we get them more correct at this point, that can only help.
llvm/lib/Target/ARM/ARMISelLowering.h
54–343	Don't format any of this - it's unrelated.
llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
151	Can we add an option that turns this inline memcpy on/off. If the option is true, we always use the MEMCPYLOOP, if it's false we never do, and if it's unset we use this default logic. Also consider pulling the if logic into a lambda for readability.
llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
9 ↗	(On Diff #335222)	Can you make sure there are tests where the i32 is a different type, one that is not legal like an i64.
llvm/test/CodeGen/Thumb2/mve_tp_loop.mir
136 ↗	(On Diff #335222)	Some of this can be removed, to help keep the test smaller.

addressed some of the review comments:

added a cli option for generation of TP loop for memcpy
simplified the mir test

I'm a little worried that WLSTP is going to cause problems ...

Would it be better to use DLSTP in that case? Or perhaps a command line option
for choosing between DLSTP/WLSTP implementations?

llvm/lib/Target/ARM/ARMBaseInstrInfo.h
370 ↗	(On Diff #335222)	Do you mean just this single line update as a separate review ?
llvm/lib/Target/ARM/ARMISelLowering.cpp
11307	This happens if MVE_MEMCPYLOOPINST is the only instruction in the block. splitAt() returns the same block if there is nothing after the instruction at which the split is done. This occurs when for loops are implicitly converted to memcpys: the memcpy call ends up being the only instruction in the preheader. There is already a test case for this as test2 in llvm/test/CodeGen/Thumb2/mve_tp_loop.mir
llvm/lib/Target/ARM/ARMISelLowering.h
54–343	I had to fix it because patch was failing on clang-format error.
llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
9 ↗	(On Diff #335222)	Do you mean something like:
	define void @test(i8* noalias %X, i8* noalias %Y, i64 %n) {
	  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %X, i8* align 4 %Y, i64 %n, i1 false)
	  ret void
	}
	I get an error when I try to generate the assembly. Since i64 is illegal, what is the expectation here? As a side note, if I generate the IR from C code, the IR always truncates the memcpy size variable to an i32 before calling llvm.memcpy().

Harbormaster completed remote builds in B98018: Diff 336495. Apr 9 2021, 10:28 AM

dmgreen added inline comments. Apr 12 2021, 6:28 AM

llvm/lib/Target/ARM/ARMBaseInstrInfo.h
370 ↗	(On Diff #335222)	Yeah... it should preferably be a separate patch. Do you have a test case? Some reason why you changed it?
llvm/lib/Target/ARM/ARMISelLowering.cpp
11307	OK. I thought it was more eager about putting branches on the end of blocks, even if they are fallthroughs.
llvm/lib/Target/ARM/ARMISelLowering.h
54–343	That's fine. We can ignore the precommit bot where it's more noisy than helpful.
llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
9 ↗	(On Diff #335222)	Yeah sure. I think they can still come up, from other places creating memcpy calls. You can probably use DAG.getZExtOrTrunc(Size, MVT::i32), instead of using the size directly.
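To make the suggestion concrete, here is a small sketch of what "zero-extend or truncate to 32 bits" means for a size operand. It is a plain-integer model, not the SelectionDAG API: the function name and the `srcBits` parameter are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical integer model of "zero-extend or truncate to i32".
// `value` carries the operand's bits; `srcBits` is the operand's original width.
std::uint32_t zextOrTruncTo32(std::uint64_t value, unsigned srcBits) {
  assert(srcBits >= 1 && srcBits <= 64 && "unsupported source width");
  // Keep only the bits that belong to the source type; anything above
  // them is treated as zero (zero extension).
  std::uint64_t mask = srcBits == 64 ? ~0ULL : ((1ULL << srcBits) - 1);
  // Narrowing to 32 bits drops any high bits (truncation).
  return static_cast<std::uint32_t>(value & mask);
}
```

An i64 size is narrowed to its low 32 bits, while an i16 size is widened with zeros, so the lowering always sees a legal i32 operand.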

Addressed remaining review comments:

Separated a change into its own patch and added it as a dependency
minor formatting updates
added a testcase with size of type other than i32

malharJ marked an inline comment as done. Apr 13 2021, 3:57 AM

malharJ added inline comments.

llvm/lib/Target/ARM/ARMBaseInstrInfo.h
370 ↗	(On Diff #335222)	Ok, I've now created a separate patch for this: https://reviews.llvm.org/D100376

malharJ added a parent revision: D100376: [ARM] Prevent phi-node-elimination from generating copy above t2WhileLoopStartLR. Apr 13 2021, 4:00 AM

Harbormaster completed remote builds in B98454: Diff 337096. Apr 13 2021, 4:43 AM

dmgreen mentioned this in D100435: [ARM] Transforming memset to Tail predicated Loop. Apr 14 2021, 1:30 AM

dmgreen added inline comments. Apr 14 2021, 1:22 PM

llvm/lib/Target/ARM/ARMISelLowering.cpp
11282	Great comment by the way. Is it possible to make the loop MBB look a bit like a loop? To show there is a backedge too.
llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
150	`[=]` ->`[&]` is more standard, even if it doesn't make a lot of difference here.
152	Probably better as:
	if (DAG.getMachineFunction().getFunction().hasOptNone())
	  return false;
	if (!ConstantSize && Alignment >= Align(4))
	  return true;
	if (...)
	  ...
	The EnableMemcpyTPLoop logic could be in here too, as it's just returning true/false at the right time. What do we do for -Oz and -Os?
llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
2 ↗	(On Diff #337096)	Shouldn't have -O1 or cpu, use the -mtriple from other similar tests. The test can be called llvm/test/CodeGen/Thumb2/mve-tp-loop.ll.

Addressed review comments:

renamed test files
disabled inline memcpy for optimize size cases (-Os, -Oz) and added tests for the same
also added tests for constant size inputs to ensure the threshold values are tested as well.
minor formatting changes

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
152	Ok. And I guess it would be best to disable in the case of -Os/-Oz in case there are multiple memcpys in the source. I've made the update and added tests as well.

dmgreen added inline comments. Apr 15 2021, 2:44 AM

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
152	func -> Func. Or maybe just F, which is quite common in LLVM.
159	`if (EnableMemcpyTPLoop == cl::BOU_FALSE) return false;` is probably better, if it works as I expect. That keeps the indenting down, and the last if currently isn't in the block it looks like it should be. Oh, and move EnableMemcpyTPLoop above the OptSize/OptNone, in case we want to try and force it. (Even if OptNone doesn't work, using that combo is unlikely to be useful at any rate.)
164	Add a return false at the end?

Addressed review comments:

moved cli option (when set) to be of higher priority than optNone/optSize
minor formatting updates.

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
159	Ok, my bad there with the braces. I've moved the cases where the cli option is set to be of higher priority than the optNone/optSize cases, but the unset case is of lower priority than optNone/optSize, since the user is not passing the cli option. Hopefully that sounds sensible.
164	yep, had missed that out.
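The priority order agreed on in these comments can be sketched as a standalone predicate. This is an assumption-laden model rather than the code in ARMSelectionDAGInfo.cpp: the `BoolOrDefault` enum only mirrors the three states of a cl::boolOrDefault option, the `CopyRequest` struct is hypothetical, and the constant-size rule with its two thresholds is invented for illustration because the review elides that condition.

```cpp
#include <cstdint>

// Mirrors the three states of a tri-state command line option (assumption).
enum class BoolOrDefault { Unset, True, False };

// Hypothetical description of one memcpy call site.
struct CopyRequest {
  bool optNone;               // function is marked optnone
  bool optSize;               // function is optimised for size (-Os/-Oz)
  bool hasConstantSize;       // size operand is a compile-time constant
  std::uint64_t constantSize; // only meaningful when hasConstantSize is true
  unsigned alignment;         // alignment in bytes
};

// Hypothetical thresholds; the real values are not given in this review.
constexpr std::uint64_t kMinInlineSize = 64;
constexpr std::uint64_t kMaxInlineSize = 128;

bool shouldGenerateInlineTPLoop(BoolOrDefault flag, const CopyRequest &r) {
  // 1. An explicit option wins over everything else.
  if (flag == BoolOrDefault::True)
    return true;
  if (flag == BoolOrDefault::False)
    return false;
  // 2. Option unset: never inline for optnone or size-optimised functions.
  if (r.optNone || r.optSize)
    return false;
  // 3. Default heuristic: unknown size with at least 4-byte alignment.
  if (!r.hasConstantSize && r.alignment >= 4)
    return true;
  // 4. Assumed constant-size rule between two hypothetical thresholds.
  if (r.hasConstantSize && r.constantSize > kMinInlineSize &&
      r.constantSize < kMaxInlineSize)
    return true;
  return false;
}
```

Placing the explicit-option checks first is what lets a user force the loop even for a size-optimised function, while the unset case still respects optnone and optsize.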

Harbormaster completed remote builds in B98843: Diff 337669. Apr 15 2021, 3:29 AM

Harbormaster completed remote builds in B98854: Diff 337684. Apr 15 2021, 4:55 AM

Rebased patch and removed the dependency as it has been closed.

malharJ removed parent revisions: D100376: [ARM] Prevent phi-node-elimination from generating copy above t2WhileLoopStartLR, D99649: [ARM] Updates to arm-block-placement pass. Apr 17 2021, 3:50 PM

malharJ edited the summary of this revision.

Harbormaster completed remote builds in B99351: Diff 338345. Apr 17 2021, 4:27 PM

For comparison:
https://github.com/llvm/llvm-project/blob/main/libc/src/string/memory_utils/memcpy_utils.h

malharJ added a child revision: D100435: [ARM] Transforming memset to Tail predicated Loop. Apr 18 2021, 9:27 AM

In D99723#2697318, @tschuett wrote:

For comparison:
https://github.com/llvm/llvm-project/blob/main/libc/src/string/memory_utils/memcpy_utils.h

Yeah thanks, but this is for a different architecture. On M class we have access to MVE tail predicated loops that can be much more efficient for emitting inline memcpys. A-Class Arm with Neon will be very different.

This looks good to me now, with a couple of extra nits.

llvm/lib/Target/ARM/ARMISelLowering.cpp
1618	Remove newline.
llvm/lib/Target/ARM/ARMSelectionDAGInfo.h
19 ↗	(On Diff #338345)	Is this needed here? Can it be in the cpp file?

This revision is now accepted and ready to land. Apr 19 2021, 1:23 AM

Minor formatting updates.

Herald added a subscriber: tmatheson. Apr 25 2021, 10:34 AM

Harbormaster completed remote builds in B100820: Diff 340370. Apr 25 2021, 11:23 AM

Thanks. Can you rebase and make sure the patch is clang-formatted?

Rebased patch + minor formatting updates.

Harbormaster completed remote builds in B101154: Diff 340828. Apr 27 2021, 7:42 AM

Fix for bug spotted by dmgreen (thank you):

Added an update to ensure that the block containing memcpy pseudo is always
split using splitAt().

An example case where this is important is when updating
phi instructions in successive blocks, which is taken care of by splitAt()
which calls transferSuccessorsAndUpdatePHIs() internally.
A test has been added for the same.
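The splitAt() behaviour discussed earlier in this review (the same block comes back when nothing follows the split instruction) can be pictured with a toy model. The sketch below uses a vector of instruction names instead of a MachineBasicBlock; `Block` and `splitAfter` are hypothetical names, and successor and PHI updates are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a basic block: an ordered list of instruction names.
using Block = std::vector<std::string>;

// Moves every instruction after index `at` into `tail`.
// Returns false, leaving `block` untouched and `tail` empty, when nothing
// follows the split point, which models "the same block is returned".
bool splitAfter(Block &block, std::size_t at, Block &tail) {
  assert(at < block.size() && "split point must be inside the block");
  tail.clear();
  if (at + 1 == block.size())
    return false; // nothing to move, so no new block is created
  auto first = block.begin() + static_cast<std::ptrdiff_t>(at + 1);
  tail.assign(first, block.end());
  block.erase(first, block.end());
  return true;
}
```

A caller that expands the memcpy pseudo has to cope with both outcomes: a fresh tail block when other instructions follow, and no new block when the pseudo is the last instruction.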

malharJ added inline comments. May 4 2021, 10:11 PM

llvm/test/CodeGen/Thumb2/mve-tp-loop.ll
242–251	Not entirely sure why this isn't a TP loop, might need to check ArmLOL pass as to why it's being reverted..

Harbormaster completed remote builds in B102670: Diff 342948. May 4 2021, 10:52 PM

tmatheson removed a subscriber: tmatheson. May 5 2021, 2:06 AM

Thanks. It looks like the arm low overhead loop pass doesn't like that two loops have the same preheader. Which makes sense, I don't like that either.

What do you think about committing this with the flag off for the time being and flipping the switch when we have sorted out some of the problems this is running into? memset especially seems to come up in a lot of cases, and can run into problems with so many low overhead loops together.

Changed cli option for conversion of memcpy to TP loop to be disabled by default.
The disable may be temporary, and will be removed after some more testing.

A custom enum replaces the cl::boolOrDefault to implement the required functionality.
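A minimal sketch of how such an enum-backed setting might behave is given below. Only the `TPLoop` name and a force-disable state are mentioned in this review; the exact enumerator spellings, the `Allow` state, and the `useTPLoop` helper are assumptions for illustration.

```cpp
// Hypothetical enum modelled on the TPLoop setting mentioned in this review.
namespace TPLoop {
enum MemTransfer { ForceDisabled, ForceEnabled, Allow };
} // namespace TPLoop

// Assumption: with the conversion off by default, the option's initial value
// is the force-disabled state rather than an "unset" state.
constexpr TPLoop::MemTransfer kDefaultSetting = TPLoop::ForceDisabled;

// Combines the option with the result of the usual size/alignment heuristic.
bool useTPLoop(TPLoop::MemTransfer setting, bool heuristicSaysYes) {
  switch (setting) {
  case TPLoop::ForceDisabled:
    return false;
  case TPLoop::ForceEnabled:
    return true;
  case TPLoop::Allow:
    return heuristicSaysYes;
  }
  return false; // unreachable for valid enumerators
}
```

Unlike a tri-state boolean whose unset value falls through to the heuristic, an explicit enum lets the default itself be "disabled" while keeping a separate state that defers to the heuristic.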

Harbormaster completed remote builds in B102761: Diff 343066. May 5 2021, 9:50 AM

Thanks. LGTM

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
153	Perhaps use == TPLoop::ForceDisable to make it clear.

This revision was landed with ongoing or failed builds. May 6 2021, 1:39 AM

Closed by commit rGb856f4a232cb: [ARM] Transforming memcpy to Tail predicated Loop (authored by malharJ).

This revision was automatically updated to reflect the committed changes.

malharJ added a commit: rGb856f4a232cb: [ARM] Transforming memcpy to Tail predicated Loop.

Thanks a lot for the review !

malharJ added a reverting change: rGfc690777fce0: Revert "[ARM] Transforming memcpy to Tail predicated Loop". May 6 2021, 4:42 AM

malharJ reopened this revision. May 6 2021, 4:58 AM

This revision is now accepted and ready to land. May 6 2021, 4:58 AM

Fix for MachineVerifier error during Buildbot failure
https://lab.llvm.org/buildbot/#/builders/16/builds/10462

The failure is happening during testing because the NoPHIs property is being
set to true by MIRParserImpl::computeFunctionProperties() as there are no PHIs (prior to the transformation),
but during the transform PHIs are generated.
This results in an error during MachineVerifier, since the function is labelled
with NoPHIs=true while there are PHI instructions in the code.

This fix resets the property to false during the transform.
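The mismatch and its fix can be modelled in a few lines. The sketch is a toy, not the MachineVerifier: `ToyFunction`, `verify`, and `expandPseudo` are hypothetical names, and instructions are represented by strings.

```cpp
#include <string>
#include <vector>

// Hypothetical function model: a claimed property plus a list of opcodes.
struct ToyFunction {
  bool noPHIs = true;              // property computed when the function was parsed
  std::vector<std::string> instrs; // instruction opcodes, in order
};

// Toy verifier: a function that claims NoPHIs must not contain any PHI.
bool verify(const ToyFunction &f) {
  if (!f.noPHIs)
    return true;
  for (const std::string &op : f.instrs)
    if (op == "PHI")
      return false;
  return true;
}

// Toy expansion that introduces a PHI; `resetProperty` models the fix.
void expandPseudo(ToyFunction &f, bool resetProperty) {
  f.instrs.push_back("PHI");
  if (resetProperty)
    f.noPHIs = false;
}
```

Without the reset, the function still advertises NoPHIs after the expansion and the check fails; with the reset, the claim is withdrawn and the same instruction stream verifies.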

Harbormaster completed remote builds in B102960: Diff 343364. May 6 2021, 6:02 AM

Ah yes. Sorry I didn't suggest adding that to the tests - it can be useful.

Setting NoPHIs seems a bit odd. It's a side effect of the mir test having no PHI's as it's loaded but them being added here. I don't have a better suggestion for fixing it though, other than adding existing PHI's to the mir test which needlessly complicates it.

This sounds like a good fix to me.

This revision was landed with ongoing or failed builds. May 6 2021, 3:26 PM

Closed by commit rG9ff38e2d9dd7: [ARM] Transforming memcpy to Tail predicated Loop (authored by malharJ).

This revision was automatically updated to reflect the committed changes.

malharJ added a commit: rG9ff38e2d9dd7: [ARM] Transforming memcpy to Tail predicated Loop.

Revision Contents

Path	Size
llvm/lib/Target/ARM/ARMISelLowering.h	4 lines
llvm/lib/Target/ARM/ARMISelLowering.cpp	225 lines
llvm/lib/Target/ARM/ARMInstrMVE.td	12 lines
llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp	44 lines
llvm/lib/Target/ARM/ARMSubtarget.h	5 lines
llvm/lib/Target/ARM/ARMTargetTransformInfo.h	5 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/memcall.ll	51 lines
llvm/test/CodeGen/Thumb2/mve-tp-loop.ll	285 lines
llvm/test/CodeGen/Thumb2/mve-tp-loop.mir	127 lines

Diff 343513

llvm/lib/Target/ARM/ARMISelLowering.h

	class SelectionDAG;
	class TargetLibraryInfo;
	class TargetMachine;
	class TargetRegisterInfo;
	class VectorType;

	namespace ARMISD {

	// ARM Specific DAG Nodes
	enum NodeType : unsigned {
	// Start the numbering where the builtin ops and target ops leave off.
	FIRST_NUMBER = ISD::BUILTIN_OP_END,

	Wrapper, // Wrapper - A wrapper node for TargetConstantPool,
	// TargetExternalSymbol, and TargetGlobalAddress.
	WrapperPIC, // WrapperPIC - A wrapper node for TargetGlobalAddress in
	// PIC mode.
	WrapperJT, // WrapperJT - A wrapper node for TargetJumpTable

	// Add pseudo op to model memcpy for struct byval.
	COPY_STRUCT_BYVAL,

	CALL, // Function call.
	CALL_PRED, // Function call that's predicable.
	CALL_NOLINK, // Function call with branch not branch-and-link.
	tSECALL, // CMSE non-secure function call.
	BRCOND, // Conditional branch.
	BR_JT, // Jumptable branch.
	BR2_JT, // Jumptable branch (2 level - jumptable entry is a jump).
	RET_FLAG, // Return with a flag operand.
	SERET_FLAG, // CMSE Entry function return with a flag operand.
	INTRET_FLAG, // Interrupt return with an LR-offset and a flag operand.

	PIC_ADD, // Add with a PC operand and a PIC label.

	ASRL, // MVE long arithmetic shift right.
	LSRL, // MVE long shift right.
	LSLL, // MVE long shift left.

	CMP, // ARM compare instructions.
	CMN, // ARM CMN instructions.
	CMPZ, // ARM compare that sets only Z flag.
	CMPFP, // ARM VFP compare instruction, sets FPSCR.
	CMPFPE, // ARM VFP signalling compare instruction, sets FPSCR.
	CMPFPw0, // ARM VFP compare against zero instruction, sets FPSCR.
	CMPFPEw0, // ARM VFP signalling compare against zero instruction, sets
	// FPSCR.
	FMSTAT, // ARM fmstat instruction.

	CMOV, // ARM conditional move instructions.
	SUBS, // Flag-setting subtraction.

	SSAT, // Signed saturation
	USAT, // Unsigned saturation

	BCC_i64,

	SRL_FLAG, // V,Flag = srl_flag X -> srl X, 1 + save carry out.
	SRA_FLAG, // V,Flag = sra_flag X -> sra X, 1 + save carry out.
	RRX, // V = RRX X, Flag -> srl X, 1 + shift in carry flag.

	ADDC, // Add with carry
	ADDE, // Add using carry
	SUBC, // Sub with carry
	SUBE, // Sub using carry
	LSLS, // Shift left producing carry

	VMOVRRD, // double to two gprs.
	VMOVDRR, // Two gprs to double.
	VMOVSR, // move gpr to single, used for f32 literal constructed in a gpr

	EH_SJLJ_SETJMP, // SjLj exception handling setjmp.
	EH_SJLJ_LONGJMP, // SjLj exception handling longjmp.
	EH_SJLJ_SETUP_DISPATCH, // SjLj exception handling setup_dispatch.

	TC_RETURN, // Tail call return pseudo.

	THREAD_POINTER,

	DYN_ALLOC, // Dynamic allocation on the stack.

	MEMBARRIER_MCR, // Memory barrier (MCR)

	PRELOAD, // Preload

	WIN__CHKSTK, // Windows' __chkstk call to do stack probing.
	WIN__DBZCHK, // Windows' divide by zero check

	WLS, // Low-overhead loops, While Loop Start branch. See t2WhileLoopStart
	WLSSETUP, // Setup for the iteration count of a WLS. See t2WhileLoopSetup.
	LOOP_DEC, // Really a part of LE, performs the sub
	LE, // Low-overhead loops, Loop End

	PREDICATE_CAST, // Predicate cast for MVE i1 types
	VECTOR_REG_CAST, // Reinterpret the current contents of a vector register

	VCMP, // Vector compare.
	VCMPZ, // Vector compare to zero.
	VTST, // Vector test bits.

	// Vector shift by vector
	VSHLs, // ...left/right by signed
	VSHLu, // ...left/right by unsigned

	// Vector shift by immediate:
	VSHLIMM, // ...left
	VSHRsIMM, // ...right (signed)
	VSHRuIMM, // ...right (unsigned)

	// Vector rounding shift by immediate:
	VRSHRsIMM, // ...right (signed)
	VRSHRuIMM, // ...right (unsigned)
	VRSHRNIMM, // ...right narrow

	// Vector saturating shift by immediate:
	VQSHLsIMM, // ...left (signed)
	VQSHLuIMM, // ...left (unsigned)
	VQSHLsuIMM, // ...left (signed to unsigned)
	VQSHRNsIMM, // ...right narrow (signed)
	VQSHRNuIMM, // ...right narrow (unsigned)
	VQSHRNsuIMM, // ...right narrow (signed to unsigned)

	// Vector saturating rounding shift by immediate:
	VQRSHRNsIMM, // ...right narrow (signed)
	VQRSHRNuIMM, // ...right narrow (unsigned)
	VQRSHRNsuIMM, // ...right narrow (signed to unsigned)

	// Vector shift and insert:
	VSLIIMM, // ...left
	VSRIIMM, // ...right

	// Vector get lane (VMOV scalar to ARM core register)
	// (These are used for 8- and 16-bit element types only.)
	VGETLANEu, // zero-extend vector extract element
	VGETLANEs, // sign-extend vector extract element

	// Vector move immediate and move negated immediate:
	VMOVIMM,
	VMVNIMM,

	// Vector move f32 immediate:
	VMOVFPIMM,

	// Move H <-> R, clearing top 16 bits
	VMOVrh,
	VMOVhr,

	// Vector duplicate:
	VDUP,
	VDUPLANE,

	// Vector shuffles:
	VEXT, // extract
	VREV64, // reverse elements within 64-bit doublewords
	VREV32, // reverse elements within 32-bit words
	VREV16, // reverse elements within 16-bit halfwords
	VZIP, // zip (interleave)
	VUZP, // unzip (deinterleave)
	VTRN, // transpose
	VTBL1, // 1-register shuffle with mask
	VTBL2, // 2-register shuffle with mask
	VMOVN, // MVE vmovn

	// MVE Saturating truncates
	VQMOVNs, // Vector (V) Saturating (Q) Move and Narrow (N), signed (s)
	VQMOVNu, // Vector (V) Saturating (Q) Move and Narrow (N), unsigned (u)

	// MVE float <> half converts
	VCVTN, // MVE vcvt f32 -> f16, truncating into either the bottom or top
	// lanes
	VCVTL, // MVE vcvt f16 -> f32, extending from either the bottom or top lanes

	// MVE VIDUP instruction, taking a start value and increment.
	VIDUP,

	// Vector multiply long:
	VMULLs, // ...signed
	VMULLu, // ...unsigned

	VQDMULH, // MVE vqdmulh instruction

	// MVE reductions
	VADDVs, // sign- or zero-extend the elements of a vector to i32,
	VADDVu, // add them all together, and return an i32 of their sum
	VADDVps, // Same as VADDV[su] but with a v4i1 predicate mask
	VADDVpu,
	VADDLVs, // sign- or zero-extend elements to i64 and sum, returning
	VADDLVu, // the low and high 32-bit halves of the sum
	VADDLVAs, // Same as VADDLV[su] but also add an input accumulator
	VADDLVAu, // provided as low and high halves
	VADDLVps, // Same as VADDLV[su] but with a v4i1 predicate mask
	VADDLVpu,
	VADDLVAps, // Same as VADDLVp[su] but with a v4i1 predicate mask
	VADDLVApu,
	VMLAVs, // sign- or zero-extend the elements of two vectors to i32, multiply
	// them
	VMLAVu, // and add the results together, returning an i32 of their sum
	VMLAVps, // Same as VMLAV[su] with a v4i1 predicate mask
	VMLAVpu,
	VMLALVs, // Same as VMLAV but with i64, returning the low and
	VMLALVu, // high 32-bit halves of the sum
	VMLALVps, // Same as VMLALV[su] with a v4i1 predicate mask
	VMLALVpu,
	VMLALVAs, // Same as VMLALV but also add an input accumulator
	VMLALVAu, // provided as low and high halves
	VMLALVAps, // Same as VMLALVA[su] with a v4i1 predicate mask
	VMLALVApu,
	VMINVu, // Find minimum unsigned value of a vector and register
	VMINVs, // Find minimum signed value of a vector and register			VMINVs, // Find minimum signed value of a vector and register
	VMAXVu, // Find maximum unsigned value of a vector and register			VMAXVu, // Find maximum unsigned value of a vector and register
	VMAXVs, // Find maximum signed value of a vector and register			VMAXVs, // Find maximum signed value of a vector and register

	SMULWB, // Signed multiply word by half word, bottom			SMULWB, // Signed multiply word by half word, bottom
	SMULWT, // Signed multiply word by half word, top			SMULWT, // Signed multiply word by half word, top
	UMLAL, // 64bit Unsigned Accumulate Multiply			UMLAL, // 64bit Unsigned Accumulate Multiply
	SMLAL, // 64bit Signed Accumulate Multiply			SMLAL, // 64bit Signed Accumulate Multiply
	UMAAL, // 64-bit Unsigned Accumulate Accumulate Multiply			UMAAL, // 64-bit Unsigned Accumulate Accumulate Multiply
	SMLALBB, // 64-bit signed accumulate multiply bottom, bottom 16			SMLALBB, // 64-bit signed accumulate multiply bottom, bottom 16
	SMLALBT, // 64-bit signed accumulate multiply bottom, top 16			SMLALBT, // 64-bit signed accumulate multiply bottom, top 16
	SMLALTB, // 64-bit signed accumulate multiply top, bottom 16			SMLALTB, // 64-bit signed accumulate multiply top, bottom 16
	SMLALTT, // 64-bit signed accumulate multiply top, top 16			SMLALTT, // 64-bit signed accumulate multiply top, top 16
	SMLALD, // Signed multiply accumulate long dual			SMLALD, // Signed multiply accumulate long dual
	SMLALDX, // Signed multiply accumulate long dual exchange			SMLALDX, // Signed multiply accumulate long dual exchange
	SMLSLD, // Signed multiply subtract long dual			SMLSLD, // Signed multiply subtract long dual
	SMLSLDX, // Signed multiply subtract long dual exchange			SMLSLDX, // Signed multiply subtract long dual exchange
	SMMLAR, // Signed multiply long, round and add			SMMLAR, // Signed multiply long, round and add
	SMMLSR, // Signed multiply long, subtract and round			SMMLSR, // Signed multiply long, subtract and round

	// Single Lane QADD8 and QADD16. Only the bottom lane. That's what the b			// Single Lane QADD8 and QADD16. Only the bottom lane. That's what the b
	// stands for.			// stands for.
	QADD8b,			QADD8b,
	QSUB8b,			QSUB8b,
	QADD16b,			QADD16b,
	QSUB16b,			QSUB16b,

	// Operands of the standard BUILD_VECTOR node are not legalized, which			// Operands of the standard BUILD_VECTOR node are not legalized, which
	// is fine if BUILD_VECTORs are always lowered to shuffles or other			// is fine if BUILD_VECTORs are always lowered to shuffles or other
	// operations, but for ARM some BUILD_VECTORs are legal as-is and their			// operations, but for ARM some BUILD_VECTORs are legal as-is and their
	// operands need to be legalized. Define an ARM-specific version of			// operands need to be legalized. Define an ARM-specific version of
	// BUILD_VECTOR for this purpose.			// BUILD_VECTOR for this purpose.
	BUILD_VECTOR,			BUILD_VECTOR,

	// Bit-field insert			// Bit-field insert
	BFI,			BFI,

	// Vector OR with immediate			// Vector OR with immediate
	VORRIMM,			VORRIMM,
	// Vector AND with NOT of immediate			// Vector AND with NOT of immediate
	VBICIMM,			VBICIMM,

	// Pseudo vector bitwise select			// Pseudo vector bitwise select
	VBSP,			VBSP,

	// Pseudo-instruction representing a memory copy using ldm/stm			// Pseudo-instruction representing a memory copy using ldm/stm
	// instructions.			// instructions.
	MEMCPY,			MEMCPY,

				// Pseudo-instruction representing a memory copy using a tail predicated
				// loop
				MEMCPYLOOP,

	// V8.1MMainline condition select			// V8.1MMainline condition select
	CSINV, // Conditional select invert.			CSINV, // Conditional select invert.
	CSNEG, // Conditional select negate.			CSNEG, // Conditional select negate.
	CSINC, // Conditional select increment.			CSINC, // Conditional select increment.

	// Vector load N-element structure to all lanes:			// Vector load N-element structure to all lanes:
	VLD1DUP = ISD::FIRST_TARGET_MEMORY_OPCODE,			VLD1DUP = ISD::FIRST_TARGET_MEMORY_OPCODE,
	VLD2DUP,			VLD2DUP,
	VLD3DUP,			VLD3DUP,
	VLD4DUP,			VLD4DUP,

	// NEON loads with post-increment base updates:			// NEON loads with post-increment base updates:
	VLD1_UPD,			VLD1_UPD,
	VLD2_UPD,			VLD2_UPD,
	VLD3_UPD,			VLD3_UPD,
	VLD4_UPD,			VLD4_UPD,
	VLD2LN_UPD,			VLD2LN_UPD,
	VLD3LN_UPD,			VLD3LN_UPD,
	VLD4LN_UPD,			VLD4LN_UPD,
	VLD1DUP_UPD,			VLD1DUP_UPD,
	VLD2DUP_UPD,			VLD2DUP_UPD,
	VLD3DUP_UPD,			VLD3DUP_UPD,
	VLD4DUP_UPD,			VLD4DUP_UPD,

	// NEON stores with post-increment base updates:			// NEON stores with post-increment base updates:
	VST1_UPD,			VST1_UPD,
	VST2_UPD,			VST2_UPD,
	VST3_UPD,			VST3_UPD,
	VST4_UPD,			VST4_UPD,
	VST2LN_UPD,			VST2LN_UPD,
	VST3LN_UPD,			VST3LN_UPD,
	VST4LN_UPD,			VST4LN_UPD,

	// Load/Store of dual registers			// Load/Store of dual registers
	LDRD,			LDRD,
	STRD			STRD
	};			};
				dmgreen: Don't format any of this - it's unrelated.
				malharJ: I had to fix it because patch was failing on clang-format error.
				dmgreen: That's fine. We can ignore the precommit bot where it's more noisy than helpful.

	} // end namespace ARMISD			} // end namespace ARMISD

	namespace ARM {			namespace ARM {
	/// Possible values of current rounding mode, which is specified in bits			/// Possible values of current rounding mode, which is specified in bits
	/// 23:22 of FPSCR.			/// 23:22 of FPSCR.
	enum Rounding {			enum Rounding {
	RN = 0, // Round to Nearest			RN = 0, // Round to Nearest
	[... 613 lines not shown ...]

llvm/lib/Target/ARM/ARMISelLowering.cpp

[... 1,609 lines not shown ...]

const char *ARMTargetLowering::getTargetNodeName(unsigned Opcode) const {		const char *ARMTargetLowering::getTargetNodeName(unsigned Opcode) const {
#define MAKE_CASE(V) \		#define MAKE_CASE(V) \
case V: \		case V: \
return #V;		return #V;
switch ((ARMISD::NodeType)Opcode) {		switch ((ARMISD::NodeType)Opcode) {
case ARMISD::FIRST_NUMBER:		case ARMISD::FIRST_NUMBER:
break;		break;
MAKE_CASE(ARMISD::Wrapper)		MAKE_CASE(ARMISD::Wrapper)
		dmgreen: Remove newline.
MAKE_CASE(ARMISD::WrapperPIC)		MAKE_CASE(ARMISD::WrapperPIC)
MAKE_CASE(ARMISD::WrapperJT)		MAKE_CASE(ARMISD::WrapperJT)
MAKE_CASE(ARMISD::COPY_STRUCT_BYVAL)		MAKE_CASE(ARMISD::COPY_STRUCT_BYVAL)
MAKE_CASE(ARMISD::CALL)		MAKE_CASE(ARMISD::CALL)
MAKE_CASE(ARMISD::CALL_PRED)		MAKE_CASE(ARMISD::CALL_PRED)
MAKE_CASE(ARMISD::CALL_NOLINK)		MAKE_CASE(ARMISD::CALL_NOLINK)
MAKE_CASE(ARMISD::tSECALL)		MAKE_CASE(ARMISD::tSECALL)
MAKE_CASE(ARMISD::BRCOND)		MAKE_CASE(ARMISD::BRCOND)
[... 170 lines not shown ...]	case ARMISD::FIRST_NUMBER:
MAKE_CASE(ARMISD::VST4LN_UPD)		MAKE_CASE(ARMISD::VST4LN_UPD)
MAKE_CASE(ARMISD::WLS)		MAKE_CASE(ARMISD::WLS)
MAKE_CASE(ARMISD::WLSSETUP)		MAKE_CASE(ARMISD::WLSSETUP)
MAKE_CASE(ARMISD::LE)		MAKE_CASE(ARMISD::LE)
MAKE_CASE(ARMISD::LOOP_DEC)		MAKE_CASE(ARMISD::LOOP_DEC)
MAKE_CASE(ARMISD::CSINV)		MAKE_CASE(ARMISD::CSINV)
MAKE_CASE(ARMISD::CSNEG)		MAKE_CASE(ARMISD::CSNEG)
MAKE_CASE(ARMISD::CSINC)		MAKE_CASE(ARMISD::CSINC)
		MAKE_CASE(ARMISD::MEMCPYLOOP)
#undef MAKE_CASE		#undef MAKE_CASE
}		}
return nullptr;		return nullptr;
}		}

EVT ARMTargetLowering::getSetCCResultType(const DataLayout &DL, LLVMContext &,		EVT ARMTargetLowering::getSetCCResultType(const DataLayout &DL, LLVMContext &,
EVT VT) const {		EVT VT) const {
if (!VT.isVector())		if (!VT.isVector())
[... 9,279 lines not shown ...]	static bool checkAndUpdateCPSRKill(MachineBasicBlock::iterator SelectItr,
}		}

// We found a def, or hit the end of the basic block and CPSR wasn't live		// We found a def, or hit the end of the basic block and CPSR wasn't live
// out. SelectMI should have a kill flag on CPSR.		// out. SelectMI should have a kill flag on CPSR.
SelectItr->addRegisterKilled(ARM::CPSR, TRI);		SelectItr->addRegisterKilled(ARM::CPSR, TRI);
return true;		return true;
}		}

		/// Adds logic in loop entry MBB to calculate loop iteration count and adds
		/// t2WhileLoopSetup and t2WhileLoopStart to generate WLS loop
		static Register genTPEntry(MachineBasicBlock *TpEntry,
		dmgreen: Can you look into all these clang-tidy errors. LLVM usually uses CamelCase for variable names, and the MBB_ looks a little odd. They should start with a capital and I would drop the "t2", that's not adding much.
		MachineBasicBlock *TpLoopBody,
		MachineBasicBlock *TpExit, Register OpSizeReg,
		const TargetInstrInfo *TII, DebugLoc Dl,
		MachineRegisterInfo &MRI) {

		// Calculates loop iteration count = ceil(n/16) = ((n + 15) & (-16)) / 16.
		Register AddDestReg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
		BuildMI(TpEntry, Dl, TII->get(ARM::t2ADDri), AddDestReg)
		.addUse(OpSizeReg)
		.addImm(15)
		.add(predOps(ARMCC::AL))
		.addReg(0);

		Register BicDestReg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
		BuildMI(TpEntry, Dl, TII->get(ARM::t2BICri), BicDestReg)
		.addUse(AddDestReg, RegState::Kill)
		.addImm(15) // BIC clears the bits set in its immediate: x & ~15 == x & (-16)
		.add(predOps(ARMCC::AL))
		.addReg(0);

		Register LsrDestReg = MRI.createVirtualRegister(&ARM::GPRlrRegClass);
		BuildMI(TpEntry, Dl, TII->get(ARM::t2LSRri), LsrDestReg)
		.addUse(BicDestReg, RegState::Kill)
		.addImm(4)
		.add(predOps(ARMCC::AL))
		.addReg(0);

		Register TotalIterationsReg = MRI.createVirtualRegister(&ARM::GPRlrRegClass);
		BuildMI(TpEntry, Dl, TII->get(ARM::t2WhileLoopSetup), TotalIterationsReg)
		.addUse(LsrDestReg, RegState::Kill);

		BuildMI(TpEntry, Dl, TII->get(ARM::t2WhileLoopStart))
		.addUse(TotalIterationsReg)
		.addMBB(TpExit);

		return TotalIterationsReg;
		}
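
As a quick sanity check of the iteration count described in the comment above, here is a minimal standalone sketch. It is not part of the patch: the names `tpIterationCount` and `tpIterationCountMasked` are made up for illustration, and it assumes a 16-byte vector and that `n + 15` does not overflow 32 bits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of 16-byte iterations needed to copy n bytes,
// i.e. ceil(n / 16), computed as an add of 15 followed by a shift right by 4.
constexpr uint32_t tpIterationCount(uint32_t n) {
  return (n + 15u) >> 4;
}

// The same value with the low four bits masked off explicitly before the
// division, matching the ((n + 15) & (-16)) / 16 form.
constexpr uint32_t tpIterationCountMasked(uint32_t n) {
  return ((n + 15u) & ~15u) / 16u;
}

static_assert(tpIterationCount(0) == 0, "empty copy needs no iterations");
static_assert(tpIterationCount(16) == 1, "exactly one full vector");
static_assert(tpIterationCount(17) == 2, "one full vector plus a tail");
```

Because the shift already discards the low four bits, the masked and unmasked forms agree for every `n` in range; the mask is shown only to mirror the formula in the comment.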

		/// Adds logic in the loopBody MBB to generate MVE_VCTP, t2LoopDec and
		/// t2LoopEnd. These are used by later passes to generate tail predicated
		/// loops.
		static void genTPLoopBody(MachineBasicBlock *TpLoopBody,
		MachineBasicBlock *TpEntry, MachineBasicBlock *TpExit,
		const TargetInstrInfo *TII, DebugLoc Dl,
		MachineRegisterInfo &MRI, Register OpSrcReg,
		Register OpDestReg, Register ElementCountReg,
		Register TotalIterationsReg) {

		// First insert 4 PHI nodes for: Current pointer to Src, Dest array, loop
		// iteration counter, predication counter.
		// Current position in the src array
		Register SrcPhiReg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
		Register CurrSrcReg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
		BuildMI(TpLoopBody, Dl, TII->get(ARM::PHI), SrcPhiReg)
		.addUse(OpSrcReg)
		.addMBB(TpEntry)
		.addUse(CurrSrcReg)
		.addMBB(TpLoopBody);

		// Current position in the dest array
		Register DestPhiReg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
		Register CurrDestReg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
		BuildMI(TpLoopBody, Dl, TII->get(ARM::PHI), DestPhiReg)
		.addUse(OpDestReg)
		.addMBB(TpEntry)
		.addUse(CurrDestReg)
		.addMBB(TpLoopBody);

		// Current loop counter
		Register LoopCounterPhiReg = MRI.createVirtualRegister(&ARM::GPRlrRegClass);
		Register RemainingLoopIterationsReg =
		MRI.createVirtualRegister(&ARM::GPRlrRegClass);
		BuildMI(TpLoopBody, Dl, TII->get(ARM::PHI), LoopCounterPhiReg)
		.addUse(TotalIterationsReg)
		.addMBB(TpEntry)
		.addUse(RemainingLoopIterationsReg)
		.addMBB(TpLoopBody);

		// Predication counter
		Register PredCounterPhiReg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
		Register RemainingElementsReg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
		BuildMI(TpLoopBody, Dl, TII->get(ARM::PHI), PredCounterPhiReg)
		.addUse(ElementCountReg)
		.addMBB(TpEntry)
		.addUse(RemainingElementsReg)
		.addMBB(TpLoopBody);

		// Pass predication counter to VCTP
		Register VccrReg = MRI.createVirtualRegister(&ARM::VCCRRegClass);
		BuildMI(TpLoopBody, Dl, TII->get(ARM::MVE_VCTP8), VccrReg)
		.addUse(PredCounterPhiReg)
		.addImm(ARMVCC::None)
		.addReg(0);

		BuildMI(TpLoopBody, Dl, TII->get(ARM::t2SUBri), RemainingElementsReg)
		.addUse(PredCounterPhiReg)
		.addImm(16)
		.add(predOps(ARMCC::AL))
		.addReg(0);

		// VLDRB and VSTRB instructions, predicated using VPR
		Register LoadedValueReg = MRI.createVirtualRegister(&ARM::MQPRRegClass);
		BuildMI(TpLoopBody, Dl, TII->get(ARM::MVE_VLDRBU8_post))
		.addDef(CurrSrcReg)
		.addDef(LoadedValueReg)
		.addReg(SrcPhiReg)
		.addImm(16)
		.addImm(ARMVCC::Then)
		.addUse(VccrReg);

		BuildMI(TpLoopBody, Dl, TII->get(ARM::MVE_VSTRBU8_post))
		.addDef(CurrDestReg)
		.addUse(LoadedValueReg, RegState::Kill)
		.addReg(DestPhiReg)
		.addImm(16)
		.addImm(ARMVCC::Then)
		.addUse(VccrReg);

		// Add the pseudoInstrs for decrementing the loop counter and marking the
		// end: t2LoopDec and t2LoopEnd
		BuildMI(TpLoopBody, Dl, TII->get(ARM::t2LoopDec), RemainingLoopIterationsReg)
		.addUse(LoopCounterPhiReg)
		.addImm(1);

		BuildMI(TpLoopBody, Dl, TII->get(ARM::t2LoopEnd))
		.addUse(RemainingLoopIterationsReg)
		.addMBB(TpLoopBody);

		BuildMI(TpLoopBody, Dl, TII->get(ARM::t2B))
		.addMBB(TpExit)
		.add(predOps(ARMCC::AL));
		}
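
To make the behaviour of the generated loop body easier to follow, the following is a scalar model of what the predicated loop is meant to do. It is only a sketch under assumptions: `tailPredicatedCopy` is a hypothetical name, the inner `for` loop stands in for one predicated 16-byte vector load/store pair, and the `while` guard stands in for the WLS entry check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar model of the tail-predicated memcpy loop.
// Returns the number of loop iterations that were executed.
inline uint32_t tailPredicatedCopy(uint8_t *dst, const uint8_t *src,
                                   uint32_t n) {
  constexpr int64_t kLanes = 16;  // bytes per 128-bit vector
  // Entry block: iteration count = ceil(n / 16), computed in 64 bits.
  uint64_t iterations = (static_cast<uint64_t>(n) + 15) >> 4;
  int64_t remaining = n;          // predication counter
  std::size_t offset = 0;         // models the post-incremented pointers
  uint32_t executed = 0;

  while (iterations != 0) {       // while-loop start: skipped when count is 0
    // VCTP-style predicate: lane i is active only if i < remaining.
    int64_t active = remaining < kLanes ? remaining : kLanes;
    for (int64_t lane = 0; lane < active; ++lane)
      dst[offset + lane] = src[offset + lane];  // predicated load + store
    offset += kLanes;             // post-increment both pointers by 16
    remaining -= kLanes;          // element counter decreases by 16
    --iterations;                 // loop counter decrement
    ++executed;
  }
  return executed;
}
```

For a 37-byte copy this model runs three iterations, with the last one touching only five lanes, so bytes past the requested length are never written.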

MachineBasicBlock *		MachineBasicBlock *
ARMTargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI,		ARMTargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI,
MachineBasicBlock *BB) const {		MachineBasicBlock *BB) const {
const TargetInstrInfo *TII = Subtarget->getInstrInfo();		const TargetInstrInfo *TII = Subtarget->getInstrInfo();
DebugLoc dl = MI.getDebugLoc();		DebugLoc dl = MI.getDebugLoc();
bool isThumb2 = Subtarget->isThumb2();		bool isThumb2 = Subtarget->isThumb2();
switch (MI.getOpcode()) {		switch (MI.getOpcode()) {
default: {		default: {
[... 10 lines not shown ...]	BuildMI(*BB, MI, dl, TII->get(ARM::tLDMIA_UPD))
.add(MI.getOperand(3)) // PredImm		.add(MI.getOperand(3)) // PredImm
.add(MI.getOperand(4)) // PredReg		.add(MI.getOperand(4)) // PredReg
.add(MI.getOperand(0)) // Rt		.add(MI.getOperand(0)) // Rt
.cloneMemRefs(MI);		.cloneMemRefs(MI);
MI.eraseFromParent();		MI.eraseFromParent();
return BB;		return BB;
}		}

		case ARM::MVE_MEMCPYLOOPINST: {

		// Transformation below expands MVE_MEMCPYLOOPINST Pseudo instruction
		// into a Tail Predicated (TP) Loop. It adds the instructions to calculate
		// the iteration count (= ceil(size_in_bytes / 16)) in the TP entry block and
		// adds the relevant instructions in the TP loop Body for generation of a
		// WLSTP loop.

		// Below is relevant portion of the CFG after the transformation.
		// The Machine Basic Blocks are shown along with branch conditions (in
		// brackets). Note that TP entry/exit MBBs depict the entry/exit of this
		// portion of the CFG and may not necessarily be the entry/exit of the
		// function.

		// (Relevant) CFG after transformation:
		//               TP entry MBB
		//                    |
		//          |-----------------|
		//       (n <= 0)          (n > 0)
		//          |                 |
		//          |         TP loop Body MBB<--|
		dmgreen: Great comment by the way. Is it possible to make the loop MBB look a bit like a loop? To show there is a backedge too.
		//          |                |           |
		//           \               |___________|
		//            \             /
		//              TP exit MBB

		MachineFunction *MF = BB->getParent();
		MachineFunctionProperties &Properties = MF->getProperties();
		MachineRegisterInfo &MRI = MF->getRegInfo();

		Register OpDestReg = MI.getOperand(0).getReg();
		Register OpSrcReg = MI.getOperand(1).getReg();
		Register OpSizeReg = MI.getOperand(2).getReg();

		// Allocate the required MBBs and add to parent function.
		MachineBasicBlock *TpEntry = BB;
		MachineBasicBlock *TpLoopBody = MF->CreateMachineBasicBlock();
		MachineBasicBlock *TpExit;

		MF->push_back(TpLoopBody);

		// If any instructions are present in the current block after
		// MVE_MEMCPYLOOPINST, split the current block and move the instructions
		// into the newly created exit block. If there are no instructions
		// add an explicit branch to the FallThrough block and then split.
		//
		dmgreen: When will this happen?
		malharJ: This happens if MVE_MEMCPYLOOPINST is the only instruction in the block. splitAt() returns the same block if there is nothing after the instruction at which the split is done. This happens when for loops are implicitly converted to memcpys: the memcpy call ends up being the only instruction in the preheader. There is already a test case for this as test2 in llvm/test/CodeGen/Thumb2/mve_tp_loop.mir
		dmgreen: OK. I thought it was more eager about putting branches on the end of blocks, even if they are fallthroughs.
		// The split is required for two reasons:
		// 1) A terminator(t2WhileLoopStart) will be placed at that site.
		// 2) Since a TPLoopBody will be added later, any phis in successive blocks
		// need to be updated. splitAt() already handles this.
		TpExit = BB->splitAt(MI, false);
		if (TpExit == BB) {
		assert(BB->canFallThrough() &&
		"Exit block must be FallThrough of the block containing memcpy");
		TpExit = BB->getFallThrough();
		BuildMI(BB, dl, TII->get(ARM::t2B))
		.addMBB(TpExit)
		.add(predOps(ARMCC::AL));
		TpExit = BB->splitAt(MI, false);
		}

		// Add logic for iteration count
		Register TotalIterationsReg =
		genTPEntry(TpEntry, TpLoopBody, TpExit, OpSizeReg, TII, dl, MRI);

		dmgreen: -> "for a more natural layout"? I think there may be benefits from getting the order roughly correct at this stage, if we are relying on WLS branches. They can be fixed up later, but if we get them more correct at this point, that can only help.
		// Add the vectorized (and predicated) loads/store instructions
		genTPLoopBody(TpLoopBody, TpEntry, TpExit, TII, dl, MRI, OpSrcReg,
		OpDestReg, OpSizeReg, TotalIterationsReg);

		// Required to avoid conflict with the MachineVerifier during testing.
		Properties.reset(MachineFunctionProperties::Property::NoPHIs);

		// Connect the blocks
		TpEntry->addSuccessor(TpLoopBody);
		TpLoopBody->addSuccessor(TpLoopBody);
		TpLoopBody->addSuccessor(TpExit);

		// Reorder for a more natural layout
		TpLoopBody->moveAfter(TpEntry);
		TpExit->moveAfter(TpLoopBody);

		// Finally, remove the memcpy Pseudo Instruction
		MI.eraseFromParent();

		// Return the exit block as it may contain other instructions requiring a
		// custom inserter
		return TpExit;
		}

// The Thumb2 pre-indexed stores have the same MI operands, they just		// The Thumb2 pre-indexed stores have the same MI operands, they just
// define them differently in the .td files from the isel patterns, so		// define them differently in the .td files from the isel patterns, so
// they need pseudos.		// they need pseudos.
case ARM::t2STR_preidx:		case ARM::t2STR_preidx:
MI.setDesc(TII->get(ARM::t2STR_PRE));		MI.setDesc(TII->get(ARM::t2STR_PRE));
return BB;		return BB;
case ARM::t2STRB_preidx:		case ARM::t2STRB_preidx:
MI.setDesc(TII->get(ARM::t2STRB_PRE));		MI.setDesc(TII->get(ARM::t2STRB_PRE));
[... 8,689 lines not shown ...]

llvm/lib/Target/ARM/ARMInstrMVE.td

[... 6,859 lines not shown ...]	class MVE_WLSTP<string asm, bits<2> size>
bits<11> label;		bits<11> label;
let Inst{13} = 0b0;		let Inst{13} = 0b0;
let Inst{11} = label{0};		let Inst{11} = label{0};
let Inst{10-1} = label{10-1};		let Inst{10-1} = label{10-1};
let isBranch = 1;		let isBranch = 1;
let isTerminator = 1;		let isTerminator = 1;
}		}

		def SDT_MVEMEMCPYLOOPNODE
		: SDTypeProfile<0, 3, [SDTCisPtrTy<0>, SDTCisPtrTy<1>, SDTCisVT<2, i32>]>;
		def MVE_MEMCPYLOOPNODE : SDNode<"ARMISD::MEMCPYLOOP", SDT_MVEMEMCPYLOOPNODE,
		[SDNPHasChain, SDNPMayStore, SDNPMayLoad]>;

		let usesCustomInserter = 1, hasNoSchedulingInfo = 1 in {
		def MVE_MEMCPYLOOPINST : PseudoInst<(outs),
		dmgreen: Can you improve this formatting. If you clang-formatted it, it doesn't do well with .td files and manually making it look more like the others will do better.
		malharJ: Had clang formatted it but yeah doesn't look good. Updated now.
		(ins rGPR:$dst, rGPR:$src, rGPR:$sz),
		NoItinerary,
		[(MVE_MEMCPYLOOPNODE rGPR:$dst, rGPR:$src, rGPR:$sz)]>;
		}

def MVE_DLSTP_8 : MVE_DLSTP<"dlstp.8", 0b00>;		def MVE_DLSTP_8 : MVE_DLSTP<"dlstp.8", 0b00>;
def MVE_DLSTP_16 : MVE_DLSTP<"dlstp.16", 0b01>;		def MVE_DLSTP_16 : MVE_DLSTP<"dlstp.16", 0b01>;
def MVE_DLSTP_32 : MVE_DLSTP<"dlstp.32", 0b10>;		def MVE_DLSTP_32 : MVE_DLSTP<"dlstp.32", 0b10>;
def MVE_DLSTP_64 : MVE_DLSTP<"dlstp.64", 0b11>;		def MVE_DLSTP_64 : MVE_DLSTP<"dlstp.64", 0b11>;

def MVE_WLSTP_8 : MVE_WLSTP<"wlstp.8", 0b00>;		def MVE_WLSTP_8 : MVE_WLSTP<"wlstp.8", 0b00>;
def MVE_WLSTP_16 : MVE_WLSTP<"wlstp.16", 0b01>;		def MVE_WLSTP_16 : MVE_WLSTP<"wlstp.16", 0b01>;
def MVE_WLSTP_32 : MVE_WLSTP<"wlstp.32", 0b10>;		def MVE_WLSTP_32 : MVE_WLSTP<"wlstp.32", 0b10>;
[... 547 lines not shown ...]

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp

	//===-- ARMSelectionDAGInfo.cpp - ARM SelectionDAG Info -------------------===//			//===-- ARMSelectionDAGInfo.cpp - ARM SelectionDAG Info -------------------===//
	//			//
	// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.			// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
	// See https://llvm.org/LICENSE.txt for license information.			// See https://llvm.org/LICENSE.txt for license information.
	// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception			// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//
	//			//
	// This file implements the ARMSelectionDAGInfo class.			// This file implements the ARMSelectionDAGInfo class.
	//			//
	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#include "ARMTargetMachine.h"			#include "ARMTargetMachine.h"
				#include "ARMTargetTransformInfo.h"
	#include "llvm/CodeGen/SelectionDAG.h"			#include "llvm/CodeGen/SelectionDAG.h"
	#include "llvm/IR/DerivedTypes.h"			#include "llvm/IR/DerivedTypes.h"
				#include "llvm/Support/CommandLine.h"
	using namespace llvm;			using namespace llvm;

	#define DEBUG_TYPE "arm-selectiondag-info"			#define DEBUG_TYPE "arm-selectiondag-info"

				cl::opt<TPLoop::MemTransfer> EnableMemtransferTPLoop(
				"arm-memtransfer-tploop", cl::Hidden,
				cl::desc("Control conversion of memcpy to "
				"Tail predicated loops (WLSTP)"),
				cl::init(TPLoop::ForceDisabled),
				cl::values(clEnumValN(TPLoop::ForceDisabled, "force-disabled",
				"Don't convert memcpy to TP loop."),
				clEnumValN(TPLoop::ForceEnabled, "force-enabled",
				"Always convert memcpy to TP loop."),
				clEnumValN(TPLoop::Allow, "allow",
				"Allow (may be subject to certain conditions) "
				"conversion of memcpy to TP loop.")));

	// Emit, if possible, a specialized version of the given Libcall. Typically this			// Emit, if possible, a specialized version of the given Libcall. Typically this
	// means selecting the appropriately aligned version, but we also convert memset			// means selecting the appropriately aligned version, but we also convert memset
	// of 0 into memclr.			// of 0 into memclr.
	SDValue ARMSelectionDAGInfo::EmitSpecializedLibcall(			SDValue ARMSelectionDAGInfo::EmitSpecializedLibcall(
	SelectionDAG &DAG, const SDLoc &dl, SDValue Chain, SDValue Dst, SDValue Src,			SelectionDAG &DAG, const SDLoc &dl, SDValue Chain, SDValue Dst, SDValue Src,
	SDValue Size, unsigned Align, RTLIB::Libcall LC) const {			SDValue Size, unsigned Align, RTLIB::Libcall LC) const {
	const ARMSubtarget &Subtarget =			const ARMSubtarget &Subtarget =
	DAG.getMachineFunction().getSubtarget<ARMSubtarget>();			DAG.getMachineFunction().getSubtarget<ARMSubtarget>();
	[... 97 lines not shown ...]
	}			}

	SDValue ARMSelectionDAGInfo::EmitTargetCodeForMemcpy(			SDValue ARMSelectionDAGInfo::EmitTargetCodeForMemcpy(
	SelectionDAG &DAG, const SDLoc &dl, SDValue Chain, SDValue Dst, SDValue Src,			SelectionDAG &DAG, const SDLoc &dl, SDValue Chain, SDValue Dst, SDValue Src,
	SDValue Size, Align Alignment, bool isVolatile, bool AlwaysInline,			SDValue Size, Align Alignment, bool isVolatile, bool AlwaysInline,
	MachinePointerInfo DstPtrInfo, MachinePointerInfo SrcPtrInfo) const {			MachinePointerInfo DstPtrInfo, MachinePointerInfo SrcPtrInfo) const {
	const ARMSubtarget &Subtarget =			const ARMSubtarget &Subtarget =
	DAG.getMachineFunction().getSubtarget<ARMSubtarget>();			DAG.getMachineFunction().getSubtarget<ARMSubtarget>();
				ConstantSDNode *ConstantSize = dyn_cast<ConstantSDNode>(Size);

				auto GenInlineTP = [&](const ARMSubtarget &Subtarget,
				dmgreen: `[=]` -> `[&]` is more standard, even if it doesn't make a lot of difference here.
				const SelectionDAG &DAG) {
				dmgreen: Can we add an option that turns this inline memcpy on/off. If the option is true, we always use the MEMCPYLOOP, if it's false we never do, and if it's unset we use this default logic. Also consider pulling the if logic into a lambda for readability.
				auto &F = DAG.getMachineFunction().getFunction();
				dmgreen: Probably better as: `if (DAG.getMachineFunction().getFunction().hasOptNone()) return false; if (!ConstantSize && Alignment >= Align(4)) return true; if (...) ...` The EnableMemcpyTPLoop logic could be in here too, as it's just returning true/false at the right time. What do we do for -Oz and -Os?
				malharJ: Ok. And I guess it would be best to disable in the case of -Os/-Oz in case there are multiple memcpys in the source. I've made the update and added tests as well.
				dmgreen: func -> Func. Or maybe just F, which is quite common in LLVM.
				if (!EnableMemtransferTPLoop)
				dmgreen (Not Done): Perhaps use == TPLoop::ForceDisabled to make it clear.
				return false;
				if (EnableMemtransferTPLoop == TPLoop::ForceEnabled)
				return true;
				// Do not generate inline TP loop if optimizations are disabled,
				// or if optimization for size (-Os or -Oz) is on.
				if (F.hasOptNone() || F.hasOptSize())
				dmgreen: `if (EnableMemcpyTPLoop == cl::BOU_FALSE) return false;` is probably better, if it works as I expect. That keeps the indenting down, and the last if currently isn't in the block it looks like it should be. Oh, and move EnableMemcpyTPLoop above the OptSize/OptNone, in case we want to try and force it. (Even if OptNone doesn't work, using that combo is unlikely to be useful at any rate.)
				malharJ: Ok, my bad there with the braces. I've moved the cases when the cli option is set to be of higher priority than the optNone/optSize cases, but the unset case is of lower priority than the optNone/optSize cases since the user is no longer passing the cli option. Hopefully that sounds sensible.
				return false;
				// If cli option is unset
				if (!ConstantSize && Alignment >= Align(4))
				return true;
				if (ConstantSize &&
				dmgreen: Add a return false at the end?
				malharJ: Yep, had missed that out.
				ConstantSize->getZExtValue() > Subtarget.getMaxInlineSizeThreshold() &&
				ConstantSize->getZExtValue() <
				Subtarget.getMaxTPLoopInlineSizeThreshold())
				return true;
				return false;
				};

				if (Subtarget.hasMVEIntegerOps() && GenInlineTP(Subtarget, DAG))
				return DAG.getNode(ARMISD::MEMCPYLOOP, dl, MVT::Other, Chain, Dst, Src,
				DAG.getZExtOrTrunc(Size, dl, MVT::i32));
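
The priority order discussed in the inline comments above (command-line override first, then optnone/optsize, then the size and alignment heuristics) can be sketched outside of LLVM as follows. This is only an illustrative model: `MemcpyQuery` and `shouldEmitTPLoop` are hypothetical names, the enum mirrors the three option values shown in the diff, and the thresholds are passed in as parameters rather than read from a subtarget.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Mirrors the three values of the arm-memtransfer-tploop option.
enum class MemTransfer { ForceDisabled, ForceEnabled, Allow };

// Hypothetical summary of what the lambda inspects.
struct MemcpyQuery {
  bool OptNone = false;                  // function is marked optnone
  bool OptSize = false;                  // function is optimized for size
  std::optional<uint64_t> ConstantSize;  // empty if the size is not constant
  uint64_t Alignment = 1;                // known alignment in bytes
};

// Hypothetical stand-in for the GenInlineTP lambda.
inline bool shouldEmitTPLoop(MemTransfer Mode, const MemcpyQuery &Q,
                             uint64_t MaxInlineSize,
                             uint64_t MaxTPLoopInlineSize) {
  // The command-line option has the highest priority.
  if (Mode == MemTransfer::ForceDisabled)
    return false;
  if (Mode == MemTransfer::ForceEnabled)
    return true;
  // In "allow" mode, never inline the loop at -O0 or when optimizing for size.
  if (Q.OptNone || Q.OptSize)
    return false;
  // Unknown size: only worthwhile with at least 4-byte alignment.
  if (!Q.ConstantSize)
    return Q.Alignment >= 4;
  // Constant size: too big for the plain inline expansion, but still below
  // the tail-predicated loop threshold.
  return *Q.ConstantSize > MaxInlineSize &&
         *Q.ConstantSize < MaxTPLoopInlineSize;
}
```

With a lower threshold of 64 (the value returned by `getMaxInlineSizeThreshold()` in this diff) and an assumed upper threshold of 128, a constant 100-byte copy would take the loop path in "allow" mode, while a 32-byte copy would not.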

	// Do repeated 4-byte loads and stores. To be improved.			// Do repeated 4-byte loads and stores. To be improved.
	// This requires 4-byte alignment.			// This requires 4-byte alignment.
	if (Alignment < Align(4))			if (Alignment < Align(4))
	return SDValue();			return SDValue();
	// This requires the copy size to be a constant, preferably			// This requires the copy size to be a constant, preferably
	// within a subtarget-specific limit.			// within a subtarget-specific limit.
	ConstantSDNode *ConstantSize = dyn_cast<ConstantSDNode>(Size);
	if (!ConstantSize)			if (!ConstantSize)
	return EmitSpecializedLibcall(DAG, dl, Chain, Dst, Src, Size,			return EmitSpecializedLibcall(DAG, dl, Chain, Dst, Src, Size,
	Alignment.value(), RTLIB::MEMCPY);			Alignment.value(), RTLIB::MEMCPY);
	uint64_t SizeVal = ConstantSize->getZExtValue();			uint64_t SizeVal = ConstantSize->getZExtValue();
	if (!AlwaysInline && SizeVal > Subtarget.getMaxInlineSizeThreshold())			if (!AlwaysInline && SizeVal > Subtarget.getMaxInlineSizeThreshold())
	return EmitSpecializedLibcall(DAG, dl, Chain, Dst, Src, Size,			return EmitSpecializedLibcall(DAG, dl, Chain, Dst, Src, Size,
	Alignment.value(), RTLIB::MEMCPY);			Alignment.value(), RTLIB::MEMCPY);
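
The priority order discussed in the inline comments above can be restated as a standalone function. This is only a sketch for readers of the thread, not the patch's code: `CopyQuery` and `shouldEmitTPLoop` are hypothetical names, the `ForceDisabled` check is assumed from the `return false;` that opens the visible part of the lambda, and the two thresholds are the values returned by `getMaxInlineSizeThreshold()` and `getMaxTPLoopInlineSizeThreshold()` in this revision.

```cpp
#include <cstdint>

// Mirrors TPLoop::MemTransfer from ARMTargetTransformInfo.h in this diff.
enum class MemTransfer { ForceDisabled = 0, ForceEnabled, Allow };

// Hypothetical bundle of the inputs the GenInlineTP lambda looks at.
struct CopyQuery {
  bool HasMVE;            // Subtarget.hasMVEIntegerOps()
  MemTransfer Option;     // value of the command-line option
  bool OptNoneOrOptSize;  // F.hasOptNone() || F.hasOptSize()
  unsigned AlignBytes;    // known alignment of the copy
  bool SizeIsConstant;    // true when the size is a ConstantSDNode
  std::uint64_t SizeBytes; // only meaningful when SizeIsConstant is true
};

// Threshold values taken from ARMSubtarget.h in this revision.
constexpr std::uint64_t MaxInlineSizeThreshold = 64;
constexpr std::uint64_t MaxTPLoopInlineSizeThreshold = 128;

// Returns true when the memcpy would be lowered to a tail-predicated loop.
bool shouldEmitTPLoop(const CopyQuery &Q) {
  if (!Q.HasMVE)
    return false;
  // An explicit option setting wins over everything else (assumed order).
  if (Q.Option == MemTransfer::ForceDisabled)
    return false;
  if (Q.Option == MemTransfer::ForceEnabled)
    return true;
  // No inline loop at -O0 or when optimizing for size.
  if (Q.OptNoneOrOptSize)
    return false;
  // Run-time size: require at least 4-byte alignment.
  if (!Q.SizeIsConstant)
    return Q.AlignBytes >= 4;
  // Constant size: only the window (64, 128) is handled by the loop.
  return Q.SizeBytes > MaxInlineSizeThreshold &&
         Q.SizeBytes < MaxTPLoopInlineSizeThreshold;
}
```

Under these assumptions, a constant size of 127 selects the loop, 128 falls back to the libcall, and 60 stays on the existing inline load/store path, which matches what test6, test7 and test8 in mve-tp-loop.ll check.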


llvm/lib/Target/ARM/ARMSubtarget.h

ARMSubtarget(const Triple &TT, const std::string &CPU, const std::string &FS,
bool MinSize = false);		bool MinSize = false);

/// getMaxInlineSizeThreshold - Returns the maximum memset / memcpy size		/// getMaxInlineSizeThreshold - Returns the maximum memset / memcpy size
/// that still makes it profitable to inline the call.		/// that still makes it profitable to inline the call.
unsigned getMaxInlineSizeThreshold() const {		unsigned getMaxInlineSizeThreshold() const {
return 64;		return 64;
}		}

		/// getMaxTPLoopSizeThreshold - Returns the maximum memcpy size
		/// that still makes it profitable to inline the call as a Tail
		/// Predicated loop
		unsigned getMaxTPLoopInlineSizeThreshold() const { return 128; }

/// ParseSubtargetFeatures - Parses features string setting specified		/// ParseSubtargetFeatures - Parses features string setting specified
/// subtarget options. Definition of function is auto generated by tblgen.		/// subtarget options. Definition of function is auto generated by tblgen.
void ParseSubtargetFeatures(StringRef CPU, StringRef TuneCPU, StringRef FS);		void ParseSubtargetFeatures(StringRef CPU, StringRef TuneCPU, StringRef FS);

/// initializeSubtargetDependencies - Initializes using a CPU and feature string		/// initializeSubtargetDependencies - Initializes using a CPU and feature string
/// so that we can use initializer lists for subtarget initialization.		/// so that we can use initializer lists for subtarget initialization.
ARMSubtarget &initializeSubtargetDependencies(StringRef CPU, StringRef FS);		ARMSubtarget &initializeSubtargetDependencies(StringRef CPU, StringRef FS);


llvm/lib/Target/ARM/ARMTargetTransformInfo.h

enum Mode {
Disabled = 0,		Disabled = 0,
EnabledNoReductions,		EnabledNoReductions,
Enabled,		Enabled,
ForceEnabledNoReductions,		ForceEnabledNoReductions,
ForceEnabled		ForceEnabled
};		};
}		}

		// For controlling conversion of memcpy into Tail Predicated loop.
		namespace TPLoop {
		enum MemTransfer { ForceDisabled = 0, ForceEnabled, Allow };
		}

class ARMTTIImpl : public BasicTTIImplBase<ARMTTIImpl> {		class ARMTTIImpl : public BasicTTIImplBase<ARMTTIImpl> {
using BaseT = BasicTTIImplBase<ARMTTIImpl>;		using BaseT = BasicTTIImplBase<ARMTTIImpl>;
using TTI = TargetTransformInfo;		using TTI = TargetTransformInfo;

friend BaseT;		friend BaseT;

const ARMSubtarget *ST;		const ARMSubtarget *ST;
const ARMTargetLowering *TLI;		const ARMTargetLowering *TLI;

llvm/test/CodeGen/Thumb2/LowOverheadLoops/memcall.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -verify-machineinstrs -tail-predication=enabled -o - %s | FileCheck %s			; RUN: llc --arm-memtransfer-tploop=allow -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -verify-machineinstrs -tail-predication=enabled -o - %s | FileCheck %s

	define void @test_memcpy(i32* nocapture %x, i32* nocapture readonly %y, i32 %n, i32 %m) {			define void @test_memcpy(i32* nocapture %x, i32* nocapture readonly %y, i32 %n, i32 %m) {
	; CHECK-LABEL: test_memcpy:			; CHECK-LABEL: test_memcpy:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: .save {r4, r5, r6, r7, r8, r9, lr}			; CHECK-NEXT: .save {r4, r5, r6, r7, lr}
	; CHECK-NEXT: push.w {r4, r5, r6, r7, r8, r9, lr}			; CHECK-NEXT: push {r4, r5, r6, r7, lr}
	; CHECK-NEXT: .pad #4
	; CHECK-NEXT: sub sp, #4
	; CHECK-NEXT: cmp r2, #1			; CHECK-NEXT: cmp r2, #1
	; CHECK-NEXT: blt .LBB0_3			; CHECK-NEXT: blt .LBB0_5
	; CHECK-NEXT: @ %bb.1: @ %for.body.preheader			; CHECK-NEXT: @ %bb.1: @ %for.body.preheader
	; CHECK-NEXT: mov r8, r3			; CHECK-NEXT: lsl.w r12, r3, #2
	; CHECK-NEXT: mov r5, r2			; CHECK-NEXT: movs r7, #0
	; CHECK-NEXT: mov r9, r1			; CHECK-NEXT: b .LBB0_2
	; CHECK-NEXT: mov r7, r0
	; CHECK-NEXT: lsls r4, r3, #2
	; CHECK-NEXT: movs r6, #0
	; CHECK-NEXT: .LBB0_2: @ %for.body			; CHECK-NEXT: .LBB0_2: @ %for.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Loop Header: Depth=1
	; CHECK-NEXT: adds r0, r7, r6			; CHECK-NEXT: @ Child Loop BB0_4 Depth 2
	; CHECK-NEXT: add.w r1, r9, r6			; CHECK-NEXT: adds r4, r1, r7
	; CHECK-NEXT: mov r2, r8			; CHECK-NEXT: adds r5, r0, r7
	; CHECK-NEXT: bl __aeabi_memcpy4			; CHECK-NEXT: mov r6, r3
	; CHECK-NEXT: add r6, r4			; CHECK-NEXT: wlstp.8 lr, r6, .LBB0_3
				dmgreen: Why does this not use r3 directly?
				malharJ (Author): This seems to be an issue with generating a preheader during the transform. The phi-node-elimination pass lowers the phi instructions (in the TP loop body) to COPY operations placed in the preheader. In this instance, the copy/mov can be seen below on line 32: mov r7, r3. For now I've fixed this by not generating an extra preheader during the transform, so the mov ends up above the t2WhileLoopStartLR, and overall it seems to work. Please see my comment about the latest changes for more details.
	; CHECK-NEXT: subs r5, #1			; CHECK-NEXT: b .LBB0_4
	; CHECK-NEXT: bne .LBB0_2			; CHECK-NEXT: .LBB0_3: @ %for.body
	; CHECK-NEXT: .LBB0_3: @ %for.cond.cleanup			; CHECK-NEXT: @ in Loop: Header=BB0_2 Depth=1
	; CHECK-NEXT: add sp, #4			; CHECK-NEXT: add r7, r12
	; CHECK-NEXT: pop.w {r4, r5, r6, r7, r8, r9, pc}			; CHECK-NEXT: subs r2, #1
				; CHECK-NEXT: beq .LBB0_5
				; CHECK-NEXT: b .LBB0_2
				; CHECK-NEXT: .LBB0_4: @ Parent Loop BB0_2 Depth=1
				; CHECK-NEXT: @ => This Inner Loop Header: Depth=2
				; CHECK-NEXT: vldrb.u8 q0, [r4], #16
				; CHECK-NEXT: vstrb.8 q0, [r5], #16
				; CHECK-NEXT: letp lr, .LBB0_4
				; CHECK-NEXT: b .LBB0_3
				; CHECK-NEXT: .LBB0_5: @ %for.cond.cleanup
				; CHECK-NEXT: pop {r4, r5, r6, r7, pc}
	entry:			entry:
	%cmp8 = icmp sgt i32 %n, 0			%cmp8 = icmp sgt i32 %n, 0
	br i1 %cmp8, label %for.body, label %for.cond.cleanup			br i1 %cmp8, label %for.body, label %for.cond.cleanup

	for.cond.cleanup: ; preds = %for.body, %entry			for.cond.cleanup: ; preds = %for.body, %entry
	ret void			ret void

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body

llvm/test/CodeGen/Thumb2/mve-tp-loop.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc --arm-memtransfer-tploop=allow -mtriple=thumbv8.1m.main-none-eabi -mattr=+mve --verify-machineinstrs %s -o - | FileCheck %s

				; Check that WLSTP loop is not generated for alignment < 4
				; void test1(char* dest, char* src, int n){
				; memcpy(dest, src, n);
				; }

				declare void @llvm.memcpy.p0i8.p0i8.i32(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i32, i1 immarg) #1
				declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg) #1

				define void @test1(i8* noalias nocapture %X, i8* noalias nocapture readonly %Y, i32 %n){
				; CHECK-LABEL: test1:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: bl __aeabi_memcpy
				; CHECK-NEXT: pop {r7, pc}
				entry:
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 1 %X, i8* align 1 %Y, i32 %n, i1 false)
				ret void
				}


				; Check that WLSTP loop is generated for alignment >= 4
				; void test2(int* restrict X, int* restrict Y, int n){
				; memcpy(X, Y, n);
				; }

				define void @test2(i32* noalias %X, i32* noalias readonly %Y, i32 %n){
				; CHECK-LABEL: test2:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: wlstp.8 lr, r2, .LBB1_2
				; CHECK-NEXT: .LBB1_1: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrb.u8 q0, [r1], #16
				; CHECK-NEXT: vstrb.8 q0, [r0], #16
				; CHECK-NEXT: letp lr, .LBB1_1
				; CHECK-NEXT: .LBB1_2: @ %entry
				; CHECK-NEXT: pop {r7, pc}
				entry:
				%0 = bitcast i32* %X to i8*
				%1 = bitcast i32* %Y to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 4 %0, i8* align 4 %1, i32 %n, i1 false)
				ret void
				}


				; Checks that transform handles some arithmetic on the input arguments.
				; void test3(int* restrict X, int* restrict Y, int n)
				; {
				; memcpy(X+2, Y+3, (n*2)+10);
				; }

				define void @test3(i32* noalias nocapture %X, i32* noalias nocapture readonly %Y, i32 %n) {
				; CHECK-LABEL: test3:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: movs r3, #10
				; CHECK-NEXT: add.w r2, r3, r2, lsl #1
				; CHECK-NEXT: adds r1, #12
				; CHECK-NEXT: adds r0, #8
				; CHECK-NEXT: wlstp.8 lr, r2, .LBB2_2
				; CHECK-NEXT: .LBB2_1: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrb.u8 q0, [r1], #16
				; CHECK-NEXT: vstrb.8 q0, [r0], #16
				; CHECK-NEXT: letp lr, .LBB2_1
				; CHECK-NEXT: .LBB2_2: @ %entry
				; CHECK-NEXT: pop {r7, pc}
				entry:
				%add.ptr = getelementptr inbounds i32, i32* %X, i32 2
				%0 = bitcast i32* %add.ptr to i8*
				%add.ptr1 = getelementptr inbounds i32, i32* %Y, i32 3
				%1 = bitcast i32* %add.ptr1 to i8*
				%mul = shl nsw i32 %n, 1
				%add = add nsw i32 %mul, 10
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* nonnull align 4 %0, i8* nonnull align 4 %1, i32 %add, i1 false)
				ret void
				}


				; Checks that transform handles for loops that are implicitly converted to mempcy
				; void test4(int* restrict X, int* restrict Y, int n){
				; for(int i = 0; i < n; ++i){
				; X[i] = Y[i];
				; }
				; }

				define void @test4(i32* noalias %X, i32* noalias readonly %Y, i32 %n) {
				; CHECK-LABEL: test4:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: cmp r2, #1
				; CHECK-NEXT: it lt
				; CHECK-NEXT: bxlt lr
				; CHECK-NEXT: .LBB3_1: @ %for.body.preheader
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: wlstp.8 lr, r2, .LBB3_3
				; CHECK-NEXT: .LBB3_2: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrb.u8 q0, [r1], #16
				; CHECK-NEXT: vstrb.8 q0, [r0], #16
				; CHECK-NEXT: letp lr, .LBB3_2
				; CHECK-NEXT: .LBB3_3: @ %for.body.preheader
				; CHECK-NEXT: pop.w {r7, lr}
				; CHECK-NEXT: bx lr
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader: ; preds = %entry
				%X.bits = bitcast i32* %X to i8*
				%Y.bits = bitcast i32* %Y to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 4 %X.bits, i8* align 4 %Y.bits, i32 %n, i1 false)
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.body.preheader, %entry
				ret void
				}

				; Checks that transform can handle > i32 size inputs
				define void @test5(i8* noalias %X, i8* noalias %Y, i64 %n){
				; CHECK-LABEL: test5:
				; CHECK: @ %bb.0:
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: wlstp.8 lr, r2, .LBB4_2
				; CHECK-NEXT: .LBB4_1: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrb.u8 q0, [r1], #16
				; CHECK-NEXT: vstrb.8 q0, [r0], #16
				; CHECK-NEXT: letp lr, .LBB4_1
				; CHECK-NEXT: .LBB4_2:
				; CHECK-NEXT: pop {r7, pc}
				call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %X, i8* align 4 %Y, i64 %n, i1 false)
				ret void
				}

				; Checks the transform is applied for constant size inputs below a certain threshold (128 in this case)
				define void @test6(i32* noalias nocapture %X, i32* noalias nocapture readonly %Y, i32 %n) {
				; CHECK-LABEL: test6:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: movs r2, #127
				; CHECK-NEXT: wlstp.8 lr, r2, .LBB5_2
				; CHECK-NEXT: .LBB5_1: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrb.u8 q0, [r1], #16
				; CHECK-NEXT: vstrb.8 q0, [r0], #16
				; CHECK-NEXT: letp lr, .LBB5_1
				; CHECK-NEXT: .LBB5_2: @ %entry
				; CHECK-NEXT: pop {r7, pc}
				entry:
				%0 = bitcast i32* %X to i8*
				%1 = bitcast i32* %Y to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* noundef nonnull align 4 dereferenceable(127) %0, i8* noundef nonnull align 4 dereferenceable(127) %1, i32 127, i1 false)
				ret void
				}

				; Checks the transform is NOT applied for constant size inputs above a certain threshold (128 in this case)
				define void @test7(i32* noalias nocapture %X, i32* noalias nocapture readonly %Y, i32 %n) {
				; CHECK-LABEL: test7:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: movs r2, #128
				; CHECK-NEXT: bl __aeabi_memcpy4
				; CHECK-NEXT: pop {r7, pc}
				entry:
				%0 = bitcast i32* %X to i8*
				%1 = bitcast i32* %Y to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 4 %0, i8* align 4 %1, i32 128, i1 false)
				ret void
				}

				; Checks the transform is NOT applied for constant size inputs below a certain threshold (64 in this case)
				define void @test8(i32* noalias nocapture %X, i32* noalias nocapture readonly %Y, i32 %n) {
				; CHECK-LABEL: test8:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r4, lr}
				; CHECK-NEXT: push {r4, lr}
				; CHECK-NEXT: ldm.w r1!, {r2, r3, r4, r12, lr}
				; CHECK-NEXT: stm.w r0!, {r2, r3, r4, r12, lr}
				; CHECK-NEXT: ldm.w r1!, {r2, r3, r4, r12, lr}
				; CHECK-NEXT: stm.w r0!, {r2, r3, r4, r12, lr}
				; CHECK-NEXT: ldm.w r1, {r2, r3, r4, r12, lr}
				; CHECK-NEXT: stm.w r0, {r2, r3, r4, r12, lr}
				; CHECK-NEXT: pop {r4, pc}
				entry:
				%0 = bitcast i32* %X to i8*
				%1 = bitcast i32* %Y to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 4 %0, i8* align 4 %1, i32 60, i1 false)
				ret void
				}

				; Checks the transform is NOT applied (regardless of alignment) when optimizations are disabled
				define void @test9(i32* noalias nocapture %X, i32* noalias nocapture readonly %Y, i32 %n) #0 {
				; CHECK-LABEL: test9:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: bl __aeabi_memcpy4
				; CHECK-NEXT: pop {r7, pc}
				entry:
				%0 = bitcast i32* %X to i8*
				%1 = bitcast i32* %Y to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 4 %0, i8* align 4 %1, i32 %n, i1 false)
				ret void
				}

				; Checks the transform is NOT applied (regardless of alignment) when optimization for size is on (-Os or -Oz)
				define void @test10(i32* noalias nocapture %X, i32* noalias nocapture readonly %Y, i32 %n) #1 {
				; CHECK-LABEL: test10:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: bl __aeabi_memcpy4
				; CHECK-NEXT: pop {r7, pc}
				entry:
				%0 = bitcast i32* %X to i8*
				%1 = bitcast i32* %Y to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 4 %0, i8* align 4 %1, i32 %n, i1 false)
				ret void
				}

				define void @test11(i8* nocapture %x, i8* nocapture %y, i32 %n) {
				; CHECK-LABEL: test11:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r4, lr}
				; CHECK-NEXT: push {r4, lr}
				; CHECK-NEXT: cmp.w r2, #-1
				; CHECK-NEXT: it gt
				; CHECK-NEXT: popgt {r4, pc}
				; CHECK-NEXT: .LBB10_1: @ %prehead
				; CHECK-NEXT: add.w r3, r2, #15
				; CHECK-NEXT: mov r12, r1
				; CHECK-NEXT: bic r3, r3, #16
				; CHECK-NEXT: mov r4, r0
				; CHECK-NEXT: lsr.w lr, r3, #4
				; CHECK-NEXT: mov r3, r2
				; CHECK-NEXT: subs.w lr, lr, #0
				; CHECK-NEXT: beq .LBB10_3
				; CHECK-NEXT: .LBB10_2: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vctp.8 r3
				; CHECK-NEXT: subs r3, #16
				; CHECK-NEXT: vpstt
				; CHECK-NEXT: vldrbt.u8 q0, [r12], #16
				; CHECK-NEXT: vstrbt.8 q0, [r4], #16
				; CHECK-NEXT: subs.w lr, lr, #1
				; CHECK-NEXT: bne .LBB10_2
				; CHECK-NEXT: b .LBB10_3
				malharJ (Author): Not entirely sure why this isn't a TP loop. I might need to check the ArmLOL pass to see why it's being reverted.
				; CHECK-NEXT: .LBB10_3: @ %for.body
				; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: ldrb r3, [r0], #1
				; CHECK-NEXT: subs r2, #2
				; CHECK-NEXT: strb r3, [r1], #1
				; CHECK-NEXT: bne .LBB10_3
				; CHECK-NEXT: @ %bb.4: @ %for.cond.cleanup
				; CHECK-NEXT: pop {r4, pc}
				entry:
				%cmp6 = icmp slt i32 %n, 0
				br i1 %cmp6, label %prehead, label %for.cond.cleanup

				prehead: ; preds = %entry
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 4 %x, i8* align 4 %y, i32 %n, i1 false)
				br label %for.body

				for.body: ; preds = %for.body, %prehead
				%i.09 = phi i32 [ %inc, %for.body ], [ 0, %prehead ]
				%x.addr.08 = phi i8* [ %add.ptr, %for.body ], [ %x, %prehead ]
				%y.addr.07 = phi i8* [ %add.ptr1, %for.body ], [ %y, %prehead ]
				%add.ptr = getelementptr inbounds i8, i8* %x.addr.08, i32 1
				%add.ptr1 = getelementptr inbounds i8, i8* %y.addr.07, i32 1
				%l = load i8, i8* %x.addr.08, align 1
				store i8 %l, i8* %y.addr.07, align 1
				%inc = add nuw nsw i32 %i.09, 2
				%exitcond.not = icmp eq i32 %inc, %n
				br i1 %exitcond.not, label %for.cond.cleanup, label %for.body

				for.cond.cleanup: ; preds = %entry
				ret void
				}

				attributes #0 = { noinline optnone }
				attributes #1 = { optsize }

llvm/test/CodeGen/Thumb2/mve-tp-loop.mir

This file was added.

				# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
				# RUN: llc -mtriple=thumbv8.1m.main-none-eabi -mattr=+mve -simplify-mir --verify-machineinstrs -run-pass=finalize-isel %s -o - | FileCheck %s
				--- |
				target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "arm-arm-none-eabi"

				; Function Attrs: argmemonly nofree nosync nounwind willreturn
				declare void @llvm.memcpy.p0i8.p0i8.i32(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i32, i1 immarg)

				define void @test1(i32* noalias %X, i32* noalias readonly %Y, i32 %n) {
				entry:
				%0 = bitcast i32* %X to i8*
				%1 = bitcast i32* %Y to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 4 %0, i8* align 4 %1, i32 %n, i1 false)
				ret void
				}

				define void @test2(i32* noalias %X, i32* noalias readonly %Y, i32 %n) {
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader: ; preds = %entry
				%X.bits = bitcast i32* %X to i8*
				%Y.bits = bitcast i32* %Y to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 4 %X.bits, i8* align 4 %Y.bits, i32 %n, i1 false)
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.body.preheader, %entry
				ret void
				}

				...
				---
				name: test1
				tracksRegLiveness: true
				body: |
				bb.0.entry:
				liveins: $r0, $r1, $r2

				; CHECK-LABEL: name: test1
				; CHECK: liveins: $r0, $r1, $r2
				; CHECK: [[COPY:%[0-9]+]]:rgpr = COPY $r2
				; CHECK: [[COPY1:%[0-9]+]]:rgpr = COPY $r1
				; CHECK: [[COPY2:%[0-9]+]]:rgpr = COPY $r0
				; CHECK: [[t2ADDri:%[0-9]+]]:rgpr = t2ADDri [[COPY]], 15, 14 /* CC::al */, $noreg, $noreg
				; CHECK: [[t2BICri:%[0-9]+]]:rgpr = t2BICri killed [[t2ADDri]], 16, 14 /* CC::al */, $noreg, $noreg
				; CHECK: [[t2LSRri:%[0-9]+]]:gprlr = t2LSRri killed [[t2BICri]], 4, 14 /* CC::al */, $noreg, $noreg
				; CHECK: [[t2WhileLoopSetup:%[0-9]+]]:gprlr = t2WhileLoopSetup killed [[t2LSRri]]
				; CHECK: t2WhileLoopStart [[t2WhileLoopSetup]], %bb.2, implicit-def $cpsr
				; CHECK: .1:
				; CHECK: [[PHI:%[0-9]+]]:rgpr = PHI [[COPY1]], %bb.0, %8, %bb.1
				; CHECK: [[PHI1:%[0-9]+]]:rgpr = PHI [[COPY2]], %bb.0, %10, %bb.1
				; CHECK: [[PHI2:%[0-9]+]]:gprlr = PHI [[t2WhileLoopSetup]], %bb.0, %12, %bb.1
				; CHECK: [[PHI3:%[0-9]+]]:rgpr = PHI [[COPY]], %bb.0, %14, %bb.1
				; CHECK: [[MVE_VCTP8_:%[0-9]+]]:vccr = MVE_VCTP8 [[PHI3]], 0, $noreg
				; CHECK: [[t2SUBri:%[0-9]+]]:rgpr = t2SUBri [[PHI3]], 16, 14 /* CC::al */, $noreg, $noreg
				; CHECK: [[MVE_VLDRBU8_post:%[0-9]+]]:rgpr, [[MVE_VLDRBU8_post1:%[0-9]+]]:mqpr = MVE_VLDRBU8_post [[PHI]], 16, 1, [[MVE_VCTP8_]]
				; CHECK: [[MVE_VSTRBU8_post:%[0-9]+]]:rgpr = MVE_VSTRBU8_post killed [[MVE_VLDRBU8_post1]], [[PHI1]], 16, 1, [[MVE_VCTP8_]]
				; CHECK: [[t2LoopDec:%[0-9]+]]:gprlr = t2LoopDec [[PHI2]], 1
				; CHECK: t2LoopEnd [[t2LoopDec]], %bb.1, implicit-def $cpsr
				; CHECK: t2B %bb.2, 14 /* CC::al */, $noreg
				; CHECK: .2.entry:
				; CHECK: tBX_RET 14 /* CC::al */, $noreg
				%2:rgpr = COPY $r2
				%1:rgpr = COPY $r1
				%0:rgpr = COPY $r0
				MVE_MEMCPYLOOPINST %0, %1, %2
				tBX_RET 14 /* CC::al */, $noreg

				...
				---
				name: test2
				tracksRegLiveness: true
				body: |
				; CHECK-LABEL: name: test2
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.1(0x50000000), %bb.2(0x30000000)
				; CHECK: liveins: $r0, $r1, $r2
				; CHECK: [[COPY:%[0-9]+]]:rgpr = COPY $r2
				; CHECK: [[COPY1:%[0-9]+]]:rgpr = COPY $r1
				; CHECK: [[COPY2:%[0-9]+]]:rgpr = COPY $r0
				; CHECK: t2CMPri [[COPY]], 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
				; CHECK: t2Bcc %bb.2, 11 /* CC::lt */, $cpsr
				; CHECK: t2B %bb.1, 14 /* CC::al */, $noreg
				; CHECK: bb.1.for.body.preheader:
				; CHECK: [[t2ADDri:%[0-9]+]]:rgpr = t2ADDri [[COPY]], 15, 14 /* CC::al */, $noreg, $noreg
				; CHECK: [[t2BICri:%[0-9]+]]:rgpr = t2BICri killed [[t2ADDri]], 16, 14 /* CC::al */, $noreg, $noreg
				; CHECK: [[t2LSRri:%[0-9]+]]:gprlr = t2LSRri killed [[t2BICri]], 4, 14 /* CC::al */, $noreg, $noreg
				; CHECK: [[t2WhileLoopSetup:%[0-9]+]]:gprlr = t2WhileLoopSetup killed [[t2LSRri]]
				; CHECK: t2WhileLoopStart [[t2WhileLoopSetup]], %bb.4, implicit-def $cpsr
				; CHECK: bb.3:
				; CHECK: [[PHI:%[0-9]+]]:rgpr = PHI [[COPY1]], %bb.1, %8, %bb.3
				; CHECK: [[PHI1:%[0-9]+]]:rgpr = PHI [[COPY2]], %bb.1, %10, %bb.3
				; CHECK: [[PHI2:%[0-9]+]]:gprlr = PHI [[t2WhileLoopSetup]], %bb.1, %12, %bb.3
				; CHECK: [[PHI3:%[0-9]+]]:rgpr = PHI [[COPY]], %bb.1, %14, %bb.3
				; CHECK: [[MVE_VCTP8_:%[0-9]+]]:vccr = MVE_VCTP8 [[PHI3]], 0, $noreg
				; CHECK: [[t2SUBri:%[0-9]+]]:rgpr = t2SUBri [[PHI3]], 16, 14 /* CC::al */, $noreg, $noreg
				; CHECK: [[MVE_VLDRBU8_post:%[0-9]+]]:rgpr, [[MVE_VLDRBU8_post1:%[0-9]+]]:mqpr = MVE_VLDRBU8_post [[PHI]], 16, 1, [[MVE_VCTP8_]]
				; CHECK: [[MVE_VSTRBU8_post:%[0-9]+]]:rgpr = MVE_VSTRBU8_post killed [[MVE_VLDRBU8_post1]], [[PHI1]], 16, 1, [[MVE_VCTP8_]]
				; CHECK: [[t2LoopDec:%[0-9]+]]:gprlr = t2LoopDec [[PHI2]], 1
				; CHECK: t2LoopEnd [[t2LoopDec]], %bb.3, implicit-def $cpsr
				; CHECK: t2B %bb.4, 14 /* CC::al */, $noreg
				; CHECK: bb.4.for.body.preheader:
				; CHECK: t2B %bb.2, 14 /* CC::al */, $noreg
				; CHECK: bb.2.for.cond.cleanup:
				; CHECK: tBX_RET 14 /* CC::al */, $noreg
				bb.0.entry:
				successors: %bb.1(0x50000000), %bb.2(0x30000000)
				liveins: $r0, $r1, $r2

				%2:rgpr = COPY $r2
				%1:rgpr = COPY $r1
				%0:rgpr = COPY $r0
				t2CMPri %2, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
				t2Bcc %bb.2, 11 /* CC::lt */, $cpsr
				t2B %bb.1, 14 /* CC::al */, $noreg

				bb.1.for.body.preheader:
				successors: %bb.2(0x80000000)

				MVE_MEMCPYLOOPINST %0, %1, %2

				bb.2.for.cond.cleanup:
				tBX_RET 14 /* CC::al */, $noreg

				...
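
As a reading aid for the expanded MIR above, here is a scalar model of what the tail-predicated byte-copy loop is meant to do. It is a sketch under stated assumptions, not code from the patch: `tailPredicatedCopy` is a hypothetical name, the model assumes the intended trip count is the size rounded up to a multiple of 16 and divided by 16, and it expresses the VCTP8 predicate as "only the first min(remaining, 16) lanes are active".

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

// Copies N bytes in chunks of up to 16 and returns the number of vector
// iterations a tail-predicated loop would execute under the assumptions above.
std::size_t tailPredicatedCopy(std::uint8_t *Dst, const std::uint8_t *Src,
                               std::size_t N) {
  constexpr std::size_t Lanes = 16; // one 128-bit Q register holds 16 bytes
  std::size_t Iterations = 0;
  std::size_t Offset = 0;    // models the post-incremented pointers
  std::size_t Remaining = N; // models the element count fed to VCTP8

  // A while-loop start skips the body entirely when the count is zero.
  while (Remaining > 0) {
    // Predicate: lanes at or beyond Remaining are inactive in this iteration.
    const std::size_t Active = std::min(Remaining, Lanes);
    // Predicated load followed by predicated store, one byte per active lane.
    for (std::size_t Lane = 0; Lane < Active; ++Lane)
      Dst[Offset + Lane] = Src[Offset + Lane];
    Offset += Lanes;     // both addresses advance by 16 per iteration
    Remaining -= Active; // the hardware subtracts 16; the effect is the same
    ++Iterations;
  }
  return Iterations;
}
```

The point of the predicate is that no scalar epilogue is needed: a 17-byte copy runs two iterations, the second with a single active lane, and bytes past the requested size are never written.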