This is an archive of the discontinued LLVM Phabricator instance.


Differential D99723

[ARM] Transforming memcpy to Tail predicated Loop
Closed, Public

Authored by malharJ on Apr 1 2021, 6:02 AM.


Details

Reviewers

dmgreen
SjoerdMeijer

Commits

rG9ff38e2d9dd7: [ARM] Transforming memcpy to Tail predicated Loop
rGb856f4a232cb: [ARM] Transforming memcpy to Tail predicated Loop

Summary

This patch converts the llvm.memcpy intrinsic into tail-predicated
hardware loops for targets that support the Arm M-profile
Vector Extension (MVE).

From an implementation point of view, the patch

adds an ARM-specific SDAG node (to which the llvm.memcpy intrinsic is lowered during the first phase of ISel),
adds a corresponding TableGen entry that generates a pseudo instruction, with a custom inserter, on matching the above node, and
adds a custom inserter function that expands the pseudo instruction into MIR suitable to be converted (by later passes) into a WLSTP loop.
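
To make the intended semantics concrete, here is a minimal scalar sketch of what a tail-predicated byte copy computes. It is a model only, not the patch's code: the function name tpMemcpyModel and the constant kLanes are made up for illustration, and the inner loop merely imitates a 16-lane predicate in which lane i is active while i is below the number of bytes still to copy.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: 16 byte lanes, as in a 128-bit vector of i8.
constexpr std::size_t kLanes = 16;

// Copies n bytes from src to dst one "vector" at a time and returns the
// number of loop iterations executed. Names are illustrative assumptions.
std::size_t tpMemcpyModel(std::uint8_t *dst, const std::uint8_t *src,
                          std::size_t n) {
  std::size_t iterations = 0;
  std::size_t offset = 0;     // models the post-incremented pointers
  std::size_t remaining = n;  // models the element count fed to the predicate
  while (remaining > 0) {     // while-loop form: zero trips when n == 0
    // Predicated load/store: only lanes below 'remaining' touch memory.
    for (std::size_t lane = 0; lane < kLanes; ++lane)
      if (lane < remaining)
        dst[offset + lane] = src[offset + lane];
    offset += kLanes;
    remaining -= std::min(remaining, kLanes);
    ++iterations;
  }
  return iterations;
}
```

For a 37-byte copy the model runs three iterations, with only five lanes active in the last one, so no scalar tail loop is needed.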

Diff Detail

Repository: rG LLVM Github Monorepo

Unit Tests: Failed

Time	Test
280 ms	x64 debian > LLVM.CodeGen/Thumb2::mve_tp_loop.ll
250 ms	x64 debian > LLVM.CodeGen/Thumb2/LowOverheadLoops::memcall.ll
2,490 ms	x64 debian > libarcher.races::task-two.c
110 ms	x64 windows > LLVM.CodeGen/Thumb2::mve_tp_loop.ll
160 ms	x64 windows > LLVM.CodeGen/Thumb2/LowOverheadLoops::memcall.ll

Event Timeline

malharJ created this revision. Apr 1 2021, 6:02 AM

Herald added subscribers: hiraditya, kristof.beyls. Apr 1 2021, 6:02 AM

malharJ requested review of this revision. Apr 1 2021, 6:02 AM

Herald added a project: Restricted Project. Apr 1 2021, 6:02 AM

Herald added a subscriber: llvm-commits.

lebedev.ri retitled this revision from Transforming memcpy to Tail predicated Loop to [ARM] Transforming memcpy to Tail predicated Loop. Apr 1 2021, 6:07 AM

Herald added a subscriber: danielkiss. Apr 1 2021, 6:07 AM

I know you've worked on this for a while and investigated different strategies, but I think we also need to argue here why we would like to emit a memcpy loop instead of e.g. having optimised versions in the clib. In other words, is this the best we can do for all different alignments, sizes, etc.?

Harbormaster completed remote builds in B96699: Diff 334666. Apr 1 2021, 7:14 AM

Added some comments to better illustrate transform.
Also renamed some variables to maintain consistency.

dmgreen added inline comments. Apr 1 2021, 8:18 AM

llvm/lib/Target/ARM/ARMISelLowering.cpp
11078	Can you look into all these clang-tidy errors. LLVM usually uses CamelCase for variable names, and the MBB_ looks a little odd. They should start with a capital and I would drop the "t2", that's not adding much.
llvm/lib/Target/ARM/ARMInstrMVE.td
6873	Can you improve this formatting. If you clang-formatted it, it doesn't do well with .td files and manually making it look more like the others will do better.
llvm/test/CodeGen/Thumb2/LowOverheadLoops/memcall.ll
21	Why does this not use r3 directly?
llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
36	What is this generating r3 for? I thought those should be removed.
56	Why is this using printf? It looks like an execution test, not a unit test. Is it testing anything specifically? If so it can probably use any call, not a variadic version of printf.
64	Remove hidden and local_unnamed_addr #0

Harbormaster completed remote builds in B96713: Diff 334684. Apr 1 2021, 8:24 AM

Addressed comments (review comments + clang-tidy and clang-format fixes)

Updated transform to not generate preHeader block due to issues with phi-node-elimination pass placing copy/movs in the generated preHeader.

Details provided in comment below.

malharJ added inline comments. Apr 5 2021, 1:40 AM

llvm/lib/Target/ARM/ARMInstrMVE.td
6873	I had clang-formatted it, but yeah, it doesn't look good. Updated now.
llvm/test/CodeGen/Thumb2/LowOverheadLoops/memcall.ll
21	This seems to be an issue with generating a preHeader during the transform. The phi-node-elimination pass is lowering the phi instructions (in the TP loopBody) as COPY operations (into the preHeader). In this instance, the copy/mov can be seen below on line 32: `mov r7, r3`. I've fixed this issue for now by not generating an extra preHeader during the transform, so the mov ends up above the t2WhileLoopStartLR and overall it seems to work. Please see my comment about the latest changes for more details on this.
llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
36	Good spot, I think this is because DCE was not happening for the instructions calculating iterationCount. I had a quick look at ARMLowOverheadLoops::IterationCountDCE(), and it seems that the expectation from the generated MIR is:
	// $lr = big-itercount-expression
	// ..
	// $lr = t2DoLoopStart/t2WhileLoopStartLR renamable $lr
	// vector.body:
	I've updated the final expression (in the generated MIR) calculating the iteration count to now return the result in LR (earlier it was returned in one of the rGPRs) and these instructions are now getting removed.
56	So the intent of the test was just to check whether code surrounding memcpy call site is properly transformed. I simply used printf to prevent the code from getting optimized away (seems like a poor way now that I think about it). I've removed this test now since the transformation involving nested loops (in memcall.ll) is already testing the mentioned intent.

So I've updated transform to not generate a preHeader block as there seems to be an issue
when generating a preHeader during the transform:

The issue:

The phi-node-elimination pass introduces COPY operations (for each PHI instruction in the TP loopBody) into the preHeader.

While most of them get removed by the simple-register-coalescing pass, one copy in particular is not
removed: the one involving the memcpy transfer size/vector element count. As for
why register coalescing is unable to get rid of this particular copy/mov, I had a look at the
llc --debug output and it seems that it can't remove the mov/copy because the live range of the
element count register intersects with the live range of the target of the copy/mov.

An example of the generated (incorrect) assembly is shown below:

Relevant MIR:

TP Entry
	...
	lr = t2WhileLoopStartLR r4 (r4 may be holding something other than element count)

TP preHeader
	...
	mov r4, r2 (assume r2 holds element count)
	...

TP body
	...
	VCTP r4
	...

Existing logic:

So this value (r4 above) feeds into the loopBody PHI nodes and then the VCTP receives it (which is fine).
But when the ARMLowOverheadLoops pass tries to use the element count operand of the VCTP to feed back to t2WhileLoopStartLR,
it provides r4 (which is incorrect because the mov happens after the t2WhileLoopStartLR).

So I tried to see if I could fix this by looking into LowOverheadLoop::ValidateTailPredicate(),
as it defines the "TPNumElements" variable. There is some logic there that handles the case for
local redefinitions of the elementCount physical register, by moving it forward/backward using ReachingDefAnalysis.
But in this instance, we have a redefinition (the mov) in a different BasicBlock so that code doesn't seem to fix this.

I'm not entirely certain if it's acceptable to not generate the preHeader, but unless there is a reasonably
simple fix for the above issue, I can't see another way.

Harbormaster completed remote builds in B97102: Diff 335218. Apr 5 2021, 2:09 AM

Fixed some more clang-format errors.

Harbormaster completed remote builds in B97106: Diff 335222. Apr 5 2021, 3:28 AM

malharJ edited the summary of this revision. Apr 7 2021, 6:40 PM

I'm a little worried that WLSTP is going to cause problems, with it not being used anywhere else. Let's at least add an option for disabling it if needed.

llvm/lib/Target/ARM/ARMBaseInstrInfo.h
370 ↗	(On Diff #335222)	Please split this out into a separate review. I think it makes sense (I'm pretty sure I remember writing it, it must have been lost in a refactoring).
llvm/lib/Target/ARM/ARMISelLowering.cpp
11282	When will this happen?
11301	-> "for a more natural layout"? I think there may be benefits from getting the order roughly correct at this stage, if we are relying on WLS branches. They can be fixed up later, but if we get them more correct at this point, that can only help.
llvm/lib/Target/ARM/ARMISelLowering.h
336	Don't format any of this - it's unrelated.
llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
141	Can we add an option that turns this inline memcpy on/off. If the option is true, we always use the MEMCPYLOOP, if it's false we never do, and if it's unset we use this default logic. Also consider pulling the if logic into a lambda for readability.
llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
10	Can you make sure there are tests where the i32 is a different type, one that is not legal like an i64.
llvm/test/CodeGen/Thumb2/mve_tp_loop.mir
137	Some of this can be removed, to help keep the test smaller.

addressed some of the review comments:

added a cli option for generation of TP loop for memcpy
simplified the mir test

I'm a little worried that WLSTP is going to cause problems ...

Would it be better to use DLSTP in that case? Or perhaps a command line option
for choosing between DLSTP/WLSTP implementations?

llvm/lib/Target/ARM/ARMBaseInstrInfo.h
370 ↗	(On Diff #335222)	Do you mean just this single line update as a separate review ?
llvm/lib/Target/ARM/ARMISelLowering.cpp
11282	This happens if MVE_MEMCPYLOOPINST is the only instruction in the block: splitAt() returns the same block if there is nothing after the instruction at which the split is done. This occurs when for loops are implicitly converted to memcpys and the memcpy call ends up being the only instruction in the preheader. There is already a test case for this as test2 in llvm/test/CodeGen/Thumb2/mve_tp_loop.mir
llvm/lib/Target/ARM/ARMISelLowering.h
336	I had to fix it because patch was failing on clang-format error.
llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
10	Do you mean something like:
	define void @test(i8* noalias %X, i8* noalias %Y, i64 %n){
	  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %X, i8* align 4 %Y, i64 %n, i1 false)
	  ret void
	}
	I get an error when I try to generate the assembly. Since i64 is illegal, what is the expectation here? As a side note, if I generate the IR from C code, the IR always truncates the memcpy size variable to an i32 before calling llvm.memcpy().

Harbormaster completed remote builds in B98018: Diff 336495. Apr 9 2021, 10:28 AM

dmgreen added inline comments. Apr 12 2021, 6:28 AM

llvm/lib/Target/ARM/ARMBaseInstrInfo.h
370 ↗	(On Diff #335222)	Yeah... it should preferably be a separate patch. Do you have a test case? Some reason why you changed it?
llvm/lib/Target/ARM/ARMISelLowering.cpp
11282	OK. I thought it was more eager about putting branches on the end of blocks, even if they are fallthroughs.
llvm/lib/Target/ARM/ARMISelLowering.h
336	That's fine. We can ignore the precommit bot where it's more noisy than helpful.
llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
10	Yeah sure. I think they can still come up, from other places creating memcpy calls. You can probably use DAG.getZExtOrTrunc(Size, MVT::i32), instead of using the size directly.
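
As a rough illustration of the zero-extend-or-truncate suggestion above, the sketch below models the idea on plain integers. It is an assumption-laden stand-in: the helper name zextOrTruncTo32 is hypothetical, and the real SelectionDAG helper operates on DAG nodes and value types rather than on C++ integers.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical model: bring an unsigned size of any width to 32 bits.
// Narrower inputs are zero-extended; wider inputs lose their high bits.
template <typename T>
constexpr std::uint32_t zextOrTruncTo32(T Size) {
  static_assert(std::is_unsigned<T>::value,
                "this model only covers unsigned size types");
  return static_cast<std::uint32_t>(Size);
}
```

Note that truncation silently discards the upper bits of a 64-bit size, which is acceptable only when the target's address space is 32 bits wide.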

Addressed remaining review comments:

Separated a change into its own patch and added it as a dependency
minor formatting updates
added a testcase with size of type other than i32

malharJ marked an inline comment as done. Apr 13 2021, 3:57 AM

malharJ added inline comments.

llvm/lib/Target/ARM/ARMBaseInstrInfo.h
370 ↗	(On Diff #335222)	Ok, I've now created a separate patch for this: https://reviews.llvm.org/D100376

malharJ added a parent revision: D100376: [ARM] Prevent phi-node-elimination from generating copy above t2WhileLoopStartLR. Apr 13 2021, 4:00 AM

Harbormaster completed remote builds in B98454: Diff 337096. Apr 13 2021, 4:43 AM

dmgreen mentioned this in D100435: [ARM] Transforming memset to Tail predicated Loop. Apr 14 2021, 1:30 AM

dmgreen added inline comments. Apr 14 2021, 1:22 PM

llvm/lib/Target/ARM/ARMISelLowering.cpp
11257	Great comment by the way. Is it possible to make the loop MBB look a bit like a loop? To show there is a backedge too.
llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
140	`[=]` ->`[&]` is more standard, even if it doesn't make a lot of difference here.
142	Probably better as:
	if (DAG.getMachineFunction().getFunction().hasOptNone())
	  return false;
	if (!ConstantSize && Alignment >= Align(4))
	  return true;
	if (...)
	  ...
	The EnableMemcpyTPLoop logic could be in here too, as it's just returning true/false at the right time. What do we do for -Oz and -Os?
llvm/test/CodeGen/Thumb2/mve_tp_loop.ll
2	Shouldn't have -O1 or cpu, use the -mtriple from other similar tests. The test can be called llvm/test/CodeGen/Thumb2/mve-tp-loop.ll.

Addressed review comments:

renamed test files
disabled inline memcpy for optimize size cases (-Os, -Oz) and added tests for the same
also added tests for constant size inputs to ensure the threshold values are tested as well.
minor formatting changes

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
142	Ok. And I guess it would be best to disable in the case of -Os/-Oz in case there are multiple memcpys in the source. I've made the update and added tests as well.

dmgreen added inline comments. Apr 15 2021, 2:44 AM

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
142	func -> Func. Or maybe just F, which is quite common in LLVM.
149	`if (EnableMemcpyTPLoop == cl::BOU_FALSE) return false;` is probably better, if it works as I expect. That keeps the indenting down, and the last if currently isn't in the block it looks like it should be. Oh, and move EnableMemcpyTPLoop above the OptSize/OptNone, in case we want to try and force it. (Even if OptNone doesn't work, using that combo is unlikely to be useful at any rate.)
154	Add a return false at the end?

Addressed review comments:

moved cli option (when set) to be of higher priority than optNone/optSize
minor formatting updates.

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
149	Ok, my bad there with the braces. I've moved the cases where the cli option is set to be of higher priority than the optNone/optSize cases, but the unset case is of lower priority than optNone/optSize, since the user is not passing the cli option in that case. Hopefully that sounds sensible.
154	yep, had missed that out.
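
The priority order discussed in this exchange can be summarised with a small standalone sketch. Everything here is an illustrative assumption rather than the patch's code: TriState stands in for a cl::boolOrDefault-style option, MemcpyContext and shouldUseTPLoop are made-up names, and the constant-size condition that the thread elides as "if (...)" is left out, so the sketch conservatively returns false for it.

```cpp
// Models the three states of a boolOrDefault-style option (illustrative).
enum class TriState { Unset, True, False };

// Illustrative summary of the inputs the heuristic is described as using.
struct MemcpyContext {
  bool HasOptNone;     // function is marked optnone
  bool HasOptSize;     // function is optimised for size (-Os/-Oz)
  bool ConstantSize;   // copy length is a compile-time constant
  unsigned Alignment;  // known alignment in bytes
};

bool shouldUseTPLoop(TriState Option, const MemcpyContext &Ctx) {
  // An explicitly set option wins over every other consideration.
  if (Option == TriState::False)
    return false;
  if (Option == TriState::True)
    return true;
  // Option unset: optnone and size-optimised functions never get the loop.
  if (Ctx.HasOptNone || Ctx.HasOptSize)
    return false;
  // Default heuristic: variable-length copies with at least 4-byte alignment.
  if (!Ctx.ConstantSize && Ctx.Alignment >= 4)
    return true;
  // The constant-size threshold check is elided in the thread; assume "no".
  return false;
}
```

Checking the explicit settings first keeps the function flat and makes the trailing `return false` the single fall-through case, which matches the requests made in the inline comments.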

Harbormaster completed remote builds in B98843: Diff 337669. Apr 15 2021, 3:29 AM

Harbormaster completed remote builds in B98854: Diff 337684. Apr 15 2021, 4:55 AM

Rebased patch and removed the dependency as it has been closed.

malharJ removed parent revisions: D100376: [ARM] Prevent phi-node-elimination from generating copy above t2WhileLoopStartLR, D99649: [ARM] Updates to arm-block-placement pass. Apr 17 2021, 3:50 PM

malharJ edited the summary of this revision.

Harbormaster completed remote builds in B99351: Diff 338345. Apr 17 2021, 4:27 PM

For comparison:
https://github.com/llvm/llvm-project/blob/main/libc/src/string/memory_utils/memcpy_utils.h

malharJ added a child revision: D100435: [ARM] Transforming memset to Tail predicated Loop. Apr 18 2021, 9:27 AM

In D99723#2697318, @tschuett wrote:

For comparison:
https://github.com/llvm/llvm-project/blob/main/libc/src/string/memory_utils/memcpy_utils.h

Yeah thanks, but this is for a different architecture. On M class we have access to MVE tail predicated loops that can be much more efficient for emitting inline memcpys. A-Class Arm with Neon will be very different.

This looks good to me now, with a couple of extra nits.

llvm/lib/Target/ARM/ARMISelLowering.cpp
1817	Remove newline.
llvm/lib/Target/ARM/ARMSelectionDAGInfo.h
19	Is this needed here? Can it be in the cpp file?

This revision is now accepted and ready to land. Apr 19 2021, 1:23 AM

Minor formatting updates.

Herald added a subscriber: tmatheson. Apr 25 2021, 10:34 AM

Harbormaster completed remote builds in B100820: Diff 340370. Apr 25 2021, 11:23 AM

Thanks. Can you rebase and make sure the patch is clang-formatted?

Rebased patch + minor formatting updates.

Harbormaster completed remote builds in B101154: Diff 340828. Apr 27 2021, 7:42 AM

Fix for bug spotted by dmgreen (thank you):

Added an update to ensure that the block containing memcpy pseudo is always
split using splitAt().

An example case where this is important is when updating
phi instructions in successive blocks, which is taken care of by splitAt()
which calls transferSuccessorsAndUpdatePHIs() internally.
A test has been added for the same.

malharJ added inline comments. May 4 2021, 10:11 PM

llvm/test/CodeGen/Thumb2/mve-tp-loop.ll
241–250 ↗	(On Diff #342948)	Not entirely sure why this isn't a TP loop, might need to check ArmLOL pass as to why it's being reverted..

Harbormaster completed remote builds in B102670: Diff 342948. May 4 2021, 10:52 PM

tmatheson removed a subscriber: tmatheson. May 5 2021, 2:06 AM

Thanks. It looks like the arm low overhead loop pass doesn't like that two loops have the same preheader. Which makes sense, I don't like that either.

What do you think about committing this with the flag off for the time being and flipping the switch when we have sorted out some of the problems this is running into? memset especially seems to come up in a lot of cases, and can run into problems with so many low overhead loops together.

Changed cli option for conversion of memcpy to TP loop to be disabled by default.
The disable may be temporary, and will be removed after some more testing.

A custom enum replaces the cl::boolOrDefault to implement the required functionality.

Harbormaster completed remote builds in B102761: Diff 343066. May 5 2021, 9:50 AM

Thanks. LGTM

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp
143	Perhaps use == TPLoop::ForceDisable to make it clear.

This revision was landed with ongoing or failed builds. May 6 2021, 1:39 AM

Closed by commit rGb856f4a232cb: [ARM] Transforming memcpy to Tail predicated Loop (authored by malharJ).

This revision was automatically updated to reflect the committed changes.

malharJ added a commit: rGb856f4a232cb: [ARM] Transforming memcpy to Tail predicated Loop.

Thanks a lot for the review !

malharJ added a reverting change: rGfc690777fce0: Revert "[ARM] Transforming memcpy to Tail predicated Loop". May 6 2021, 4:42 AM

malharJ reopened this revision. May 6 2021, 4:58 AM

This revision is now accepted and ready to land. May 6 2021, 4:58 AM

Fix for the MachineVerifier error seen in the Buildbot failure
https://lab.llvm.org/buildbot/#/builders/16/builds/10462

The failure happens during testing because the NoPHIs property is
set to true by MIRParserImpl::computeFunctionProperties() as there are no PHIs (prior to the transformation),
but PHIs are generated during the transform.
This results in a MachineVerifier error, since the function is labelled
with NoPHIs=true while there are PHI instructions in the code.

This fix resets the property to false during the transform.

Harbormaster completed remote builds in B102960: Diff 343364. May 6 2021, 6:02 AM

Ah yes. Sorry I didn't suggest adding that to the tests - it can be useful.

Setting NoPHIs seems a bit odd. It's a side effect of the mir test having no PHIs when it's loaded but PHIs being added here. I don't have a better suggestion for fixing it though, other than adding existing PHIs to the mir test, which needlessly complicates it.

This sounds like a good fix to me.

This revision was landed with ongoing or failed builds. May 6 2021, 3:26 PM

Closed by commit rG9ff38e2d9dd7: [ARM] Transforming memcpy to Tail predicated Loop (authored by malharJ).

This revision was automatically updated to reflect the committed changes.

malharJ added a commit: rG9ff38e2d9dd7: [ARM] Transforming memcpy to Tail predicated Loop.

Revision Contents

Path	Size
llvm/lib/Target/ARM/ARMISelLowering.h	4 lines
llvm/lib/Target/ARM/ARMISelLowering.cpp	213 lines
llvm/lib/Target/ARM/ARMInstrMVE.td	12 lines
llvm/lib/Target/ARM/ARMSelectionDAGInfo.h	1 line
llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp	25 lines
llvm/lib/Target/ARM/ARMSubtarget.h	5 lines
llvm/test/CodeGen/Thumb2/LowOverheadLoops/memcall.ll	49 lines
llvm/test/CodeGen/Thumb2/mve_tp_loop.ll	139 lines
llvm/test/CodeGen/Thumb2/mve_tp_loop.mir	131 lines

Diff 337096

llvm/lib/Target/ARM/ARMISelLowering.h

(46 lines not shown)
class TargetLibraryInfo;
class TargetMachine;
class TargetRegisterInfo;
class VectorType;

namespace ARMISD {

// ARM Specific DAG Nodes
enum NodeType : unsigned {
// Start the numbering where the builtin ops and target ops leave off.
FIRST_NUMBER = ISD::BUILTIN_OP_END,

Wrapper, // Wrapper - A wrapper node for TargetConstantPool,
// TargetExternalSymbol, and TargetGlobalAddress.
WrapperPIC, // WrapperPIC - A wrapper node for TargetGlobalAddress in
// PIC mode.
WrapperJT, // WrapperJT - A wrapper node for TargetJumpTable
(224 lines not shown, inside enum NodeType)

// Pseudo vector bitwise select
VBSP,

// Pseudo-instruction representing a memory copy using ldm/stm
// instructions.
MEMCPY,

		// Pseudo-instruction representing a memory copy using a tail predicated
		// loop
		MEMCPYLOOP,

// V8.1MMainline condition select
CSINV, // Conditional select invert.
CSNEG, // Conditional select negate.
CSINC, // Conditional select increment.

// Vector load N-element structure to all lanes:
VLD1DUP = ISD::FIRST_TARGET_MEMORY_OPCODE,
VLD2DUP,
(20 lines not shown, inside enum NodeType)
VST4_UPD,
VST2LN_UPD,
VST3LN_UPD,
VST4LN_UPD,

// Load/Store of dual registers
LDRD,
STRD
};
		dmgreen: Don't format any of this - it's unrelated.
		malharJ: I had to fix it because patch was failing on clang-format error.
		dmgreen: That's fine. We can ignore the precommit bot where it's more noisy than helpful.

} // end namespace ARMISD

namespace ARM {
/// Possible values of current rounding mode, which is specified in bits
/// 23:22 of FPSCR.
enum Rounding {
RN = 0, // Round to Nearest
(614 lines not shown)

llvm/lib/Target/ARM/ARMISelLowering.cpp

(1,807 lines not shown; resumes inside ARMTargetLowering::getTargetNodeName)
case ARMISD::VST4LN_UPD: return "ARMISD::VST4LN_UPD";
case ARMISD::WLS: return "ARMISD::WLS";
case ARMISD::WLSSETUP: return "ARMISD::WLSSETUP";
case ARMISD::LE: return "ARMISD::LE";
case ARMISD::LOOP_DEC: return "ARMISD::LOOP_DEC";
case ARMISD::CSINV: return "ARMISD::CSINV";
case ARMISD::CSNEG: return "ARMISD::CSNEG";
case ARMISD::CSINC: return "ARMISD::CSINC";
		case ARMISD::MEMCPYLOOP:
		return "ARMISD::MEMCPYLOOP";
		dmgreen: Remove newline.
}
return nullptr;
}

EVT ARMTargetLowering::getSetCCResultType(const DataLayout &DL, LLVMContext &,
EVT VT) const {
if (!VT.isVector())
return getPointerTy(DL);
(9,242 lines not shown; resumes inside checkAndUpdateCPSRKill)
}

// We found a def, or hit the end of the basic block and CPSR wasn't live
// out. SelectMI should have a kill flag on CPSR.
SelectItr->addRegisterKilled(ARM::CPSR, TRI);
return true;
}

		/// Adds logic in loop entry MBB to calculate loop iteration count and adds
		/// t2WhileLoopSetup and t2WhileLoopStart to generate WLS loop
		static Register genTPEntry(MachineBasicBlock *TpEntry,
		MachineBasicBlock *TpLoopBody,
		MachineBasicBlock *TpExit, Register OpSizeReg,
		const TargetInstrInfo *TII, DebugLoc Dl,
		MachineRegisterInfo &MRI) {
		dmgreen: Can you look into all these clang-tidy errors. LLVM usually uses CamelCase for variable names, and the MBB_ looks a little odd. They should start with a capital and I would drop the "t2", that's not adding much.

		// Calculates loop iteration count = ceil(n/16) = ((n + 15) & (-16)) / 16.
		Register AddDestReg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
		BuildMI(TpEntry, Dl, TII->get(ARM::t2ADDri), AddDestReg)
		.addUse(OpSizeReg)
		.addImm(15)
		.add(predOps(ARMCC::AL))
		.addReg(0);

		Register BicDestReg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
		BuildMI(TpEntry, Dl, TII->get(ARM::t2BICri), BicDestReg)
		.addUse(AddDestReg, RegState::Kill)
		.addImm(16)
		.add(predOps(ARMCC::AL))
		.addReg(0);

		Register LsrDestReg = MRI.createVirtualRegister(&ARM::GPRlrRegClass);
		BuildMI(TpEntry, Dl, TII->get(ARM::t2LSRri), LsrDestReg)
		.addUse(BicDestReg, RegState::Kill)
		.addImm(4)
		.add(predOps(ARMCC::AL))
		.addReg(0);

		Register TotalIterationsReg = MRI.createVirtualRegister(&ARM::GPRlrRegClass);
		BuildMI(TpEntry, Dl, TII->get(ARM::t2WhileLoopSetup), TotalIterationsReg)
		.addUse(LsrDestReg, RegState::Kill);

		BuildMI(TpEntry, Dl, TII->get(ARM::t2WhileLoopStart))
		.addUse(TotalIterationsReg)
		.addMBB(TpExit);

		return TotalIterationsReg;
		}
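
As a side note on the arithmetic in genTPEntry, the sketch below checks on plain integers that rounding up to a multiple of 16 and dividing by 16 agrees with ceil(n/16), and that a single add-and-shift gives the same result. The function names are hypothetical and chosen only for this illustration. It assumes a bit-clear computes Rn AND NOT imm, so the mask -16 in the comment corresponds to clearing with the immediate 15; clearing with 16 would remove only bit 4.

```cpp
#include <cstdint>

// Reference definition: number of 16-byte chunks needed to cover n bytes.
constexpr std::uint32_t ceilDiv16(std::uint32_t n) {
  return n / 16u + (n % 16u != 0u ? 1u : 0u);
}

// Add 15, clear the low four bits (mask -16 == ~15), then divide by 16.
constexpr std::uint32_t iterationsByMask(std::uint32_t n) {
  return ((n + 15u) & ~15u) / 16u;
}

// The masking step is redundant before a right shift by 4.
constexpr std::uint32_t iterationsByShift(std::uint32_t n) {
  return (n + 15u) >> 4;
}

// What clearing only bit 4 (immediate 16) would compute instead.
constexpr std::uint32_t iterationsClearingBit4(std::uint32_t n) {
  return ((n + 15u) & ~16u) >> 4;
}

static_assert(iterationsByMask(37) == 3, "37 bytes need three chunks");
static_assert(iterationsByShift(16) == 1, "16 bytes need one chunk");
static_assert(iterationsClearingBit4(16) == 0, "bit-4 clear undercounts");
```

All three forms wrap around for n above UINT32_MAX - 15, so the equivalence with ceil(n/16) holds only below that bound.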

		/// Adds logic in the loopBody MBB to generate MVE_VCTP, t2LoopDec and
		/// t2LoopEnd. These are used by later passes to generate tail predicated
		/// loops.
		static void genTPLoopBody(MachineBasicBlock *TpLoopBody,
		MachineBasicBlock *TpEntry, MachineBasicBlock *TpExit,
		const TargetInstrInfo *TII, DebugLoc Dl,
		MachineRegisterInfo &MRI, Register OpSrcReg,
		Register OpDestReg, Register ElementCountReg,
		Register TotalIterationsReg) {

		// First insert 4 PHI nodes for: Current pointer to Src, Dest array, loop
		// iteration counter, predication counter.
		// Current position in the src array
		Register SrcPhiReg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
		Register CurrSrcReg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
		BuildMI(TpLoopBody, Dl, TII->get(ARM::PHI), SrcPhiReg)
		.addUse(OpSrcReg)
		.addMBB(TpEntry)
		.addUse(CurrSrcReg)
		.addMBB(TpLoopBody);

		// Current position in the dest array
		Register DestPhiReg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
		Register CurrDestReg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
		BuildMI(TpLoopBody, Dl, TII->get(ARM::PHI), DestPhiReg)
		.addUse(OpDestReg)
		.addMBB(TpEntry)
		.addUse(CurrDestReg)
		.addMBB(TpLoopBody);

		// Current loop counter
		Register LoopCounterPhiReg = MRI.createVirtualRegister(&ARM::GPRlrRegClass);
		Register RemainingLoopIterationsReg =
		MRI.createVirtualRegister(&ARM::GPRlrRegClass);
		BuildMI(TpLoopBody, Dl, TII->get(ARM::PHI), LoopCounterPhiReg)
		.addUse(TotalIterationsReg)
		.addMBB(TpEntry)
		.addUse(RemainingLoopIterationsReg)
		.addMBB(TpLoopBody);

		// Predication counter
		Register PredCounterPhiReg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
		Register RemainingElementsReg = MRI.createVirtualRegister(&ARM::rGPRRegClass);
		BuildMI(TpLoopBody, Dl, TII->get(ARM::PHI), PredCounterPhiReg)
		.addUse(ElementCountReg)
		.addMBB(TpEntry)
		.addUse(RemainingElementsReg)
		.addMBB(TpLoopBody);

		// Pass predication counter to VCTP
		Register VccrReg = MRI.createVirtualRegister(&ARM::VCCRRegClass);
		BuildMI(TpLoopBody, Dl, TII->get(ARM::MVE_VCTP8), VccrReg)
		.addUse(PredCounterPhiReg)
		.addImm(ARMVCC::None)
		.addReg(0);

		BuildMI(TpLoopBody, Dl, TII->get(ARM::t2SUBri), RemainingElementsReg)
		.addUse(PredCounterPhiReg)
		.addImm(16)
		.add(predOps(ARMCC::AL))
		.addReg(0);

		// VLDRB and VSTRB instructions, predicated using VPR
		Register LoadedValueReg = MRI.createVirtualRegister(&ARM::MQPRRegClass);
		BuildMI(TpLoopBody, Dl, TII->get(ARM::MVE_VLDRBU8_post))
		.addDef(CurrSrcReg)
		.addDef(LoadedValueReg)
		.addReg(SrcPhiReg)
		.addImm(16)
		.addImm(ARMVCC::Then)
		.addUse(VccrReg);

		BuildMI(TpLoopBody, Dl, TII->get(ARM::MVE_VSTRBU8_post))
		.addDef(CurrDestReg)
		.addUse(LoadedValueReg, RegState::Kill)
		.addReg(DestPhiReg)
		.addImm(16)
		.addImm(ARMVCC::Then)
		.addUse(VccrReg);

		// Add the pseudoInstrs for decrementing the loop counter and marking the
		// end: t2LoopDec and t2LoopEnd
		BuildMI(TpLoopBody, Dl, TII->get(ARM::t2LoopDec), RemainingLoopIterationsReg)
		.addUse(LoopCounterPhiReg)
		.addImm(1);

		BuildMI(TpLoopBody, Dl, TII->get(ARM::t2LoopEnd))
		.addUse(RemainingLoopIterationsReg)
		.addMBB(TpLoopBody);

		BuildMI(TpLoopBody, Dl, TII->get(ARM::t2B))
		.addMBB(TpExit)
		.add(predOps(ARMCC::AL));
		}

MachineBasicBlock *
ARMTargetLowering::EmitInstrWithCustomInserter(MachineInstr &MI,
MachineBasicBlock *BB) const {
const TargetInstrInfo *TII = Subtarget->getInstrInfo();
DebugLoc dl = MI.getDebugLoc();
bool isThumb2 = Subtarget->isThumb2();
switch (MI.getOpcode()) {
default: {
(10 lines not shown; resumes inside a BuildMI(*BB, MI, dl, TII->get(ARM::tLDMIA_UPD)) chain)
.add(MI.getOperand(3)) // PredImm
.add(MI.getOperand(4)) // PredReg
.add(MI.getOperand(0)) // Rt
.cloneMemRefs(MI);
MI.eraseFromParent();
return BB;
}

		case ARM::MVE_MEMCPYLOOPINST: {

		// The transformation below expands the MVE_MEMCPYLOOPINST pseudo instruction
		// into a Tail Predicated (TP) loop. It adds the instructions to calculate
		// the iteration count (= ceil(size_in_bytes / 16)) in the TP entry block and
		// adds the relevant instructions in the TP loop body for generation of a
		// WLSTP loop.

		// Below is relevant portion of the CFG after the transformation.
		// The Machine Basic Blocks are shown along with branch conditions (in
		// brackets). Note that TP entry/exit MBBs depict the entry/exit of this
		// portion of the CFG and may not necessarily be the entry/exit of the
		// function.

		// (Relevant) CFG after transformation:
		//               TP entry MBB
		//                    |
		//          |-----------------|
		//       (n <= 0)          (n > 0)
		//          |                 |
		//          |         TP loop Body MBB
		dmgreen: Great comment by the way. Is it possible to make the loop MBB look a bit like a loop? To show there is a backedge too.
		//           \                |
		//            \              /
		//              TP exit MBB

		MachineFunction *MF = BB->getParent();
		MachineRegisterInfo &MRI = MF->getRegInfo();

		Register OpDestReg = MI.getOperand(0).getReg();
		Register OpSrcReg = MI.getOperand(1).getReg();
		Register OpSizeReg = MI.getOperand(2).getReg();

		// Allocate the required MBBs and add to parent function.
		MachineBasicBlock *TpEntry = BB;
		MachineBasicBlock *TpLoopBody = MF->CreateMachineBasicBlock();
		MachineBasicBlock *TpExit;

		MF->push_back(TpLoopBody);

		// If any instructions are present in the current block after
		// MVE_MEMCPYLOOPINST, move them into the exit block. This is required since
		// a terminator(t2WhileLoopStart) will be placed at that site. If no
		// instructions are present after MVE_MEMCPYLOOPINST, then fallthrough is
		// the exit.
		TpExit = BB->splitAt(MI, false);
		if (TpExit == BB) {
		dmgreen: When will this happen?
		malharJ: This happens if MVE_MEMCPYLOOPINST is the only instruction in the block. splitAt() returns the same block if there is nothing after the instruction at which the split is done. This happens when for loops are implicitly converted to memcpys: the memcpy call ends up being the only instruction in the preheader. There is already a test case for this as test2 in llvm/test/CodeGen/Thumb2/mve_tp_loop.mir
		dmgreen: OK. I thought it was more eager about putting branches on the end of blocks, even if they are fallthroughs.
		assert(BB->canFallThrough() &&
		"exit Block must be Fallthrough of the block containing memcpy");
		TpExit = BB->getFallThrough();
		}

		// Add logic for iteration count
		Register TotalIterationsReg =
		genTPEntry(TpEntry, TpLoopBody, TpExit, OpSizeReg, TII, dl, MRI);

		// Add the vectorized (and predicated) loads/store instructions
		genTPLoopBody(TpLoopBody, TpEntry, TpExit, TII, dl, MRI, OpSrcReg,
		OpDestReg, OpSizeReg, TotalIterationsReg);

		// Connect the blocks
		TpEntry->addSuccessor(TpLoopBody);
		TpLoopBody->addSuccessor(TpLoopBody);
		TpLoopBody->addSuccessor(TpExit);

		// Reorder for a more natural layout
		dmgreen: -> "for a more natural layout"? I think there may be benefits from getting the order roughly correct at this stage, if we are relying on WLS branches. They can be fixed up later, but if we get them more correct at this point, that can only help.
		TpLoopBody->moveAfter(TpEntry);
		TpExit->moveAfter(TpLoopBody);

		// Finally, remove the memcpy pseudo instruction
		MI.eraseFromParent();

		// Return the exit block as it may contain other instructions requiring a
		// custom inserter
		return TpExit;
		}
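
The loop described in the comment above can be modelled in scalar C++. This is only a sketch under stated assumptions, not code from the patch: `tpIterationCount` and `tpMemcpyModel` are hypothetical names, the 16-byte step mirrors the immediate used with VLDRB/VSTRB, and the predicate is modelled as "copy only min(remaining, 16) bytes" to mirror what VCTP8 is used for here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of the patch): number of loop iterations,
// ceil(SizeInBytes / 16), computed in 64 bits so the +15 cannot wrap.
inline uint32_t tpIterationCount(uint32_t SizeInBytes) {
  return static_cast<uint32_t>((static_cast<uint64_t>(SizeInBytes) + 15) >> 4);
}

// Hypothetical scalar model of the tail-predicated loop body.
// Returns the number of iterations executed.
inline uint32_t tpMemcpyModel(uint8_t *Dst, const uint8_t *Src, uint32_t N) {
  assert((N == 0 || (Dst && Src)) && "non-empty copy needs valid pointers");
  uint32_t Remaining = N; // element counter that feeds the predicate
  size_t Offset = 0;      // models the post-incremented src/dst pointers
  uint32_t Iterations = 0;
  for (uint32_t LoopCount = tpIterationCount(N); LoopCount != 0; --LoopCount) {
    // The predicate enables min(Remaining, 16) byte lanes.
    uint32_t ActiveLanes = std::min<uint32_t>(Remaining, 16);
    // Predicated load/store: only the active lanes are copied.
    std::memcpy(Dst + Offset, Src + Offset, ActiveLanes);
    Offset += 16;
    // Simplification: the generated MIR subtracts 16 unconditionally; the
    // model subtracts the active lane count to avoid unsigned wrap-around.
    Remaining -= ActiveLanes;
    ++Iterations;
  }
  return Iterations;
}
```

With this model a 37-byte copy runs three iterations (16 + 16 + 5 bytes) and never writes past byte 36, which is the property that lets the loop handle sizes that are not a multiple of 16 without a scalar tail.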

// The Thumb2 pre-indexed stores have the same MI operands, they just		// The Thumb2 pre-indexed stores have the same MI operands, they just
// define them differently in the .td files from the isel patterns, so		// define them differently in the .td files from the isel patterns, so
// they need pseudos.		// they need pseudos.
case ARM::t2STR_preidx:		case ARM::t2STR_preidx:
MI.setDesc(TII->get(ARM::t2STR_PRE));		MI.setDesc(TII->get(ARM::t2STR_PRE));
return BB;		return BB;
case ARM::t2STRB_preidx:		case ARM::t2STRB_preidx:
MI.setDesc(TII->get(ARM::t2STRB_PRE));		MI.setDesc(TII->get(ARM::t2STRB_PRE));

llvm/lib/Target/ARM/ARMInstrMVE.td

Show First 20 Lines • Show All 6,858 Lines • ▼ Show 20 Lines	class MVE_WLSTP<string asm, bits<2> size>
bits<11> label;		bits<11> label;
let Inst{13} = 0b0;		let Inst{13} = 0b0;
let Inst{11} = label{0};		let Inst{11} = label{0};
let Inst{10-1} = label{10-1};		let Inst{10-1} = label{10-1};
let isBranch = 1;		let isBranch = 1;
let isTerminator = 1;		let isTerminator = 1;
}		}

		def SDT_MVEMEMCPYLOOPNODE
		: SDTypeProfile<0, 3, [SDTCisPtrTy<0>, SDTCisPtrTy<1>, SDTCisVT<2, i32>]>;
		def MVE_MEMCPYLOOPNODE : SDNode<"ARMISD::MEMCPYLOOP", SDT_MVEMEMCPYLOOPNODE,
		[SDNPHasChain, SDNPMayStore, SDNPMayLoad]>;

		let usesCustomInserter = 1, hasNoSchedulingInfo = 1 in {
		def MVE_MEMCPYLOOPINST : PseudoInst<(outs),
		dmgreen: Can you improve this formatting? If you clang-formatted it, it doesn't do well with .td files and manually making it look more like the others will do better.
		malharJ: Had clang-formatted it, but yeah, it doesn't look good. Updated now.
		(ins rGPR:$dst, rGPR:$src, rGPR:$sz),
		NoItinerary,
		[(MVE_MEMCPYLOOPNODE rGPR:$dst, rGPR:$src, rGPR:$sz)]>;
		}

def MVE_DLSTP_8 : MVE_DLSTP<"dlstp.8", 0b00>;		def MVE_DLSTP_8 : MVE_DLSTP<"dlstp.8", 0b00>;
def MVE_DLSTP_16 : MVE_DLSTP<"dlstp.16", 0b01>;		def MVE_DLSTP_16 : MVE_DLSTP<"dlstp.16", 0b01>;
def MVE_DLSTP_32 : MVE_DLSTP<"dlstp.32", 0b10>;		def MVE_DLSTP_32 : MVE_DLSTP<"dlstp.32", 0b10>;
def MVE_DLSTP_64 : MVE_DLSTP<"dlstp.64", 0b11>;		def MVE_DLSTP_64 : MVE_DLSTP<"dlstp.64", 0b11>;

def MVE_WLSTP_8 : MVE_WLSTP<"wlstp.8", 0b00>;		def MVE_WLSTP_8 : MVE_WLSTP<"wlstp.8", 0b00>;
def MVE_WLSTP_16 : MVE_WLSTP<"wlstp.16", 0b01>;		def MVE_WLSTP_16 : MVE_WLSTP<"wlstp.16", 0b01>;
def MVE_WLSTP_32 : MVE_WLSTP<"wlstp.32", 0b10>;		def MVE_WLSTP_32 : MVE_WLSTP<"wlstp.32", 0b10>;

llvm/lib/Target/ARM/ARMSelectionDAGInfo.h

	//===----------------------------------------------------------------------===//			//===----------------------------------------------------------------------===//

	#ifndef LLVM_LIB_TARGET_ARM_ARMSELECTIONDAGINFO_H			#ifndef LLVM_LIB_TARGET_ARM_ARMSELECTIONDAGINFO_H
	#define LLVM_LIB_TARGET_ARM_ARMSELECTIONDAGINFO_H			#define LLVM_LIB_TARGET_ARM_ARMSELECTIONDAGINFO_H

	#include "MCTargetDesc/ARMAddressingModes.h"			#include "MCTargetDesc/ARMAddressingModes.h"
	#include "llvm/CodeGen/RuntimeLibcalls.h"			#include "llvm/CodeGen/RuntimeLibcalls.h"
	#include "llvm/CodeGen/SelectionDAGTargetInfo.h"			#include "llvm/CodeGen/SelectionDAGTargetInfo.h"
				#include "llvm/Support/CommandLine.h"
				dmgreen: Is this needed here? Can it be in the cpp file?

	namespace llvm {			namespace llvm {

	namespace ARM_AM {			namespace ARM_AM {
	static inline ShiftOpc getShiftOpcForNode(unsigned Opcode) {			static inline ShiftOpc getShiftOpcForNode(unsigned Opcode) {
	switch (Opcode) {			switch (Opcode) {
	default: return ARM_AM::no_shift;			default: return ARM_AM::no_shift;
	case ISD::SHL: return ARM_AM::lsl;			case ISD::SHL: return ARM_AM::lsl;

llvm/lib/Target/ARM/ARMSelectionDAGInfo.cpp


	#include "ARMTargetMachine.h"			#include "ARMTargetMachine.h"
	#include "llvm/CodeGen/SelectionDAG.h"			#include "llvm/CodeGen/SelectionDAG.h"
	#include "llvm/IR/DerivedTypes.h"			#include "llvm/IR/DerivedTypes.h"
	using namespace llvm;			using namespace llvm;

	#define DEBUG_TYPE "arm-selectiondag-info"			#define DEBUG_TYPE "arm-selectiondag-info"

				static cl::opt<cl::boolOrDefault>
				EnableMemcpyTPLoop("arm-memcpy-tploop", cl::Hidden,
				cl::desc("Enable/disable conversion of llvm.memcpy to "
				"Tail predicated loops (WLSTP)"));

	// Emit, if possible, a specialized version of the given Libcall. Typically this			// Emit, if possible, a specialized version of the given Libcall. Typically this
	// means selecting the appropriately aligned version, but we also convert memset			// means selecting the appropriately aligned version, but we also convert memset
	// of 0 into memclr.			// of 0 into memclr.
	SDValue ARMSelectionDAGInfo::EmitSpecializedLibcall(			SDValue ARMSelectionDAGInfo::EmitSpecializedLibcall(
	SelectionDAG &DAG, const SDLoc &dl, SDValue Chain, SDValue Dst, SDValue Src,			SelectionDAG &DAG, const SDLoc &dl, SDValue Chain, SDValue Dst, SDValue Src,
	SDValue Size, unsigned Align, RTLIB::Libcall LC) const {			SDValue Size, unsigned Align, RTLIB::Libcall LC) const {
	const ARMSubtarget &Subtarget =			const ARMSubtarget &Subtarget =
	DAG.getMachineFunction().getSubtarget<ARMSubtarget>();			DAG.getMachineFunction().getSubtarget<ARMSubtarget>();
	}			}

	SDValue ARMSelectionDAGInfo::EmitTargetCodeForMemcpy(			SDValue ARMSelectionDAGInfo::EmitTargetCodeForMemcpy(
	SelectionDAG &DAG, const SDLoc &dl, SDValue Chain, SDValue Dst, SDValue Src,			SelectionDAG &DAG, const SDLoc &dl, SDValue Chain, SDValue Dst, SDValue Src,
	SDValue Size, Align Alignment, bool isVolatile, bool AlwaysInline,			SDValue Size, Align Alignment, bool isVolatile, bool AlwaysInline,
	MachinePointerInfo DstPtrInfo, MachinePointerInfo SrcPtrInfo) const {			MachinePointerInfo DstPtrInfo, MachinePointerInfo SrcPtrInfo) const {
	const ARMSubtarget &Subtarget =			const ARMSubtarget &Subtarget =
	DAG.getMachineFunction().getSubtarget<ARMSubtarget>();			DAG.getMachineFunction().getSubtarget<ARMSubtarget>();
				ConstantSDNode *ConstantSize = dyn_cast<ConstantSDNode>(Size);

				auto GenInlineTP = [=](const ARMSubtarget &Subtarget,
				dmgreen: `[=]` -> `[&]` is more standard, even if it doesn't make a lot of difference here.
				const SelectionDAG &DAG) {
				dmgreen: Can we add an option that turns this inline memcpy on/off? If the option is true, we always use the MEMCPYLOOP, if it's false we never do, and if it's unset we use this default logic. Also consider pulling the if logic into a lambda for readability.
				return !DAG.getMachineFunction().getFunction().hasOptNone() &&
				dmgreen: Probably better as: `if (DAG.getMachineFunction().getFunction().hasOptNone()) return false; if (!ConstantSize && Alignment >= Align(4)) return true; if (...) ...` The EnableMemcpyTPLoop logic could be in here too, as it's just returning true/false at the right time. What do we do for -Oz and -Os?
				malharJ: Ok. And I guess it would be best to disable in the case of -Os/-Oz in case there are multiple memcpys in the source. I've made the update and added tests as well.
				dmgreen: func -> Func. Or maybe just F, which is quite common in LLVM.
				((!ConstantSize && (Alignment >= Align(4))) ||
				dmgreen: Perhaps use == TPLoop::ForceDisable to make it clear.
				(ConstantSize &&
				ConstantSize->getZExtValue() >
				Subtarget.getMaxInlineSizeThreshold() &&
				ConstantSize->getZExtValue() <
				Subtarget.getMaxTPLoopInlineSizeThreshold()));
				};
				dmgreen: `if (EnableMemcpyTPLoop == cl::BOU_FALSE) return false;` is probably better, if it works as I expect. That keeps the indenting down, and the last if currently isn't in the block it looks like it should be. Oh, and move EnableMemcpyTPLoop above the OptSize/OptNone, in case we want to try and force it. (Even if OptNone doesn't work, using that combo is unlikely to be useful at any rate.)
				malharJ: Ok, my bad there with the braces. I've moved the cases when the cli option is set to be of higher priority than the optNone/optSize cases, but the unset case is of lower priority than the optNone/optSize cases since the user is no longer passing the cli option. Hopefully that sounds sensible.

				if (Subtarget.hasMVEIntegerOps())
				if ((EnableMemcpyTPLoop == cl::BOU_TRUE) ||
				(EnableMemcpyTPLoop == cl::BOU_UNSET && GenInlineTP(Subtarget, DAG)))
				return DAG.getNode(ARMISD::MEMCPYLOOP, dl, MVT::Other, Chain, Dst, Src,
				dmgreen: Add a return false at the end?
				malharJ: Yep, had missed that out.
				DAG.getZExtOrTrunc(Size, dl, MVT::i32));

	// Do repeated 4-byte loads and stores. To be improved.			// Do repeated 4-byte loads and stores. To be improved.
	// This requires 4-byte alignment.			// This requires 4-byte alignment.
	if (Alignment < Align(4))			if (Alignment < Align(4))
	return SDValue();			return SDValue();
	// This requires the copy size to be a constant, preferably			// This requires the copy size to be a constant, preferably
	// within a subtarget-specific limit.			// within a subtarget-specific limit.
	ConstantSDNode *ConstantSize = dyn_cast<ConstantSDNode>(Size);
	if (!ConstantSize)			if (!ConstantSize)
	return EmitSpecializedLibcall(DAG, dl, Chain, Dst, Src, Size,			return EmitSpecializedLibcall(DAG, dl, Chain, Dst, Src, Size,
	Alignment.value(), RTLIB::MEMCPY);			Alignment.value(), RTLIB::MEMCPY);
	uint64_t SizeVal = ConstantSize->getZExtValue();			uint64_t SizeVal = ConstantSize->getZExtValue();
	if (!AlwaysInline && SizeVal > Subtarget.getMaxInlineSizeThreshold())			if (!AlwaysInline && SizeVal > Subtarget.getMaxInlineSizeThreshold())
	return EmitSpecializedLibcall(DAG, dl, Chain, Dst, Src, Size,			return EmitSpecializedLibcall(DAG, dl, Chain, Dst, Src, Size,
	Alignment.value(), RTLIB::MEMCPY);			Alignment.value(), RTLIB::MEMCPY);
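
The selection logic in this revision of the diff (the tri-state `arm-memcpy-tploop` option, the `hasOptNone()` check, and the 64/128-byte thresholds from ARMSubtarget.h) can be summarised as a standalone predicate. This is a sketch for discussion only: `BoolOrDefault`, `MemcpyQuery` and `shouldEmitTPLoop` are hypothetical stand-ins, not LLVM types, and the ordering follows the code as shown here rather than any later revision.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for cl::boolOrDefault.
enum class BoolOrDefault { Unset, True, False };

// Hypothetical summary of the inputs the lowering looks at.
struct MemcpyQuery {
  bool HasMVEIntegerOps; // Subtarget.hasMVEIntegerOps()
  bool HasOptNone;       // function carries the optnone attribute
  bool SizeIsConstant;   // memcpy size is a compile-time constant
  uint64_t Size;         // only meaningful when SizeIsConstant is true
  uint64_t Alignment;    // in bytes, must be a power of two
};

// Values returned by getMaxInlineSizeThreshold() and
// getMaxTPLoopInlineSizeThreshold() in this diff.
constexpr uint64_t MaxInlineSizeThreshold = 64;
constexpr uint64_t MaxTPLoopInlineSizeThreshold = 128;

inline bool shouldEmitTPLoop(const MemcpyQuery &Q, BoolOrDefault Option) {
  assert(Q.Alignment != 0 && (Q.Alignment & (Q.Alignment - 1)) == 0 &&
         "alignment must be a power of two");
  if (!Q.HasMVEIntegerOps)
    return false;
  if (Option == BoolOrDefault::True)  // forced on
    return true;
  if (Option == BoolOrDefault::False) // forced off
    return false;
  // Option unset: fall back to the default heuristic.
  if (Q.HasOptNone)
    return false;
  if (!Q.SizeIsConstant)
    return Q.Alignment >= 4;
  return Q.Size > MaxInlineSizeThreshold &&
         Q.Size < MaxTPLoopInlineSizeThreshold;
}
```

Written as early returns, the priority order discussed in the review is easy to read: an explicit option value wins, and only the unset case consults the heuristic.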


llvm/lib/Target/ARM/ARMSubtarget.h

Show First 20 Lines • Show All 532 Lines • ▼ Show 20 Lines	ARMSubtarget(const Triple &TT, const std::string &CPU, const std::string &FS,
bool MinSize = false);		bool MinSize = false);

/// getMaxInlineSizeThreshold - Returns the maximum memset / memcpy size		/// getMaxInlineSizeThreshold - Returns the maximum memset / memcpy size
/// that still makes it profitable to inline the call.		/// that still makes it profitable to inline the call.
unsigned getMaxInlineSizeThreshold() const {		unsigned getMaxInlineSizeThreshold() const {
return 64;		return 64;
}		}

		/// getMaxTPLoopInlineSizeThreshold - Returns the maximum memcpy size
		/// that still makes it profitable to inline the call as a Tail
		/// Predicated loop.
		unsigned getMaxTPLoopInlineSizeThreshold() const { return 128; }

/// ParseSubtargetFeatures - Parses features string setting specified		/// ParseSubtargetFeatures - Parses features string setting specified
/// subtarget options. Definition of function is auto generated by tblgen.		/// subtarget options. Definition of function is auto generated by tblgen.
void ParseSubtargetFeatures(StringRef CPU, StringRef TuneCPU, StringRef FS);		void ParseSubtargetFeatures(StringRef CPU, StringRef TuneCPU, StringRef FS);

/// initializeSubtargetDependencies - Initializes using a CPU and feature string		/// initializeSubtargetDependencies - Initializes using a CPU and feature string
/// so that we can use initializer lists for subtarget initialization.		/// so that we can use initializer lists for subtarget initialization.
ARMSubtarget &initializeSubtargetDependencies(StringRef CPU, StringRef FS);		ARMSubtarget &initializeSubtargetDependencies(StringRef CPU, StringRef FS);


llvm/test/CodeGen/Thumb2/LowOverheadLoops/memcall.ll

	; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py			; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
	; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -verify-machineinstrs -tail-predication=enabled -o - %s | FileCheck %s			; RUN: llc -mtriple=thumbv8.1m.main-none-none-eabi -mattr=+mve.fp -verify-machineinstrs -tail-predication=enabled -o - %s | FileCheck %s

	define void @test_memcpy(i32* nocapture %x, i32* nocapture readonly %y, i32 %n, i32 %m) {			define void @test_memcpy(i32* nocapture %x, i32* nocapture readonly %y, i32 %n, i32 %m) {
	; CHECK-LABEL: test_memcpy:			; CHECK-LABEL: test_memcpy:
	; CHECK: @ %bb.0: @ %entry			; CHECK: @ %bb.0: @ %entry
	; CHECK-NEXT: .save {r4, r5, r6, r7, r8, r9, lr}			; CHECK-NEXT: .save {r4, r5, r6, r7, lr}
	; CHECK-NEXT: push.w {r4, r5, r6, r7, r8, r9, lr}			; CHECK-NEXT: push {r4, r5, r6, r7, lr}
	; CHECK-NEXT: .pad #4
	; CHECK-NEXT: sub sp, #4
	; CHECK-NEXT: cmp r2, #1			; CHECK-NEXT: cmp r2, #1
	; CHECK-NEXT: blt .LBB0_3			; CHECK-NEXT: blt .LBB0_5
	; CHECK-NEXT: @ %bb.1: @ %for.body.preheader			; CHECK-NEXT: @ %bb.1: @ %for.body.preheader
	; CHECK-NEXT: mov r8, r3			; CHECK-NEXT: lsl.w r12, r3, #2
	; CHECK-NEXT: mov r5, r2			; CHECK-NEXT: movs r7, #0
	; CHECK-NEXT: mov r9, r1			; CHECK-NEXT: b .LBB0_2
	; CHECK-NEXT: mov r7, r0
	; CHECK-NEXT: lsls r4, r3, #2
	; CHECK-NEXT: movs r6, #0
	; CHECK-NEXT: .LBB0_2: @ %for.body			; CHECK-NEXT: .LBB0_2: @ %for.body
	; CHECK-NEXT: @ =>This Inner Loop Header: Depth=1			; CHECK-NEXT: @ =>This Loop Header: Depth=1
	; CHECK-NEXT: adds r0, r7, r6			; CHECK-NEXT: @ Child Loop BB0_4 Depth 2
	; CHECK-NEXT: add.w r1, r9, r6			; CHECK-NEXT: adds r4, r1, r7
	; CHECK-NEXT: mov r2, r8			; CHECK-NEXT: adds r5, r0, r7
	; CHECK-NEXT: bl __aeabi_memcpy4			; CHECK-NEXT: mov r6, r3
	; CHECK-NEXT: add r6, r4			; CHECK-NEXT: wlstp.8 lr, r6, .LBB0_3
				dmgreen: Why does this not use r3 directly?
				malharJ: This seems to be an issue with generating a preHeader during the transform. The phi-node-elimination pass is lowering the phi instructions (in the TP loopBody) as COPY operations (into the PreHeader). In this instance, the copy/mov can be seen below on line 32: mov r7, r3. I've fixed this issue as of now by not generating an extra preHeader during the transform, so the mov ends up above the t2WhileLoopStartLR and overall it seems to work. Please see my comment about the latest changes for more details on this.
	; CHECK-NEXT: subs r5, #1			; CHECK-NEXT: b .LBB0_4
	; CHECK-NEXT: bne .LBB0_2			; CHECK-NEXT: .LBB0_3: @ %for.body
	; CHECK-NEXT: .LBB0_3: @ %for.cond.cleanup			; CHECK-NEXT: @ in Loop: Header=BB0_2 Depth=1
	; CHECK-NEXT: add sp, #4			; CHECK-NEXT: add r7, r12
	; CHECK-NEXT: pop.w {r4, r5, r6, r7, r8, r9, pc}			; CHECK-NEXT: subs r2, #1
				; CHECK-NEXT: beq .LBB0_5
				; CHECK-NEXT: b .LBB0_2
				; CHECK-NEXT: .LBB0_4: @ Parent Loop BB0_2 Depth=1
				; CHECK-NEXT: @ => This Inner Loop Header: Depth=2
				; CHECK-NEXT: vldrb.u8 q0, [r4], #16
				; CHECK-NEXT: vstrb.8 q0, [r5], #16
				; CHECK-NEXT: letp lr, .LBB0_4
				; CHECK-NEXT: b .LBB0_3
				; CHECK-NEXT: .LBB0_5: @ %for.cond.cleanup
				; CHECK-NEXT: pop {r4, r5, r6, r7, pc}
	entry:			entry:
	%cmp8 = icmp sgt i32 %n, 0			%cmp8 = icmp sgt i32 %n, 0
	br i1 %cmp8, label %for.body, label %for.cond.cleanup			br i1 %cmp8, label %for.body, label %for.cond.cleanup

	for.cond.cleanup: ; preds = %for.body, %entry			for.cond.cleanup: ; preds = %for.body, %entry
	ret void			ret void

	for.body: ; preds = %entry, %for.body			for.body: ; preds = %entry, %for.body

llvm/test/CodeGen/Thumb2/mve_tp_loop.ll

This file was added.

				; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
				; RUN: llc -O1 -mtriple=arm-arm-none-eabi -mcpu=cortex-m55 --verify-machineinstrs %s -o - | FileCheck %s
				dmgreen: Shouldn't have -O1 or cpu, use the -mtriple from other similar tests. The test can be called llvm/test/CodeGen/Thumb2/mve-tp-loop.ll.

				; Check that WLSTP loop is not generated for alignment < 4
				; void test1(char* dest, char* src, int n){
				; memcpy(dest, src, n);
				; }

				declare void @llvm.memcpy.p0i8.p0i8.i32(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i32, i1 immarg) #1
				declare void @llvm.memcpy.p0i8.p0i8.i64(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i64, i1 immarg) #1
				dmgreen: Can you make sure there are tests where the i32 is a different type, one that is not legal like an i64.
				malharJ: Do you mean something like: `define void @test(i8* noalias %X, i8* noalias %Y, i64 %n){ call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %X, i8* align 4 %Y, i64 %n, i1 false) ret void }` I get an error when I try to generate the assembly. Since i64 is illegal, what is the expectation here? As a side note, if I generate the IR from C code, the IR always truncates the memcpy size variable to an i32 before calling llvm.memcpy().
				dmgreen: Yeah sure. I think they can still come up, from other places creating memcpy calls. You can probably use DAG.getZExtOrTrunc(Size, MVT::i32), instead of using the size directly.

				define void @test1(i8* noalias nocapture %X, i8* noalias nocapture readonly %Y, i32 %n){
				; CHECK-LABEL: test1:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: bl __aeabi_memcpy
				; CHECK-NEXT: pop {r7, pc}
				entry:
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 1 %X, i8* align 1 %Y, i32 %n, i1 false)
				ret void
				}


				; Check that WLSTP loop is generated for alignment >= 4
				; void test2(int* restrict X, int* restrict Y, int n){
				; memcpy(X, Y, n);
				; }


				define void @test2(i32* noalias %X, i32* noalias readonly %Y, i32 %n){
				; CHECK-LABEL: test2:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: wlstp.8 lr, r2, .LBB1_2
				dmgreen: What is this generating r3 for? I thought those should be removed.
				malharJ: Good spot, I think this is because DCE was not happening for the instructions calculating iterationCount. I had a quick look at ARMLowOverheadLoops::IterationCountDCE(), and it seems that the expectation from the generated MIR is: `// $lr = big-itercount-expression // .. // $lr = t2DoLoopStart/t2WhileLoopStartLR renamable $lr // vector.body:` I've updated the final expression (in the generated MIR) calculating iteration count to now return the result in LR (earlier it was returning in one of the rGPRs) and these are now getting removed.
				; CHECK-NEXT: .LBB1_1: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrb.u8 q0, [r1], #16
				; CHECK-NEXT: vstrb.8 q0, [r0], #16
				; CHECK-NEXT: letp lr, .LBB1_1
				; CHECK-NEXT: .LBB1_2: @ %entry
				; CHECK-NEXT: pop {r7, pc}
				entry:
				%0 = bitcast i32* %X to i8*
				%1 = bitcast i32* %Y to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 4 %0, i8* align 4 %1, i32 %n, i1 false)
				ret void
				}


				; Checks that transform handles some arithmetic on the input arguments.
				; void test3(int* restrict X, int* restrict Y, int n)
				; {
				; memcpy(X+2, Y+3, (n*2)+10);
				; }

				dmgreen: Why is this using printf? It looks like an execution test, not a unit test. Is it testing anything specifically? If so it can probably use any call, not a variadic version of printf.
				malharJ: So the intent of the test was just to check whether code surrounding the memcpy call site is properly transformed. I simply used printf to prevent the code from getting optimized away (seems like a poor way now that I think about it). I've removed this test now since the transformation involving nested loops (in memcall.ll) is already testing the mentioned intent.
				define void @test3(i32* noalias nocapture %X, i32* noalias nocapture readonly %Y, i32 %n) {
				; CHECK-LABEL: test3:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: movs r3, #10
				; CHECK-NEXT: add.w r2, r3, r2, lsl #1
				; CHECK-NEXT: adds r0, #8
				dmgreen: Remove hidden and local_unnamed_addr #0
				; CHECK-NEXT: adds r1, #12
				; CHECK-NEXT: wlstp.8 lr, r2, .LBB2_2
				; CHECK-NEXT: .LBB2_1: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrb.u8 q0, [r1], #16
				; CHECK-NEXT: vstrb.8 q0, [r0], #16
				; CHECK-NEXT: letp lr, .LBB2_1
				; CHECK-NEXT: .LBB2_2: @ %entry
				; CHECK-NEXT: pop {r7, pc}
				entry:
				%add.ptr = getelementptr inbounds i32, i32* %X, i32 2
				%0 = bitcast i32* %add.ptr to i8*
				%add.ptr1 = getelementptr inbounds i32, i32* %Y, i32 3
				%1 = bitcast i32* %add.ptr1 to i8*
				%mul = shl nsw i32 %n, 1
				%add = add nsw i32 %mul, 10
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* nonnull align 4 %0, i8* nonnull align 4 %1, i32 %add, i1 false)
				ret void
				}


				; Checks that the transform handles for loops that are implicitly converted to memcpy
				; void test4(int* restrict X, int* restrict Y, int n){
				; for(int i = 0; i < n; ++i){
				; X[i] = Y[i];
				; }
				; }

				define void @test4(i32* noalias %X, i32* noalias readonly %Y, i32 %n) {
				; CHECK-LABEL: test4:
				; CHECK: @ %bb.0: @ %entry
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: cmp r2, #1
				; CHECK-NEXT: it lt
				; CHECK-NEXT: poplt {r7, pc}
				; CHECK-NEXT: .LBB3_1: @ %for.body.preheader
				; CHECK-NEXT: wlstp.8 lr, r2, .LBB3_3
				; CHECK-NEXT: .LBB3_2: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrb.u8 q0, [r1], #16
				; CHECK-NEXT: vstrb.8 q0, [r0], #16
				; CHECK-NEXT: letp lr, .LBB3_2
				; CHECK-NEXT: .LBB3_3: @ %for.cond.cleanup
				; CHECK-NEXT: pop {r7, pc}
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader: ; preds = %entry
				%X.bits = bitcast i32* %X to i8*
				%Y.bits = bitcast i32* %Y to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 4 %X.bits, i8* align 4 %Y.bits, i32 %n, i1 false)
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.body.preheader, %entry
				ret void
				}

				; Checks that transform can handle > i32 size inputs
				define void @test5(i8* noalias %X, i8* noalias %Y, i64 %n){
				; CHECK-LABEL: test5:
				; CHECK: @ %bb.0:
				; CHECK-NEXT: .save {r7, lr}
				; CHECK-NEXT: push {r7, lr}
				; CHECK-NEXT: wlstp.8 lr, r2, .LBB4_2
				; CHECK-NEXT: .LBB4_1: @ =>This Inner Loop Header: Depth=1
				; CHECK-NEXT: vldrb.u8 q0, [r1], #16
				; CHECK-NEXT: vstrb.8 q0, [r0], #16
				; CHECK-NEXT: letp lr, .LBB4_1
				; CHECK-NEXT: .LBB4_2:
				; CHECK-NEXT: pop {r7, pc}
				call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 4 %X, i8* align 4 %Y, i64 %n, i1 false)
				ret void
				}
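
For the i64-size case exercised by test5, the reviewer's suggestion was to normalise the size operand to i32 with DAG.getZExtOrTrunc. A minimal scalar sketch of that normalisation follows, assuming only the usual meaning of zero-extend-or-truncate; `zextOrTruncToI32` is a hypothetical name and not an LLVM function.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of "zero-extend or truncate to 32 bits": keep the low
// SrcBits bits of Value, then keep the low 32 bits of the result.
inline uint32_t zextOrTruncToI32(uint64_t Value, unsigned SrcBits) {
  assert(SrcBits >= 1 && SrcBits <= 64 && "unsupported source width");
  // Mask to the source width first so narrower inputs are zero-extended.
  uint64_t Mask = SrcBits == 64 ? ~0ull : ((1ull << SrcBits) - 1);
  return static_cast<uint32_t>(Value & Mask);
}
```

An i64 size therefore keeps only its low 32 bits, which is consistent with the test5 checks above using r2 alone as the element count.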

llvm/test/CodeGen/Thumb2/mve_tp_loop.mir

This file was added.

				# NOTE: Assertions have been autogenerated by utils/update_mir_test_checks.py
				# RUN: llc -O1 -mtriple=arm-arm-none-eabi -mcpu=cortex-m55 -simplify-mir -run-pass=finalize-isel %s -o - | FileCheck %s
				--- |
				; ModuleID = 'llvm/test/CodeGen/Thumb2/mve_tp_loop.ll'
				source_filename = "llvm/test/CodeGen/Thumb2/mve_tp_loop.ll"
				target datalayout = "e-m:e-p:32:32-Fi8-i64:64-v128:64:128-a:0:32-n32-S64"
				target triple = "arm-arm-none-eabi"

				; Function Attrs: argmemonly nofree nosync nounwind willreturn
				declare void @llvm.memcpy.p0i8.p0i8.i32(i8* noalias nocapture writeonly, i8* noalias nocapture readonly, i32, i1 immarg) #0

				define void @test1(i32* noalias %X, i32* noalias readonly %Y, i32 %n) #1 {
				entry:
				%0 = bitcast i32* %X to i8*
				%1 = bitcast i32* %Y to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 4 %0, i8* align 4 %1, i32 %n, i1 false)
				ret void
				}

				define void @test2(i32* noalias %X, i32* noalias readonly %Y, i32 %n) #1 {
				entry:
				%cmp6 = icmp sgt i32 %n, 0
				br i1 %cmp6, label %for.body.preheader, label %for.cond.cleanup

				for.body.preheader: ; preds = %entry
				%X.bits = bitcast i32* %X to i8*
				%Y.bits = bitcast i32* %Y to i8*
				call void @llvm.memcpy.p0i8.p0i8.i32(i8* align 4 %X.bits, i8* align 4 %Y.bits, i32 %n, i1 false)
				br label %for.cond.cleanup

				for.cond.cleanup: ; preds = %for.body.preheader, %entry
				ret void
				}

				attributes #0 = { argmemonly nofree nosync nounwind willreturn "target-cpu"="cortex-m55" }
				attributes #1 = { "target-cpu"="cortex-m55" }

				...
				---
				name: test1
				tracksRegLiveness: true
				body: |
				bb.0.entry:
				liveins: $r0, $r1, $r2

				; CHECK-LABEL: name: test1
				; CHECK: liveins: $r0, $r1, $r2
				; CHECK: [[COPY:%[0-9]+]]:rgpr = COPY $r2
				; CHECK: [[COPY1:%[0-9]+]]:rgpr = COPY $r1
				; CHECK: [[COPY2:%[0-9]+]]:rgpr = COPY $r0
				; CHECK: [[t2ADDri:%[0-9]+]]:rgpr = t2ADDri [[COPY]], 15, 14 /* CC::al */, $noreg, $noreg
				; CHECK: [[t2BICri:%[0-9]+]]:rgpr = t2BICri killed [[t2ADDri]], 16, 14 /* CC::al */, $noreg, $noreg
				; CHECK: [[t2LSRri:%[0-9]+]]:gprlr = t2LSRri killed [[t2BICri]], 4, 14 /* CC::al */, $noreg, $noreg
				; CHECK: [[t2WhileLoopSetup:%[0-9]+]]:gprlr = t2WhileLoopSetup killed [[t2LSRri]]
				; CHECK: t2WhileLoopStart [[t2WhileLoopSetup]], %bb.2, implicit-def $cpsr
				; CHECK: .1:
				; CHECK: [[PHI:%[0-9]+]]:rgpr = PHI [[COPY1]], %bb.0, %8, %bb.1
				; CHECK: [[PHI1:%[0-9]+]]:rgpr = PHI [[COPY2]], %bb.0, %10, %bb.1
				; CHECK: [[PHI2:%[0-9]+]]:gprlr = PHI [[t2WhileLoopSetup]], %bb.0, %12, %bb.1
				; CHECK: [[PHI3:%[0-9]+]]:rgpr = PHI [[COPY]], %bb.0, %14, %bb.1
				; CHECK: [[MVE_VCTP8_:%[0-9]+]]:vccr = MVE_VCTP8 [[PHI3]], 0, $noreg
				; CHECK: [[t2SUBri:%[0-9]+]]:rgpr = t2SUBri [[PHI3]], 16, 14 /* CC::al */, $noreg, $noreg
				; CHECK: [[MVE_VLDRBU8_post:%[0-9]+]]:rgpr, [[MVE_VLDRBU8_post1:%[0-9]+]]:mqpr = MVE_VLDRBU8_post [[PHI]], 16, 1, [[MVE_VCTP8_]]
				; CHECK: [[MVE_VSTRBU8_post:%[0-9]+]]:rgpr = MVE_VSTRBU8_post killed [[MVE_VLDRBU8_post1]], [[PHI1]], 16, 1, [[MVE_VCTP8_]]
				; CHECK: [[t2LoopDec:%[0-9]+]]:gprlr = t2LoopDec [[PHI2]], 1
				; CHECK: t2LoopEnd [[t2LoopDec]], %bb.1, implicit-def $cpsr
				; CHECK: t2B %bb.2, 14 /* CC::al */, $noreg
				; CHECK: .2.entry:
				; CHECK: tBX_RET 14 /* CC::al */, $noreg
				%2:rgpr = COPY $r2
				%1:rgpr = COPY $r1
				%0:rgpr = COPY $r0
				MVE_MEMCPYLOOPINST %0, %1, %2
				tBX_RET 14 /* CC::al */, $noreg

				...
				---
				name: test2
				tracksRegLiveness: true
				body: |
				; CHECK-LABEL: name: test2
				; CHECK: bb.0.entry:
				; CHECK: successors: %bb.1(0x50000000), %bb.2(0x30000000)
				; CHECK: liveins: $r0, $r1, $r2
				; CHECK: [[COPY:%[0-9]+]]:rgpr = COPY $r2
				; CHECK: [[COPY1:%[0-9]+]]:rgpr = COPY $r1
				; CHECK: [[COPY2:%[0-9]+]]:rgpr = COPY $r0
				; CHECK: t2CMPri [[COPY]], 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
				; CHECK: t2Bcc %bb.2, 11 /* CC::lt */, $cpsr
				; CHECK: t2B %bb.1, 14 /* CC::al */, $noreg
				; CHECK: bb.1.for.body.preheader:
				; CHECK: successors: %bb.2(0x80000000), %bb.3(0x00000000)
				; CHECK: [[t2ADDri:%[0-9]+]]:rgpr = t2ADDri [[COPY]], 15, 14 /* CC::al */, $noreg, $noreg
				; CHECK: [[t2BICri:%[0-9]+]]:rgpr = t2BICri killed [[t2ADDri]], 16, 14 /* CC::al */, $noreg, $noreg
				; CHECK: [[t2LSRri:%[0-9]+]]:gprlr = t2LSRri killed [[t2BICri]], 4, 14 /* CC::al */, $noreg, $noreg
				; CHECK: [[t2WhileLoopSetup:%[0-9]+]]:gprlr = t2WhileLoopSetup killed [[t2LSRri]]
				; CHECK: t2WhileLoopStart [[t2WhileLoopSetup]], %bb.2, implicit-def $cpsr
				; CHECK: bb.3:
				; CHECK: [[PHI:%[0-9]+]]:rgpr = PHI [[COPY1]], %bb.1, %8, %bb.3
				; CHECK: [[PHI1:%[0-9]+]]:rgpr = PHI [[COPY2]], %bb.1, %10, %bb.3
				; CHECK: [[PHI2:%[0-9]+]]:gprlr = PHI [[t2WhileLoopSetup]], %bb.1, %12, %bb.3
				; CHECK: [[PHI3:%[0-9]+]]:rgpr = PHI [[COPY]], %bb.1, %14, %bb.3
				; CHECK: [[MVE_VCTP8_:%[0-9]+]]:vccr = MVE_VCTP8 [[PHI3]], 0, $noreg
				; CHECK: [[t2SUBri:%[0-9]+]]:rgpr = t2SUBri [[PHI3]], 16, 14 /* CC::al */, $noreg, $noreg
				; CHECK: [[MVE_VLDRBU8_post:%[0-9]+]]:rgpr, [[MVE_VLDRBU8_post1:%[0-9]+]]:mqpr = MVE_VLDRBU8_post [[PHI]], 16, 1, [[MVE_VCTP8_]]
				; CHECK: [[MVE_VSTRBU8_post:%[0-9]+]]:rgpr = MVE_VSTRBU8_post killed [[MVE_VLDRBU8_post1]], [[PHI1]], 16, 1, [[MVE_VCTP8_]]
				; CHECK: [[t2LoopDec:%[0-9]+]]:gprlr = t2LoopDec [[PHI2]], 1
				; CHECK: t2LoopEnd [[t2LoopDec]], %bb.3, implicit-def $cpsr
				; CHECK: t2B %bb.2, 14 /* CC::al */, $noreg
				; CHECK: bb.2.for.cond.cleanup:
				; CHECK: tBX_RET 14 /* CC::al */, $noreg
				bb.0.entry:
				successors: %bb.1(0x50000000), %bb.2(0x30000000)
				liveins: $r0, $r1, $r2

				%2:rgpr = COPY $r2
				%1:rgpr = COPY $r1
				%0:rgpr = COPY $r0
				t2CMPri %2, 1, 14 /* CC::al */, $noreg, implicit-def $cpsr
				t2Bcc %bb.2, 11 /* CC::lt */, $cpsr
				t2B %bb.1, 14 /* CC::al */, $noreg

				bb.1.for.body.preheader:
				successors: %bb.2(0x80000000)

				MVE_MEMCPYLOOPINST %0, %1, %2

				bb.2.for.cond.cleanup:
				tBX_RET 14 /* CC::al */, $noreg

				...
				dmgreen: Some of this can be removed, to help keep the test smaller.