This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Favor pre-increments and implement TTI::getPreferredAddressingMode
AbandonedPublic

Authored by SjoerdMeijer on Oct 19 2020, 6:06 AM.

Details

Summary

We are missing an opportunity to generate more efficient addressing modes, pre-indexed loads/stores, which avoid incrementing pointers with separate add instructions. This implements the target hook getPreferredAddressingMode for AArch64; it is queried by LoopStrengthReduce, which can then bring loops into a form that is better suited to generating pre-increments.
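
For orientation, a minimal sketch of the mechanism being described, assuming the TTI::AddressingModeKind enum that the getPreferredAddressingMode interface introduces (the names come from that interface, not from this diff, and the query is simplified compared to the real code in LoopStrengthReduce.cpp):

// LoopStrengthReduce asks the target for its addressing-mode preference and
// biases its solutions towards formulae that fold the pointer update into
// the load/store itself.
TTI::AddressingModeKind AMK = TTI.getPreferredAddressingMode(L, &SE);
bool FavorPreInc = (AMK == TTI::AMK_PreIndexed);
bool FavorPostInc = (AMK == TTI::AMK_PostIndexed);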

Diff Detail

Event Timeline

SjoerdMeijer created this revision.Oct 19 2020, 6:06 AM
SjoerdMeijer requested review of this revision.Oct 19 2020, 6:06 AM
samparker added a comment.EditedOct 19 2020, 7:14 AM

How about results from the LLVM test suite? This should benefit size and performance?

So far as we have used it, this option means "aggressively optimize for postincs". That is useful for MVE, where not using postincs in loops can have a significant effect on performance. I'm less sure about AArch64, but it might make sense. The option tends to make loops use more registers, which might be OK on an architecture with a lot of registers.

Which improves one benchmark by 1.2%, and didn't show any changes in another that I didn't expect to be impacted anyway (just checking as a sanity check).

Does this mean you've run 2 benchmarks? What does SPEC or the llvm test suite say?

Cheers guys, that's fair, will give the llvm test suite a try too.

SjoerdMeijer added a comment.EditedOct 20 2020, 12:41 AM

I've run CTMark, filtered out the tests that run only very briefly with --filter-short, and I see the same trend, i.e. no regressions and some okay improvements:

Tests: 10
Short Running: 4 (filtered out)
Remaining: 6
Metric: exec_time
Program                                        before after  diff 
 test-suite :: CTMark/lencod/lencod.test         8.57   8.58  0.0%
 test-suite...Mark/mafft/pairlocalalign.test    56.49  56.46 -0.0%
 test-suite...TMark/7zip/7zip-benchmark.test    10.23  10.15 -0.7%
 test-suite...:: CTMark/sqlite3/sqlite3.test     3.93   3.90 -0.8%
 test-suite :: CTMark/Bullet/bullet.test        10.17  10.01 -1.6%
 test-suite :: CTMark/SPASS/SPASS.test          26.04  25.09 -3.7%
 Geomean difference                                          -1.1%
          before      after      diff
count  6.000000   6.000000   6.000000
mean   19.238417  19.031417 -0.011363
std    19.723614  19.679435  0.013708
min    3.933800   3.901200  -0.036644
25%    8.971825   8.934025  -0.013844
50%    10.198600  10.080500 -0.007888
75%    22.087150  21.352350 -0.002162
max    56.486600  56.464800  0.000327

How many runs was that? How much noise is there?

It's encouraging at least. Can you run SPEC too? Tamar has a good low-noise system. This isn't a small change to be taken lightly, even if the patch is small in number of lines changed.

Just curious if there's something in particular you are concerned about? I am just asking because then I can focus on that.

I have rerun the experiment and, instead of 5 runs, I am now comparing 10 runs, taking the minimum runtime of each run, like this:

../test-suite/utils/compare.py --filter-short before.1.json before.2.json before.3.json before.4.json before.5.json before.6.json before.7.json before.8.json before.9.json before.10.json vs after.1.json after.2.json after.3.json after.4.json after.5.json after.6.json after.7.json after.8.json after.9.json after.10.json

This gives:

Tests: 10
Short Running: 4 (filtered out)
Remaining: 6
Metric: exec_time
Program                                        lhs    rhs    diff 
 test-suite...Mark/mafft/pairlocalalign.test    56.41  56.48  0.1%
 test-suite :: CTMark/Bullet/bullet.test         9.99   9.98 -0.0%
 test-suite :: CTMark/SPASS/SPASS.test          25.11  25.08 -0.1%
 test-suite :: CTMark/lencod/lencod.test         8.58   8.57 -0.2%
 test-suite...TMark/7zip/7zip-benchmark.test    10.16  10.11 -0.4%
 test-suite...:: CTMark/sqlite3/sqlite3.test     3.86   3.84 -0.5%
 Geomean difference                                          -0.2%
             lhs        rhs      diff
count  6.000000   6.000000   6.000000
mean   19.016133  19.009183 -0.001906
std    19.665927  19.699874  0.002340
min    3.858400   3.839700  -0.004847
25%    8.933775   8.919975  -0.003747
50%    10.071450  10.047500 -0.001556
75%    21.368850  21.336775 -0.000573
max    56.406300  56.476400  0.001243

Ah, wait a minute, I am doing the last experiment again, just double checking that I haven't made a mistake running them.

Okay, so my last results basically show the noise: I had forgotten to do a rebuild between the runs. Now the results look less convincing... looking into it.


FWIW the CTmark subset runs for quite a short time on beefier cores, so it might be good to also do some SPEC runs.

Yep, cheers, hopefully SPEC is better and more conclusive. The 1.2% uplift in one benchmark was on baremetal aarch64, will check if I can run some more things on that too.

SPECInt numbers:

500.perlbench_r	-0.36%
502.gcc_r	 0.25%
505.mcf_r	-0.23%
520.omnetpp_r	-0.16%
523.xalancbmk_r	-0.48%
525.x264_r	 1.38%
531.deepsjeng_r	-0.20%
541.leela_r	-0.11%
548.exchange2_r	 0.01%
557.xz_r	-0.03%

These are runtime reductions in percent, so a negative number is good and a positive number is a regression.
Overall, a small win, and it does what I want, except that 525.x264 is a bit of a negative outlier that makes things less rosy and needs looking into.
And I also need to look at what Sam has cooked up in D89894...

Those numbers don't look too bad, but like you say it's probably worth looking into what x264_r is doing, just to see what is going on. Sanne ran some other numbers from the burst compiler and they were about the same - some small improvements, a couple of small losses but overall OK. That gives us confidence that big out of order cores are not going to hate this.

The original tests were on an in-order core I believe? Which from the optimization guide looks like it should be sensible to use. And the option doesn't seem to be messing anything up especially.

Can you add a test for vector postincs, the kind of thing that you would get in a loop? I only see changes for scalars here.

llvm/test/CodeGen/AArch64/shrink-wrapping-vla.ll
91 (On Diff #299030)

How is this test changing? Just the initial operand for the add?

Those numbers don't look too bad, but like you say it's probably worth looking into what x264_r is doing, just to see what is going on. Sanne ran some other numbers from the burst compiler and they were about the same - some small improvements, a couple of small losses but overall OK. That gives us confidence that big out of order cores are not going to hate this.

The original tests were on an in-order core I believe? Which from the optimization guide looks like it should be sensible to use. And the option doesn't seem to be messing anything up especially.

I analysed x264 and couldn't find any concerning codegen changes in the top 6 hottest functions in the profile. Then I did more runs and concluded I must have been looking at noise (again), as I don't see that 1.38% regression anymore; it is more like 0.5% if it happens at all. Overall, my conclusion is in line with yours: worst case, this change is neutral on bigger cores, but probably a small gain, and in my first experiment on an in-order core I indeed see decent speed-ups. Intuitively this makes sense: the bigger out-of-order cores can probably deal better with inefficient code, while the smaller ones are more sensitive to it, so the optimisation has more effect there. While looking at x264, I did observe this for some cases:

The option tends to make loops use more registers,

I saw some more registers being used in preheaders to set up pointers, but that didn't seem to affect the loop itself.

Can you add a test for vector postincs, the kind of thing that you would get in a loop? I only see changes for scalars here.

Cheers, will look at this now.

SjoerdMeijer added a comment.EditedFeb 12 2021, 2:30 AM

I wanted to abandon this change in favour of D89894. The reason is that D89894 works for unrolled loops, and this change doesn't. But D89894 doesn't really work when things haven't been unrolled, so I think there's actually value in having them both, and I will progress that.

This means shouldFavorPostInc() would need to decide whether it is working on unrolled loops or not, but the current interface doesn't allow that:

/// \return True if LSR should make efforts to create/preserve post-inc
/// addressing mode expressions.
bool shouldFavorPostInc() const;
/// Return true if LSR should make efforts to generate indexed addressing
/// modes that operate across loop iterations.
bool shouldFavorBackedgeIndex(const Loop *L) const;

I will first make shouldFavorPostInc consistent with shouldFavorBackedgeIndex to accept Loop *L as an argument, so that we can look at its induction variable and step size.
After that interface change, I will then continue here to implement that.

Sounds good. How about unifying it to one call to get the type of preferred indexing: none, pre and post?


That would be best, agreed, as it's quite incomprehensible at the moment what things are and how they interact.

I will see if I can first refactor this into 1 call.
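
For reference, one way the unified call could look; this is a sketch based on the discussion above, with enumerator names that are assumptions mirroring the semantics of the existing shouldFavorPostInc/shouldFavorBackedgeIndex hooks:

/// Kinds of loop addressing-mode preference a target can express, replacing
/// the two separate shouldFavor* hooks.
enum AddressingModeKind {
  AMK_PreIndexed,  // prefer pre-increment (roughly shouldFavorBackedgeIndex)
  AMK_PostIndexed, // prefer post-increment (roughly shouldFavorPostInc)
  AMK_None         // no preference
};

/// \return the preferred addressing mode for loop \p L, so that LSR can
/// shape the loop accordingly.
AddressingModeKind getPreferredAddressingMode(const Loop *L,
                                              ScalarEvolution *SE) const;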

Ideally LSR would account for the pre/postinc in its cost modelling and automatically pick the correct one. The shouldFavorPostInc would work on top of that to more consistently produce postinc in cases like MVE where they are so beneficial. But until that happens this sounds like a fine plan. And we can probably use the same thing to do something better under MVE, where not all loops are vector loops, after all.

If it is doing any real amount of processing on the loop, it will need to be calling shouldFavorPostInc less, caching the result somewhere for the loop.

SjoerdMeijer updated this revision to Diff 324940.EditedFeb 19 2021, 3:59 AM
SjoerdMeijer retitled this revision from [AArch64] Favor post-increments to [AArch64] Favor post-increments and implement TTI::getPreferredAddressingMode.
SjoerdMeijer edited the summary of this revision. (Show Details)

This is now "rebased" on the changes that introduced getPreferredAddressingMode, see also the updated title/description of this change.

I have rerun the numbers and not much has fundamentally changed: this is a good thing to do. The only new insight is that if we also start doing runtime loop unrolling, the picture changes: in that case, pre-indexed addressing modes are better. But with getPreferredAddressingMode now in place, we can easily adapt to that when we start doing that (see also my comment in the implementation of getPreferredAddressingMode), which is what we will be looking into next.

fhahn added inline comments.Feb 19 2021, 8:16 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1287

What if the loop has multiple induction phis and one of those has a large constant step or unknown step? Should we instead iterate over all phis and check all induction phis (using InductionDescriptor::isInductionPHI to identify them)?
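
For illustration, a sketch of the iteration being suggested here. The InductionDescriptor API calls are real; the helper name and the step-size threshold are assumptions, not part of this patch:

#include "llvm/Analysis/IVDescriptors.h"
#include "llvm/Analysis/LoopInfo.h"
#include "llvm/Analysis/ScalarEvolution.h"
#include "llvm/IR/Constants.h"
#include "llvm/IR/Instructions.h"
using namespace llvm;

// Hypothetical helper: return false if any induction phi in the loop header
// has an unknown step or a constant step larger than an illustrative
// threshold, otherwise true.
static bool allInductionStepsAreSmall(const Loop *L, ScalarEvolution *SE) {
  for (PHINode &Phi : L->getHeader()->phis()) {
    InductionDescriptor ID;
    if (!InductionDescriptor::isInductionPHI(&Phi, L, SE, ID))
      continue; // not an induction phi
    const ConstantInt *Step = ID.getConstIntStepValue();
    if (!Step || Step->getValue().abs().ugt(8)) // threshold is an assumption
      return false; // unknown or large constant step
  }
  return true;
}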

SjoerdMeijer added inline comments.Feb 19 2021, 8:46 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1287

Yeah, I thought about that a bit. I don't think we are interested in all induction phis here. I think we are interested in what is called the PrimaryInduction in the loop vectoriser, i.e. the one that actually controls the loop. And this seems to match exactly with what getInductionVariable() promises to return, which is used by getInductionDescriptor. That's why this looked okay to me...

fhahn added inline comments.Feb 19 2021, 9:04 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1287

I am not sure I completely understand why the IV that controls the loop is special when it comes to picking the addressing mode for the loop? As D97050 indicates, getInductionVariable is quite brittle and probably misses additional cases, so if we can avoid using it the code should be more robust.

If we have multiple IVs, would we not be interested in whether we can use post-increments for all of the ones that access memory? You could have a loop with 2 IVs, one to control the loop and one to access memory, like below. If we use the offset of the IV controlling the loop, post-index is profitable, but there won't be any accesses for that variable. (In this example, the inductions can probably be simplified to a single one, but it keeps things simple.)

int I = 0, J = 0;

while (I != N) {
  Ptr[J] = 0;
  I++;
  J += 2000;
}
SjoerdMeijer added inline comments.Feb 19 2021, 9:41 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1287

Ah, okay, I misunderstood, but I understand your point now! So your suggestion is about whether we need a more fine-grained (precise) heuristic.

I would need to give this some more thought, but the current heuristic is based on the distinction between unrolled and not-unrolled loops (which is why I use the primary IV), and that simple heuristic seems to work. In general, I think this is quite a difficult problem, given that there are a few addressing modes and potentially quite a few inductions to analyse. This will come at a cost, and it's unclear at this point whether it will improve results.

I have verified that the implementation in D89894 works (for runtime-unrolled loops), but it indeed reveals an inefficiency (missed opportunity) in the load/store optimiser, as noted there, which means we can't use that yet for enabling pre-indexed accesses. But perhaps I can use that heuristic, which does a bit more analysis, to decide when *not* to generate pre-indexed but post-indexed accesses.

But this slightly improved heuristic in D89894 may still not be precise enough for your liking... I will try to experiment a little bit with that, but at the moment I tend to think that this is a step forward, and this could be improved when we find the need for that?

This abandons the idea of looking at the IV, and incorporates D89894 to look at the pointer uses.

dmgreen added inline comments.Feb 24 2021, 3:15 AM
llvm/lib/Target/AArch64/AArch64TargetTransformInfo.cpp
1293

I'm not sure I understand this logic. Won't it detect many things that are not unrolled loops? Why does it start at 0?

I think it's worth adding a few deliberate tests for the different cases this is trying to optimize: some for simple loops, some for more complex unrolled loops, and some for vectorized/interleaved loops, showing how the codegen changes now that we change this default option.

llvm/test/CodeGen/AArch64/vldn_shuffle.ll
2

I don't think we should be adding this to unrelated tests if we can help it. For tests like this that are testing something else, it's OK to just check the default.

Same for all the other tests. From what I can tell they seem the same or better?

Thanks for commenting Dave. I will have another go at this, and try to come up with a better analysis, at least one we understand.

SjoerdMeijer retitled this revision from [AArch64] Favor post-increments and implement TTI::getPreferredAddressingMode to [AArch64] Favor pre-increments and implement TTI::getPreferredAddressingMode.
SjoerdMeijer edited the summary of this revision. (Show Details)
SjoerdMeijer added reviewers: stelios-arm, NickGuy.

This is changing our approach to preferring pre-indexed addressing modes, because:

  • this is what we want for runtime-unrolled loops, which is what we want to address next,
  • post-indexed is currently better for some cases, but that's mainly because we miss an opportunity in the load/store optimiser. With that fixed, the expectation is that pre-indexed gives the same or better performance than post-indexed,
  • for what it is worth, pre-indexed is also the default for ARM,

And as a consequence of the above, this implementation becomes really straightforward, which is another benefit.
But again, we can't commit this yet, because it depends on changes in the load/store optimiser which we will address first.
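
Under those assumptions, the straightforward implementation would look roughly like the sketch below (assuming the TTI::AddressingModeKind enum; this is not the exact diff):

// Sketch of the AArch64 override: simply state a preference for pre-indexed
// addressing; LSR then tries to keep the loop in a shape where the pointer
// update can be folded into the load/store.
TTI::AddressingModeKind
AArch64TTIImpl::getPreferredAddressingMode(const Loop *L,
                                           ScalarEvolution *SE) const {
  return TTI::AMK_PreIndexed;
}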

Matt added a subscriber: Matt.Apr 13 2021, 7:21 AM
SjoerdMeijer abandoned this revision.Mar 17 2023, 1:40 AM
Herald added a project: Restricted Project. Mar 17 2023, 1:40 AM
Herald added a subscriber: StephenFan.