This is an archive of the discontinued LLVM Phabricator instance.

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
598	This is just moving some code from `optimizeBlock`? If that's unrelated having this in a separate patch would make it slightly easier to review IMO :)
1668	This is just moving some code from `optimizeBlock`?

The load/store opt pass is already pretty expensive in terms of compile-time. Did you see any compile-time regressions in your testing? Also, what performance results have you collected?

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
599	+1 to Florian's comment. Also, this comment should remain in optimizeBlock.
1403–1404	Should be } else {

mcrosier added inline comments.Nov 14 2017, 6:48 AM

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
660	You could just add SM and STI to the AArch64LoadStoreOpt class and then initialize it in runOnMachineFunction.
689	You can drop this else.
1373	Add a comment: // Evaluate if the new instruction is a better choice than the old ones.
1374	I think you can return NextI here. This will ensure we skip the Update instruction (i.e., the base address add/sub), which is never a candidate for LdStUpdateMerging.

mcrosier added a reviewer: junbuml.Nov 14 2017, 6:49 AM

In D39976#924618, @mcrosier wrote:

The load/store opt pass is already pretty expensive in terms of compile-time. Did you see any compile-time regressions in your testing? Also, what performance results have you collected?

Cost model check is done at the very end when we are merging instructions, after most of the other checks are already done. Having said that, yes compile time regression test is a good idea. As for the performance, A72 and A57 both have higher latency with decoder bubble on load with pre-post update. That is definitely going to benefit where we look at cost model and decide not to create LD update. If we were to leave LD/ADD in epilogue and this pass not create LD update instruction, that's benefit as well.

No objection in high level. IMHO, it will not cause significant compile time regression, but better to make sure. I also agree with splitting this patch: one for the profitability check and one for refactoring.

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
656	No need to repeat this check for every candiate.
674–676	Can we guarantee these pointers are always none null?

There's some code refactoring here that I'll put in a separate patch soon.

Also, since the the extra check is only performed for candidate instructions, the impact on compile time is negligible.

The performance difference can be significant, especially in FP benchmarks with tight loops over gobbles of data.

Thank you for your feedback.

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
674–676	Yes because `getSchedClassDesc()` returns the address of a static array element.

Please, let me know if I addressed your concerns before I split the code refactoring in another patch.

Thank you,

In D39976#925078, @evandro wrote:

Please, let me know if I addressed your concerns before I split the code refactoring in another patch.

One minor comment, but feel free to go ahead and split the patch.

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
657	Rather than repeatedly call hasInstrSchedModel(), why not just store the result in a bool much like you did with OptForSize..

evandro marked an inline comment as done.Nov 14 2017, 3:59 PM

evandro added a parent revision: D40090: [AArch64] Refactor the loads and stores optimizer.Nov 15 2017, 9:50 AM

Updated the patch to contain just the cost model evaluation after separating the code refactoring in a separate patch, D40090.

evandro mentioned this in D40107: [AArch64] Remove obsoleted feature.Nov 15 2017, 4:04 PM

evandro added a child revision: D40107: [AArch64] Remove obsoleted feature.

junbuml added inline comments.Nov 16 2017, 8:53 AM

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
695	Why don't we use <= , instead of < ?

evandro added inline comments.Nov 16 2017, 11:52 AM

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
695	Because typically decoding an instruction into multiple uops costs more in the front end of the pipeline than decoding multiple instructions into single uops.

Clarify the heuristics used when evaluating the costs.

evandro marked 2 inline comments as done.Nov 17 2017, 9:26 AM

junbuml added inline comments.Nov 17 2017, 9:40 AM

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
695	Can you add more comment why you do this? Is this target independent ?

evandro added inline comments.Nov 17 2017, 10:49 AM

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
695	I'm not a hardware designer, but AFAIK many targets hiccup when an instruction is decoded into more than one uop. How bad the hiccup is, if at all, does depend on the target though. Some decrease the decode bandwidth, typically by inserting a bubble whose size depends on the design. This heuristic is an attempt at mitigating the new instruction inducing such a bubble. If the new instruction has a shorter latency, then it's chosen. One might wonder if it's still a good choice if it induces a bubble, but I could not devise a satisfying heuristic. If the latency of the new instruction is the same as the combined latency of the both old ones, then the potential of inducing a bubble is considered. If either of the old instructions had multiple uops, then even if the new one has them too it's probably no worse than before. However, if neither of the old instructions resulted in multiple uops, the new one is chosen only if it results in fewer uops than before. One might argue that, if bubbles when decoding into multiple uops are the norm among targets, it'd be better to choose the new instruction only if it doesn't potentially induce bubbles itself. If the new instruction has a longer latency, then it's discarded. Again, if it mitigates decode bubbles it might still be profitable, but the conditions seem hard to weigh in general.

Try to make the heuristics code clearer.

evandro added a parent revision: D40511: [AArch64] Fix scheduling resources for post indexed loads and stores.Nov 27 2017, 11:54 AM

Update test cases.

Add diagnostic messages.

Ping 🔔

For me, the heuristic seems pretty reasonable, but I'm not fully sure if this can make a right decision target-independently. Someone else who have better idea in hardware level may need to review this.

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
691	Isn't it possible to do just TSM.getNumMicroOps(&MIA) and remove above *SCA = TSM.resolveSchedClass(&MIA) in line 678.
707	I'm not sure if it's okay to compare values from computeInstrLatency() and getNumMicroOps() ?

@kristof.beyls, @mcrosier, @apinski-cavium?

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
691	Yes, it is.
707	That's what the heuristic does, as briefly explained in the debug message below.

Add test case for the STREAM benchmark checking the results for most targets. This way, the target maintainers can evaluate the effects of this patch on the generated code for the respective targets.

evandro marked an inline comment as done.Dec 8 2017, 1:08 PM

¡Ping! 🔔🔔

gberry added inline comments.Dec 18 2017, 11:21 AM

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
695	It seems like if you want to take this multi-uop bubble into account, you should make it explicit, and make the subtarget opt in to it. The simple heuristic here to me would be: if (newLatency < oldLatency) return true; if (newLatency > oldLatency) return false; if (newUops < oldUops) return true; if (newUops > oldUops) return false; if (newMultiUopPenalty < oldMultiUopPenalty) return true; return false;
707	I think Jun's point was that these values are in completely different units, so comparing them doesn't make sense.

evandro added inline comments.Dec 18 2017, 11:49 AM

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
695	How do you propose that the new penalty be calculated?
707	It´s very common in process control to use one variable to effect another, as long as there´s at least an implied correlation between them.

Attempt at Interpreting @gberry´s suggestion.

Like the previous update, but favoring code size on A57, the only other target I can test on, while keeping the same performance.

Ping! 🔔

¡Ping! 🔔🔔

gberry added reviewers: MatzeB, t.p.northover.Jan 11 2018, 12:04 PM

gberry added inline comments.

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
714	I would suggest that unless we know of a case where this makes a difference, make the simpler change of just doing return UopDif <= 0 at this point, assuming that has the desired effect for your target. This assumes that with latency and uops being equal, a reduction in instructions could still be beneficial.
llvm/test/CodeGen/AArch64/ldst-opt.ll
1034	Are these changes intentional?
1343	Are all of these changes needed?
llvm/test/CodeGen/AArch64/machine-outliner-remarks.ll
98	Are these changes needed?
llvm/test/CodeGen/AArch64/stream-neon.ll
1 ↗	(On Diff #127610)	This seems like it is too large to be a lit test. Is it really testing anything that hasn't been tested already?

evandro added inline comments.Jan 11 2018, 1:15 PM

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp
714	This would lead to the test below, `stream-neon.ll`, to fail for the Exynos targets. The reason is that, for instance, though the latency of loading a pair of quad registers with the post index addressing mode has the same latency as when using the register offset addressing mode, it is a two uop instruction, which severely limits the decode bandwidth by inserting a bubble. This is a fairly common issue in other targets too, which this heuristic tries to capture, assuming that it's better to keep the decode bandwidth for a couple single uop instructions than incurring a decode bubble for a multiple uop instruction. Except, of course, when optimizing for size.
llvm/test/CodeGen/AArch64/ldst-opt.ll
1034	Yes, though not necessary. Checking for the return is not germane to the feature being tested.
1343	Ditto.
llvm/test/CodeGen/AArch64/machine-outliner-remarks.ll
98	Yes, since the changes in this patch affect the resulting code for this test, unless it's optimized for size.
llvm/test/CodeGen/AArch64/stream-neon.ll
1 ↗	(On Diff #127610)	Probably not. Perhaps the same could be accomplished if `ldst-opt.ll` were changed to test for the same targets as this test, when the result would be same.

evandro mentioned this in D40511: [AArch64] Fix scheduling resources for post indexed loads and stores.Jan 12 2018, 11:01 AM

Ping! 🔔

¡Ping! 🔔🔔

Here are some results I got on a Juno board with {{-mcpu=cortex-a57}}:

SPECint2000	%
164.gzip	0.0
175.vpr	-0.4
176.gcc	0.0
177.mesa	0.8
179.art	0.1
181.mcf	-0.3
183.equake	0.9
186.crafty	0.8
188.ammp	0.0
197.parser	0.3
252.eon	-0.2
253.perlbmk	-0.2
254.gap	0.1
255.vortex	0.0
256.bzip2	0.5
300.twolf	5.1

I've thought about this some more and tested it out on Falkor. As currently written this change causes SIMD store instructions to not have pre/post increments folded into them, causing minor performance regressions. I have the following general reservations as well:

does using the max latency of the load/store and add make sense given that the operations are dependent?
does always favoring latency over number of uops (an approximation of throughput) make sense? unless the operation is on the critical path I would think not.

This combined with the assumptions about multiple uop instructions (which also is not true for Falkor), I would suggest perhaps a better approach would be a add a target-specific property that would allow you to avoid the specific opcodes that are a problem for your target.

In D39976#998081, @gberry wrote:

I've thought about this some more and tested it out on Falkor. As currently written this change causes SIMD store instructions to not have pre/post increments folded into them, causing minor performance regressions.

I see that they're modeled with a latency of 0 cycles and 4 uops. Are the units they need, ST and VSD, really used for 0 cycles?

I have the following general reservations as well:

does using the max latency of the load/store and add make sense given that the operations are dependent?

They're only dependent for the pre index addressing mode. However, since the latency of the load and of the store is considerably larger even in this case, methinks that it's a sensible approximation.

does always favoring latency over number of uops (an approximation of throughput) make sense? unless the operation is on the critical path I would think not.

In previous versions of this patch I tried to weigh both metrics, but found it difficult to come up with a satisfying heuristic. Any ideas?

This combined with the assumptions about multiple uop instructions (which also is not true for Falkor), I would suggest perhaps a better approach would be a add a target-specific property that would allow you to avoid the specific opcodes that are a problem for your target.

Perhaps the cost function could be target specific behind. Thoughts?

Make the heuristics processor specific.

Ping! 🔔

¡Ping! 🔔🔔

Would it not be simpler to just add a subtarget bool that controls whether the problematic opcodes are emitted and set it for your subtargets (similar to the way STRQroIsSlow is handled)? That way you could avoid generating them not just in this pass, but in ISel and frame lowering?

In D39976#1017321, @gberry wrote:

Would it not be simpler to just add a subtarget bool that controls whether the problematic opcodes are emitted and set it for your subtargets (similar to the way STRQroIsSlow is handled)? That way you could avoid generating them not just in this pass, but in ISel and frame lowering?

Methinks that the gist is to move away from features and to rely more on the cost model. In the case of this patch, it also removes the feature FeatureSlowPaired128 in D40107.

In D39976#1017864, @evandro wrote:

In D39976#1017321, @gberry wrote:

Would it not be simpler to just add a subtarget bool that controls whether the problematic opcodes are emitted and set it for your subtargets (similar to the way STRQroIsSlow is handled)? That way you could avoid generating them not just in this pass, but in ISel and frame lowering?

Methinks that the gist is to move away from features and to rely more on the cost model. In the case of this patch, it also removes the feature FeatureSlowPaired128 in D40107.

That seems like a worthwhile goal, but this change doesn't really seem to be accomplishing that. If the sched model is being used by a subtarget-specific heuristic, that seems like just a more roundabout way of achieving the same result for your subtarget. Is there any net effect of this change combined with D40107?

In D39976#1027475, @gberry wrote:

In D39976#1017864, @evandro wrote:

Methinks that the gist is to move away from features and to rely more on the cost model. In the case of this patch, it also removes the feature FeatureSlowPaired128 in D40107.

That seems like a worthwhile goal, but this change doesn't really seem to be accomplishing that. If the sched model is being used by a subtarget-specific heuristic, that seems like just a more roundabout way of achieving the same result for your subtarget. Is there any net effect of this change combined with D40107?

FeatureSlowPaired128 was just too coarse. The alternative would be to change it to something more specific, like FeatureSlowSomePaired128Sometimes, and then create yet another when for the next generation to specialize it further. Instead, querying the scheduling model seems to be a much more reasonable approach.

In D39976#998081, @gberry wrote:

I've thought about this some more and tested it out on Falkor. As currently written this change causes SIMD store instructions to not have pre/post increments folded into them, causing minor performance regressions.

Were a test with -mcpu=falkor added to llvm/test/CodeGen/AArch64/ldst-opt.ll, how should the checks look like?

Also, can you please address my first question in https://reviews.llvm.org/D39976#999520 ?

Ping! 🔔

In D39976#1027478, @evandro wrote:

In D39976#1027475, @gberry wrote:

In D39976#1017864, @evandro wrote:

Methinks that the gist is to move away from features and to rely more on the cost model. In the case of this patch, it also removes the feature FeatureSlowPaired128 in D40107.

That seems like a worthwhile goal, but this change doesn't really seem to be accomplishing that. If the sched model is being used by a subtarget-specific heuristic, that seems like just a more roundabout way of achieving the same result for your subtarget. Is there any net effect of this change combined with D40107?

FeatureSlowPaired128 was just too coarse. The alternative would be to change it to something more specific, like FeatureSlowSomePaired128Sometimes, and then create yet another when for the next generation to specialize it further. Instead, querying the scheduling model seems to be a much more reasonable approach.

I'm more confused now. 'FeatureSlowPaired128' controls whether certain load/store opcodes are combined to form paired load/stores. But this change prevents some load/store opcodes from having their base register increment folded in. The two seem unrelated.

I'm also concerned that this change is introducing a very specific target hook and is recomputing the same "slowness" of opcodes over and over even though it doesn't depend on the context. Perhaps a more general subtarget array of "slow" opcodes would be a better choice, which Exynos could initialize based on its scheduling model for these opcodes if you think there is going to be differences in future CPUs.

In D39976#1031443, @evandro wrote:

In D39976#998081, @gberry wrote:

I've thought about this some more and tested it out on Falkor. As currently written this change causes SIMD store instructions to not have pre/post increments folded into them, causing minor performance regressions.

Were a test with -mcpu=falkor added to llvm/test/CodeGen/AArch64/ldst-opt.ll, how should the checks look like?

I would expect this change to not change the code generated for Falkor, so whatever is generated currently

In D39976#999520, @evandro wrote:

In D39976#998081, @gberry wrote:

I've thought about this some more and tested it out on Falkor. As currently written this change causes SIMD store instructions to not have pre/post increments folded into them, causing minor performance regressions.

I see that they're modeled with a latency of 0 cycles and 4 uops. Are the units they need, ST and VSD, really used for 0 cycles?

That's not how I understand the scheduling model to work. Resources are used for 'ResourceCycles' cycles (which defaults to 1), so in this case ST and VSD are used for 1 cycle. The Latency of the store is set to 0, since it doesn't write a register, so the latency doesn't mean anything as far as I can tell. The pre/post increment version has a Latency of 3 on the first def index (i.e. the updated base register) since that is the latency of reading this new register value.

In D39976#1040345, @gberry wrote:

In D39976#1027478, @evandro wrote:

FeatureSlowPaired128 was just too coarse. The alternative would be to change it to something more specific, like FeatureSlowSomePaired128Sometimes, and then create yet another when for the next generation to specialize it further. Instead, querying the scheduling model seems to be a much more reasonable approach.

I'm more confused now. 'FeatureSlowPaired128' controls whether certain load/store opcodes are combined to form paired load/stores. But this change prevents some load/store opcodes from having their base register increment folded in. The two seem unrelated.

This change is more generic and flexible than FeatureSlowPaired128. This change controls not only when loads and stores are paired, but also other foldings that this pass performs, including the pre or post indexing of the offset register.

I'm also concerned that this change is introducing a very specific target hook and is recomputing the same "slowness" of opcodes over and over even though it doesn't depend on the context. Perhaps a more general subtarget array of "slow" opcodes would be a better choice, which Exynos could initialize based on its scheduling model for these opcodes if you think there is going to be differences in future CPUs.

AFAIK, the code performs table look ups, which should be fairly efficient. And, yes, just like there are differences in how well some loads and stores perform in M1 and M3, it's likely that more differences will come in their successors.

In D39976#1049844, @evandro wrote:

In D39976#1040345, @gberry wrote:

In D39976#1027478, @evandro wrote:

FeatureSlowPaired128 was just too coarse. The alternative would be to change it to something more specific, like FeatureSlowSomePaired128Sometimes, and then create yet another when for the next generation to specialize it further. Instead, querying the scheduling model seems to be a much more reasonable approach.

I'm more confused now. 'FeatureSlowPaired128' controls whether certain load/store opcodes are combined to form paired load/stores. But this change prevents some load/store opcodes from having their base register increment folded in. The two seem unrelated.

This change is more generic and flexible than FeatureSlowPaired128. This change controls not only when loads and stores are paired, but also other foldings that this pass performs, including the pre or post indexing of the offset register.

That's not what the code looks like it is doing. isReplacementProfitable() is only being called from mergeUpdateInsn(), which is only called when folding to form base register incrementing load/stores.

I'm also concerned that this change is introducing a very specific target hook and is recomputing the same "slowness" of opcodes over and over even though it doesn't depend on the context. Perhaps a more general subtarget array of "slow" opcodes would be a better choice, which Exynos could initialize based on its scheduling model for these opcodes if you think there is going to be differences in future CPUs.

AFAIK, the code performs table look ups, which should be fairly efficient. And, yes, just like there are differences in how well some loads and stores perform in M1 and M3, it's likely that more differences will come in their successors.

The patch also adds a function pointer call for each potential instruction pair to be optimized. I'm not that bothered by the compile-time impact of this, it just seems like too specific of a subtarget hook to be adding to me. I would appreciate hearing what other people think about this.

In D39976#1054542, @gberry wrote:

In D39976#1049844, @evandro wrote:

This change is more generic and flexible than FeatureSlowPaired128. This change controls not only when loads and stores are paired, but also other foldings that this pass performs, including the pre or post indexing of the offset register.

That's not what the code looks like it is doing. isReplacementProfitable() is only being called from mergeUpdateInsn(), which is only called when folding to form base register incrementing load/stores.

For now, as it's a generic enough interface to be used by many other peephole optimizations.

AFAIK, the code performs table look ups, which should be fairly efficient. And, yes, just like there are differences in how well some loads and stores perform in M1 and M3, it's likely that more differences will come in their successors.

The patch also adds a function pointer call for each potential instruction pair to be optimized. I'm not that bothered by the compile-time impact of this, it just seems like too specific of a subtarget hook to be adding to me. I would appreciate hearing what other people think about this.

So am I.

Thank you for the feedback.

Ping! 🔔

mcrosier resigned from this revision.Apr 11 2018, 11:29 AM

Revision Contents

Path

Size

llvm/

lib/

Target/

AArch64/

AArch64LoadStoreOptimizer.cpp

43 lines

AArch64Subtarget.h

10 lines

AArch64Subtarget.cpp

29 lines

test/

CodeGen/

AArch64/

ldst-opt.ll

184 lines

machine-outliner-remarks.ll

2 lines

machine-outliner.ll

2 lines

Diff 133518

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp

Show All 22 Lines
#include "llvm/Analysis/AliasAnalysis.h"		#include "llvm/Analysis/AliasAnalysis.h"
#include "llvm/CodeGen/MachineBasicBlock.h"		#include "llvm/CodeGen/MachineBasicBlock.h"
#include "llvm/CodeGen/MachineFunction.h"		#include "llvm/CodeGen/MachineFunction.h"
#include "llvm/CodeGen/MachineFunctionPass.h"		#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineInstr.h"		#include "llvm/CodeGen/MachineInstr.h"
#include "llvm/CodeGen/MachineInstrBuilder.h"		#include "llvm/CodeGen/MachineInstrBuilder.h"
#include "llvm/CodeGen/MachineOperand.h"		#include "llvm/CodeGen/MachineOperand.h"
#include "llvm/CodeGen/TargetRegisterInfo.h"		#include "llvm/CodeGen/TargetRegisterInfo.h"
		#include "llvm/CodeGen/TargetSchedule.h"
#include "llvm/IR/DebugLoc.h"		#include "llvm/IR/DebugLoc.h"
		#include "llvm/MC/MCInstrDesc.h"
#include "llvm/MC/MCRegisterInfo.h"		#include "llvm/MC/MCRegisterInfo.h"
		#include "llvm/MC/MCSchedule.h"
#include "llvm/Pass.h"		#include "llvm/Pass.h"
#include "llvm/Support/CommandLine.h"		#include "llvm/Support/CommandLine.h"
#include "llvm/Support/Debug.h"		#include "llvm/Support/Debug.h"
#include "llvm/Support/ErrorHandling.h"		#include "llvm/Support/ErrorHandling.h"
#include "llvm/Support/raw_ostream.h"		#include "llvm/Support/raw_ostream.h"
#include <cassert>		#include <cassert>
#include <cstdint>		#include <cstdint>
#include <iterator>		#include <iterator>
▲ Show 20 Lines • Show All 50 Lines • ▼ Show 20 Lines	struct AArch64LoadStoreOpt : public MachineFunctionPass {

AArch64LoadStoreOpt() : MachineFunctionPass(ID) {		AArch64LoadStoreOpt() : MachineFunctionPass(ID) {
initializeAArch64LoadStoreOptPass(*PassRegistry::getPassRegistry());		initializeAArch64LoadStoreOptPass(*PassRegistry::getPassRegistry());
}		}

AliasAnalysis *AA;		AliasAnalysis *AA;
const AArch64InstrInfo *TII;		const AArch64InstrInfo *TII;
const TargetRegisterInfo *TRI;		const TargetRegisterInfo *TRI;
		const TargetSubtargetInfo *STI;
const AArch64Subtarget *Subtarget;		const AArch64Subtarget *Subtarget;

// Track which registers have been modified and used.		// Track which registers have been modified and used.
BitVector ModifiedRegs, UsedRegs;		BitVector ModifiedRegs, UsedRegs;

		// Target has a cost model.
		bool HasCostModel;
		TargetSchedModel TSM;

		// Function is being optimized for code size.
		bool OptForMinSize;

void getAnalysisUsage(AnalysisUsage &AU) const override {		void getAnalysisUsage(AnalysisUsage &AU) const override {
AU.addRequired<AAResultsWrapperPass>();		AU.addRequired<AAResultsWrapperPass>();
MachineFunctionPass::getAnalysisUsage(AU);		MachineFunctionPass::getAnalysisUsage(AU);
}		}

// Scan the instructions looking for a load/store that can be combined		// Scan the instructions looking for a load/store that can be combined
// with the current instruction into a load/store pair.		// with the current instruction into a load/store pair.
// Return the matching instruction if one is found, else MBB->end().		// Return the matching instruction if one is found, else MBB->end().
Show All 37 Lines	struct AArch64LoadStoreOpt : public MachineFunctionPass {
MachineBasicBlock::iterator		MachineBasicBlock::iterator
findMatchingUpdateInsnBackward(MachineBasicBlock::iterator I, unsigned Limit);		findMatchingUpdateInsnBackward(MachineBasicBlock::iterator I, unsigned Limit);

// Find an instruction that updates the base register of the ld/st		// Find an instruction that updates the base register of the ld/st
// instruction.		// instruction.
bool isMatchingUpdateInsn(MachineInstr &MemMI, MachineInstr &MI,		bool isMatchingUpdateInsn(MachineInstr &MemMI, MachineInstr &MI,
unsigned BaseReg, int Offset);		unsigned BaseReg, int Offset);

		// Evaluate if the new instruction is a better choice than the old ones.
		bool isReplacementProfitable(unsigned New, unsigned OpA, unsigned OpB);

// Merge a pre- or post-index base register update into a ld/st instruction.		// Merge a pre- or post-index base register update into a ld/st instruction.
MachineBasicBlock::iterator		MachineBasicBlock::iterator
mergeUpdateInsn(MachineBasicBlock::iterator I,		mergeUpdateInsn(MachineBasicBlock::iterator I,
MachineBasicBlock::iterator Update, bool IsPreIdx);		MachineBasicBlock::iterator Update, bool IsPreIdx);

// Find and merge zero store instructions.		// Find and merge zero store instructions.
bool tryToMergeZeroStInst(MachineBasicBlock::iterator &MBBI);		bool tryToMergeZeroStInst(MachineBasicBlock::iterator &MBBI);

▲ Show 20 Lines • Show All 411 Lines • ▼ Show 20 Lines

static bool isPromotableZeroStoreInst(MachineInstr &MI) {		static bool isPromotableZeroStoreInst(MachineInstr &MI) {
unsigned Opc = MI.getOpcode();		unsigned Opc = MI.getOpcode();
return (Opc == AArch64::STRWui \|\| Opc == AArch64::STURWi \|\|		return (Opc == AArch64::STRWui \|\| Opc == AArch64::STURWi \|\|
isNarrowStore(Opc)) &&		isNarrowStore(Opc)) &&
getLdStRegOp(MI).getReg() == AArch64::WZR;		getLdStRegOp(MI).getReg() == AArch64::WZR;
}		}

static bool isPromotableLoadFromStore(MachineInstr &MI) {		static bool isPromotableLoadFromStore(MachineInstr &MI) {
		fhahnUnsubmitted Done Reply Inline Actions This is just moving some code from `optimizeBlock`? If that's unrelated having this in a separate patch would make it slightly easier to review IMO :) fhahn: This is just moving some code from `optimizeBlock`? If that's unrelated having this in a…
switch (MI.getOpcode()) {		switch (MI.getOpcode()) {
		mcrosierUnsubmitted Done Reply Inline Actions +1 to Florian's comment. Also, this comment should remain in optimizeBlock. mcrosier: +1 to Florian's comment. Also, this comment should remain in optimizeBlock.
default:		default:
return false;		return false;
// Scaled instructions.		// Scaled instructions.
case AArch64::LDRBBui:		case AArch64::LDRBBui:
case AArch64::LDRHHui:		case AArch64::LDRHHui:
case AArch64::LDRWui:		case AArch64::LDRWui:
case AArch64::LDRXui:		case AArch64::LDRXui:
// Unscaled instructions.		// Unscaled instructions.
Show All 40 Lines	static bool isMergeableLdStUpdate(MachineInstr &MI) {
case AArch64::LDPSi:		case AArch64::LDPSi:
case AArch64::LDPSWi:		case AArch64::LDPSWi:
case AArch64::LDPDi:		case AArch64::LDPDi:
case AArch64::LDPQi:		case AArch64::LDPQi:
case AArch64::LDPWi:		case AArch64::LDPWi:
case AArch64::LDPXi:		case AArch64::LDPXi:
case AArch64::STPSi:		case AArch64::STPSi:
case AArch64::STPDi:		case AArch64::STPDi:
case AArch64::STPQi:		case AArch64::STPQi:
		junbumlUnsubmitted Done Reply Inline Actions No need to repeat this check for every candiate. junbuml: No need to repeat this check for every candiate.
case AArch64::STPWi:		case AArch64::STPWi:
		mcrosierUnsubmitted Done Reply Inline Actions Rather than repeatedly call hasInstrSchedModel(), why not just store the result in a bool much like you did with OptForSize.. mcrosier: Rather than repeatedly call hasInstrSchedModel(), why not just store the result in a bool much…
case AArch64::STPXi:		case AArch64::STPXi:
// Make sure this is a reg+imm (as opposed to an address reloc).		// Make sure this is a reg+imm (as opposed to an address reloc).
if (!getLdStOffsetOp(MI).isImm())		if (!getLdStOffsetOp(MI).isImm())
		mcrosierUnsubmitted Done Reply Inline Actions You could just add SM and STI to the AArch64LoadStoreOpt class and then initialize it in runOnMachineFunction. mcrosier: You could just add SM and STI to the AArch64LoadStoreOpt class and then initialize it in…
return false;		return false;

return true;		return true;
}		}
}		}

		bool AArch64LoadStoreOpt::isReplacementProfitable
		(unsigned New, unsigned OpA, unsigned OpB) {
		// Default as profitable if optimizing for size or
		// in the absence of a cost model.
		if (OptForMinSize \|\| !HasCostModel) {
		DEBUG(dbgs() << "Evaluating instructions: replacement by default.\n");
		return true;
		}

		return (Subtarget->isReplacementProfitable(TSM, New, OpA, OpB));
		junbumlUnsubmitted Done Reply Inline Actions Can we guarantee these pointers are always none null? junbuml: Can we guarantee these pointers are always none null?
		evandroAuthorUnsubmitted Not Done Reply Inline Actions Yes because `getSchedClassDesc()` returns the address of a static array element. evandro: Yes because `getSchedClassDesc()` returns the address of a static array element.
		}

MachineBasicBlock::iterator		MachineBasicBlock::iterator
AArch64LoadStoreOpt::mergeNarrowZeroStores(MachineBasicBlock::iterator I,		AArch64LoadStoreOpt::mergeNarrowZeroStores(MachineBasicBlock::iterator I,
MachineBasicBlock::iterator MergeMI,		MachineBasicBlock::iterator MergeMI,
const LdStPairFlags &Flags) {		const LdStPairFlags &Flags) {
assert(isPromotableZeroStoreInst(I) && isPromotableZeroStoreInst(MergeMI) &&		assert(isPromotableZeroStoreInst(I) && isPromotableZeroStoreInst(MergeMI) &&
"Expected promotable zero stores.");		"Expected promotable zero stores.");

MachineBasicBlock::iterator NextI = I;		MachineBasicBlock::iterator NextI = I;
++NextI;		++NextI;
// If NextI is the second of the two instructions to be merged, we need		// If NextI is the second of the two instructions to be merged, we need
// to skip one further. Either way we merge will invalidate the iterator,		// to skip one further. Either way we merge will invalidate the iterator,
		mcrosierUnsubmitted Done Reply Inline Actions You can drop this else. mcrosier: You can drop this else.
// and we don't need to scan the new instruction, as it's a pairwise		// and we don't need to scan the new instruction, as it's a pairwise
// instruction, which we're not considering for further action anyway.		// instruction, which we're not considering for further action anyway.
		junbumlUnsubmitted Done Reply Inline Actions Isn't it possible to do just TSM.getNumMicroOps(&MIA) and remove above SCA = TSM.resolveSchedClass(&MIA) in line 678. junbuml:* Isn't it possible to do just TSM.getNumMicroOps(&MIA) and remove above *SCA = TSM.
		evandroAuthorUnsubmitted Done Reply Inline Actions Yes, it is. evandro: Yes, it is.
if (NextI == MergeMI)		if (NextI == MergeMI)
++NextI;		++NextI;

unsigned Opc = I->getOpcode();		unsigned Opc = I->getOpcode();
		junbumlUnsubmitted Done Reply Inline Actions Why don't we use <= , instead of < ? junbuml: Why don't we use <= , instead of < ?
		evandroAuthorUnsubmitted Done Reply Inline Actions Because typically decoding an instruction into multiple uops costs more in the front end of the pipeline than decoding multiple instructions into single uops. evandro: Because typically decoding an instruction into multiple uops costs more in the front end of the…
		junbumlUnsubmitted Not Done Reply Inline Actions Can you add more comment why you do this? Is this target independent ? junbuml: Can you add more comment why you do this? Is this target independent ?
		evandroAuthorUnsubmitted Not Done Reply Inline Actions I'm not a hardware designer, but AFAIK many targets hiccup when an instruction is decoded into more than one uop. How bad the hiccup is, if at all, does depend on the target though. Some decrease the decode bandwidth, typically by inserting a bubble whose size depends on the design. This heuristic is an attempt at mitigating the new instruction inducing such a bubble. If the new instruction has a shorter latency, then it's chosen. One might wonder if it's still a good choice if it induces a bubble, but I could not devise a satisfying heuristic. If the latency of the new instruction is the same as the combined latency of the both old ones, then the potential of inducing a bubble is considered. If either of the old instructions had multiple uops, then even if the new one has them too it's probably no worse than before. However, if neither of the old instructions resulted in multiple uops, the new one is chosen only if it results in fewer uops than before. One might argue that, if bubbles when decoding into multiple uops are the norm among targets, it'd be better to choose the new instruction only if it doesn't potentially induce bubbles itself. If the new instruction has a longer latency, then it's discarded. Again, if it mitigates decode bubbles it might still be profitable, but the conditions seem hard to weigh in general. evandro: I'm not a hardware designer, but AFAIK many targets hiccup when an instruction is decoded into…
		gberryUnsubmitted Not Done Reply Inline Actions It seems like if you want to take this multi-uop bubble into account, you should make it explicit, and make the subtarget opt in to it. The simple heuristic here to me would be: if (newLatency < oldLatency) return true; if (newLatency > oldLatency) return false; if (newUops < oldUops) return true; if (newUops > oldUops) return false; if (newMultiUopPenalty < oldMultiUopPenalty) return true; return false; gberry: It seems like if you want to take this multi-uop bubble into account, you should make it…
		evandroAuthorUnsubmitted Not Done Reply Inline Actions How do you propose that the new penalty be calculated? evandro: How do you propose that the new penalty be calculated?
bool IsScaled = !TII->isUnscaledLdSt(Opc);		bool IsScaled = !TII->isUnscaledLdSt(Opc);
int OffsetStride = IsScaled ? 1 : getMemScale(*I);		int OffsetStride = IsScaled ? 1 : getMemScale(*I);

bool MergeForward = Flags.getMergeForward();		bool MergeForward = Flags.getMergeForward();
// Insert our new paired instruction after whichever of the paired		// Insert our new paired instruction after whichever of the paired
// instructions MergeForward indicates.		// instructions MergeForward indicates.
MachineBasicBlock::iterator InsertionPoint = MergeForward ? MergeMI : I;		MachineBasicBlock::iterator InsertionPoint = MergeForward ? MergeMI : I;
// Also based on MergeForward is from where we copy the base register operand		// Also based on MergeForward is from where we copy the base register operand
// so we get the flags compatible with the input code.		// so we get the flags compatible with the input code.
const MachineOperand &BaseRegOp =		const MachineOperand &BaseRegOp =
MergeForward ? getLdStBaseOp(MergeMI) : getLdStBaseOp(I);		MergeForward ? getLdStBaseOp(MergeMI) : getLdStBaseOp(I);

		junbumlUnsubmitted Not Done Reply Inline Actions I'm not sure if it's okay to compare values from computeInstrLatency() and getNumMicroOps() ? junbuml: I'm not sure if it's okay to compare values from computeInstrLatency() and getNumMicroOps() ?
		evandroAuthorUnsubmitted Not Done Reply Inline Actions That's what the heuristic does, as briefly explained in the debug message below. evandro: That's what the heuristic does, as briefly explained in the debug message below.
		gberryUnsubmitted Not Done Reply Inline Actions I think Jun's point was that these values are in completely different units, so comparing them doesn't make sense. gberry: I think Jun's point was that these values are in completely different units, so comparing them…
		evandroAuthorUnsubmitted Not Done Reply Inline Actions It´s very common in process control to use one variable to effect another, as long as there´s at least an implied correlation between them. evandro: It´s very common in process control to use one variable to effect another, as long as there´s…
// Which register is Rt and which is Rt2 depends on the offset order.		// Which register is Rt and which is Rt2 depends on the offset order.
MachineInstr *RtMI;		MachineInstr *RtMI;
if (getLdStOffsetOp(*I).getImm() ==		if (getLdStOffsetOp(*I).getImm() ==
getLdStOffsetOp(*MergeMI).getImm() + OffsetStride)		getLdStOffsetOp(*MergeMI).getImm() + OffsetStride)
RtMI = &*MergeMI;		RtMI = &*MergeMI;
else		else
RtMI = &*I;		RtMI = &*I;
		gberryUnsubmitted Not Done Reply Inline Actions I would suggest that unless we know of a case where this makes a difference, make the simpler change of just doing return UopDif <= 0 at this point, assuming that has the desired effect for your target. This assumes that with latency and uops being equal, a reduction in instructions could still be beneficial. gberry: I would suggest that unless we know of a case where this makes a difference, make the simpler…
		evandroAuthorUnsubmitted Not Done Reply Inline Actions This would lead to the test below, `stream-neon.ll`, to fail for the Exynos targets. The reason is that, for instance, though the latency of loading a pair of quad registers with the post index addressing mode has the same latency as when using the register offset addressing mode, it is a two uop instruction, which severely limits the decode bandwidth by inserting a bubble. This is a fairly common issue in other targets too, which this heuristic tries to capture, assuming that it's better to keep the decode bandwidth for a couple single uop instructions than incurring a decode bubble for a multiple uop instruction. Except, of course, when optimizing for size. evandro: This would lead to the test below, `stream-neon.ll`, to fail for the Exynos targets. The…

int OffsetImm = getLdStOffsetOp(*RtMI).getImm();		int OffsetImm = getLdStOffsetOp(*RtMI).getImm();
// Change the scaled offset from small to large type.		// Change the scaled offset from small to large type.
if (IsScaled) {		if (IsScaled) {
assert(((OffsetImm & 1) == 0) && "Unexpected offset to merge");		assert(((OffsetImm & 1) == 0) && "Unexpected offset to merge");
OffsetImm /= 2;		OffsetImm /= 2;
}		}

▲ Show 20 Lines • Show All 642 Lines • ▼ Show 20 Lines	AArch64LoadStoreOpt::mergeUpdateInsn(MachineBasicBlock::iterator I,
int Value = Update->getOperand(2).getImm();		int Value = Update->getOperand(2).getImm();
assert(AArch64_AM::getShiftValue(Update->getOperand(3).getImm()) == 0 &&		assert(AArch64_AM::getShiftValue(Update->getOperand(3).getImm()) == 0 &&
"Can't merge 1 << 12 offset into pre-/post-indexed load / store");		"Can't merge 1 << 12 offset into pre-/post-indexed load / store");
if (Update->getOpcode() == AArch64::SUBXri)		if (Update->getOpcode() == AArch64::SUBXri)
Value = -Value;		Value = -Value;

unsigned NewOpc = IsPreIdx ? getPreIndexedOpcode(I->getOpcode())		unsigned NewOpc = IsPreIdx ? getPreIndexedOpcode(I->getOpcode())
: getPostIndexedOpcode(I->getOpcode());		: getPostIndexedOpcode(I->getOpcode());

		mcrosierUnsubmitted Done Reply Inline Actions Add a comment: // Evaluate if the new instruction is a better choice than the old ones. mcrosier: Add a comment: // Evaluate if the new instruction is a better choice than the old ones.
		// Evaluate if the new instruction is a better choice than both old ones.
		mcrosierUnsubmitted Done Reply Inline Actions I think you can return NextI here. This will ensure we skip the Update instruction (i.e., the base address add/sub), which is never a candidate for LdStUpdateMerging. mcrosier: I think you can return NextI here. This will ensure we skip the Update instruction (i.e., the…
		if (!isReplacementProfitable(NewOpc, I->getOpcode(), Update->getOpcode()))
		return NextI;

MachineInstrBuilder MIB;		MachineInstrBuilder MIB;
if (!isPairedLdSt(*I)) {		if (!isPairedLdSt(*I)) {
// Non-paired instruction.		// Non-paired instruction.
MIB = BuildMI(*I->getParent(), I, I->getDebugLoc(), TII->get(NewOpc))		MIB = BuildMI(*I->getParent(), I, I->getDebugLoc(), TII->get(NewOpc))
.add(getLdStRegOp(*Update))		.add(getLdStRegOp(*Update))
.add(getLdStRegOp(*I))		.add(getLdStRegOp(*I))
.add(getLdStBaseOp(*I))		.add(getLdStBaseOp(*I))
.addImm(Value)		.addImm(Value)
Show All 9 Lines	MIB = BuildMI(*I->getParent(), I, I->getDebugLoc(), TII->get(NewOpc))
.addImm(Value / Scale)		.addImm(Value / Scale)
.setMemRefs(I->memoperands_begin(), I->memoperands_end());		.setMemRefs(I->memoperands_begin(), I->memoperands_end());
}		}
(void)MIB;		(void)MIB;

if (IsPreIdx) {		if (IsPreIdx) {
++NumPreFolded;		++NumPreFolded;
DEBUG(dbgs() << "Creating pre-indexed load/store.");		DEBUG(dbgs() << "Creating pre-indexed load/store.");
} else {		} else {
++NumPostFolded;		++NumPostFolded;
		mcrosierUnsubmitted Done Reply Inline Actions Should be } else { mcrosier: Should be } else {
DEBUG(dbgs() << "Creating post-indexed load/store.");		DEBUG(dbgs() << "Creating post-indexed load/store.");
}		}
DEBUG(dbgs() << " Replacing instructions:\n ");		DEBUG(dbgs() << " Replacing instructions:\n ");
DEBUG(I->print(dbgs()));		DEBUG(I->print(dbgs()));
DEBUG(dbgs() << " ");		DEBUG(dbgs() << " ");
DEBUG(Update->print(dbgs()));		DEBUG(Update->print(dbgs()));
DEBUG(dbgs() << " with instruction:\n ");		DEBUG(dbgs() << " with instruction:\n ");
DEBUG(((MachineInstr *)MIB)->print(dbgs()));		DEBUG(((MachineInstr *)MIB)->print(dbgs()));
DEBUG(dbgs() << "\n");		DEBUG(dbgs() << "\n");

// Erase the old instructions for the block.		// Erase the old instructions for the block.
▲ Show 20 Lines • Show All 244 Lines • ▼ Show 20 Lines	if (Paired != E) {
// us what the next instruction is after it's done mucking about.		// us what the next instruction is after it's done mucking about.
MBBI = mergePairedInsns(MBBI, Paired, Flags);		MBBI = mergePairedInsns(MBBI, Paired, Flags);
return true;		return true;
}		}
return false;		return false;
}		}

bool AArch64LoadStoreOpt::tryToMergeLdStUpdate		bool AArch64LoadStoreOpt::tryToMergeLdStUpdate
(MachineBasicBlock::iterator &MBBI) {		(MachineBasicBlock::iterator &MBBI) {
		fhahnUnsubmitted Done Reply Inline Actions This is just moving some code from `optimizeBlock`? fhahn: This is just moving some code from `optimizeBlock`?
MachineInstr &MI = *MBBI;		MachineInstr &MI = *MBBI;
MachineBasicBlock::iterator E = MI.getParent()->end();		MachineBasicBlock::iterator E = MI.getParent()->end();
MachineBasicBlock::iterator Update;		MachineBasicBlock::iterator Update;

// Look forward to try to form a post-index instruction. For example,		// Look forward to try to form a post-index instruction. For example,
// ldr x0, [x20]		// ldr x0, [x20]
// add x20, x20, #32		// add x20, x20, #32
// merged into:		// merged into:
▲ Show 20 Lines • Show All 116 Lines • ▼ Show 20 Lines	bool AArch64LoadStoreOpt::runOnMachineFunction(MachineFunction &Fn) {
if (skipFunction(Fn.getFunction()))		if (skipFunction(Fn.getFunction()))
return false;		return false;

Subtarget = &static_cast<const AArch64Subtarget &>(Fn.getSubtarget());		Subtarget = &static_cast<const AArch64Subtarget &>(Fn.getSubtarget());
TII = static_cast<const AArch64InstrInfo *>(Subtarget->getInstrInfo());		TII = static_cast<const AArch64InstrInfo *>(Subtarget->getInstrInfo());
TRI = Subtarget->getRegisterInfo();		TRI = Subtarget->getRegisterInfo();
AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();		AA = &getAnalysis<AAResultsWrapperPass>().getAAResults();

		OptForMinSize = Fn.getFunction().optForMinSize();

		const TargetSubtargetInfo &STI = Fn.getSubtarget();
		TSM.init(STI.getSchedModel(), &STI, STI.getInstrInfo());
		// TODO: For now, only support targets with a scheduling model. In order to
		// support a target that has itineraries instead, then
		// isReplacementProfitable() has to be modified to calculate the latency
		// and the number of uops.
		HasCostModel = TSM.hasInstrSchedModel();

// Resize the modified and used register bitfield trackers. We do this once		// Resize the modified and used register bitfield trackers. We do this once
// per function and then clear the bitfield each time we optimize a load or		// per function and then clear the bitfield each time we optimize a load or
// store.		// store.
ModifiedRegs.resize(TRI->getNumRegs());		ModifiedRegs.resize(TRI->getNumRegs());
UsedRegs.resize(TRI->getNumRegs());		UsedRegs.resize(TRI->getNumRegs());

bool Modified = false;		bool Modified = false;
bool enableNarrowZeroStOpt = !Subtarget->requiresStrictAlign();		bool enableNarrowZeroStOpt = !Subtarget->requiresStrictAlign();
Show All 21 Lines

llvm/lib/Target/AArch64/AArch64Subtarget.h

Show First 20 Lines • Show All 121 Lines • ▼ Show 20 Lines	protected:
uint16_t PrefetchDistance = 0;		uint16_t PrefetchDistance = 0;
uint16_t MinPrefetchStride = 1;		uint16_t MinPrefetchStride = 1;
unsigned MaxPrefetchIterationsAhead = UINT_MAX;		unsigned MaxPrefetchIterationsAhead = UINT_MAX;
unsigned PrefFunctionAlignment = 0;		unsigned PrefFunctionAlignment = 0;
unsigned PrefLoopAlignment = 0;		unsigned PrefLoopAlignment = 0;
unsigned MaxJumpTableSize = 0;		unsigned MaxJumpTableSize = 0;
unsigned WideningBaseCost = 0;		unsigned WideningBaseCost = 0;

		bool (*IsReplacementProfitable)(const TargetSchedModel &,
		unsigned, unsigned, unsigned) = nullptr;

// ReserveX18 - X18 is not available as a general purpose register.		// ReserveX18 - X18 is not available as a general purpose register.
bool ReserveX18;		bool ReserveX18;

bool IsLittle;		bool IsLittle;

/// TargetTriple - What processor and OS we're targeting.		/// TargetTriple - What processor and OS we're targeting.
Triple TargetTriple;		Triple TargetTriple;

▲ Show 20 Lines • Show All 161 Lines • ▼ Show 20 Lines	switch (TLInfo.getTargetMachine().getCodeModel()) {
// where it is the same as Small for almost all purposes.		// where it is the same as Small for almost all purposes.
case CodeModel::Small:		case CodeModel::Small:
return true;		return true;
default:		default:
return false;		return false;
}		}
}		}

		bool isReplacementProfitable(const TargetSchedModel &TSM,
		unsigned New, unsigned OpA, unsigned OpB) const {
		if (IsReplacementProfitable)
		return IsReplacementProfitable(TSM, New, OpA, OpB);

		return true;
		}
/// ParseSubtargetFeatures - Parses features string setting specified		/// ParseSubtargetFeatures - Parses features string setting specified
/// subtarget options. Definition of function is auto generated by tblgen.		/// subtarget options. Definition of function is auto generated by tblgen.
void ParseSubtargetFeatures(StringRef CPU, StringRef FS);		void ParseSubtargetFeatures(StringRef CPU, StringRef FS);

/// ClassifyGlobalReference - Find the target operand flags that describe		/// ClassifyGlobalReference - Find the target operand flags that describe
/// how a global value should be referenced for the current subtarget.		/// how a global value should be referenced for the current subtarget.
unsigned char ClassifyGlobalReference(const GlobalValue *GV,		unsigned char ClassifyGlobalReference(const GlobalValue *GV,
const TargetMachine &TM) const;		const TargetMachine &TM) const;
Show All 27 Lines

llvm/lib/Target/AArch64/AArch64Subtarget.cpp

Show First 20 Lines • Show All 41 Lines • ▼ Show 20 Lines
UseAddressTopByteIgnored("aarch64-use-tbi", cl::desc("Assume that top byte of "		UseAddressTopByteIgnored("aarch64-use-tbi", cl::desc("Assume that top byte of "
"an address is ignored"), cl::init(false), cl::Hidden);		"an address is ignored"), cl::init(false), cl::Hidden);

static cl::opt<bool>		static cl::opt<bool>
UseNonLazyBind("aarch64-enable-nonlazybind",		UseNonLazyBind("aarch64-enable-nonlazybind",
cl::desc("Call nonlazybind functions via direct GOT load"),		cl::desc("Call nonlazybind functions via direct GOT load"),
cl::init(false), cl::Hidden);		cl::init(false), cl::Hidden);

		static bool IsExynosReplacementProfitable(const TargetSchedModel &TSM,
		unsigned New,
		unsigned OpA, unsigned OpB) {
		auto *TII = static_cast
		<const AArch64InstrInfo *>(TSM.getSubtargetInfo()->getInstrInfo());
		auto *SM = TSM.getMCSchedModel();
		auto *SCN = SM->getSchedClassDesc(TII->get(New).getSchedClass()),
		*SCA = SM->getSchedClassDesc(TII->get(OpA).getSchedClass()),
		*SCB = SM->getSchedClassDesc(TII->get(OpB).getSchedClass());
		long UopN = SCN->NumMicroOps,
		UopA = SCA->NumMicroOps,
		UopB = SCB->NumMicroOps;

		// The replacement instr is profitable if it is simpler.
		if (UopN < UopA + UopB)
		return true;
		if (UopN > UopA + UopB)
		return false;

		// The replacement instr is profitable if it is as complex.
		if (UopA > 1 \|\| UopB > 1)
		return true;

		// The replacement instr is not profitable.
		return false;
		}

AArch64Subtarget &		AArch64Subtarget &
AArch64Subtarget::initializeSubtargetDependencies(StringRef FS,		AArch64Subtarget::initializeSubtargetDependencies(StringRef FS,
StringRef CPUString) {		StringRef CPUString) {
// Determine default and user-specified characteristics		// Determine default and user-specified characteristics

if (CPUString.empty())		if (CPUString.empty())
CPUString = "generic";		CPUString = "generic";

Show All 14 Lines	case Cyclone:
MinPrefetchStride = 2048;		MinPrefetchStride = 2048;
MaxPrefetchIterationsAhead = 3;		MaxPrefetchIterationsAhead = 3;
break;		break;
case CortexA57:		case CortexA57:
MaxInterleaveFactor = 4;		MaxInterleaveFactor = 4;
PrefFunctionAlignment = 4;		PrefFunctionAlignment = 4;
break;		break;
case ExynosM1:		case ExynosM1:
		IsReplacementProfitable = IsExynosReplacementProfitable;
MaxInterleaveFactor = 4;		MaxInterleaveFactor = 4;
MaxJumpTableSize = 8;		MaxJumpTableSize = 8;
PrefFunctionAlignment = 4;		PrefFunctionAlignment = 4;
PrefLoopAlignment = 3;		PrefLoopAlignment = 3;
break;		break;
case ExynosM3:		case ExynosM3:
		IsReplacementProfitable = IsExynosReplacementProfitable;
MaxInterleaveFactor = 4;		MaxInterleaveFactor = 4;
MaxJumpTableSize = 20;		MaxJumpTableSize = 20;
PrefFunctionAlignment = 5;		PrefFunctionAlignment = 5;
PrefLoopAlignment = 4;		PrefLoopAlignment = 4;
break;		break;
case Falkor:		case Falkor:
MaxInterleaveFactor = 4;		MaxInterleaveFactor = 4;
// FIXME: remove this to enable 64-bit SLP if performance looks good.		// FIXME: remove this to enable 64-bit SLP if performance looks good.
▲ Show 20 Lines • Show All 178 Lines • Show Last 20 Lines

llvm/test/CodeGen/AArch64/ldst-opt.ll

; RUN: llc -mtriple=aarch64-linux-gnu -aarch64-enable-atomic-cfg-tidy=0 -disable-lsr -verify-machineinstrs -o - %s \| FileCheck --check-prefix=CHECK --check-prefix=NOSTRICTALIGN %s		; RUN: llc -mtriple=aarch64-linux-gnu -aarch64-enable-atomic-cfg-tidy=0 -disable-lsr -verify-machineinstrs -o - %s \| FileCheck %s --check-prefixes=CHECK,GENERIC,NOSTRICTALIGN
; RUN: llc -mtriple=aarch64-linux-gnu -mattr=+strict-align -aarch64-enable-atomic-cfg-tidy=0 -disable-lsr -verify-machineinstrs -o - %s \| FileCheck --check-prefix=CHECK --check-prefix=STRICTALIGN %s		; RUN: llc -mtriple=aarch64-linux-gnu -aarch64-enable-atomic-cfg-tidy=0 -disable-lsr -verify-machineinstrs -mattr=+strict-align -o - %s \| FileCheck %s --check-prefixes=CHECK,GENERIC,STRICTALIGN
		; RUN: llc -mtriple=aarch64-linux-gnu -aarch64-enable-atomic-cfg-tidy=0 -disable-lsr -verify-machineinstrs -mcpu=exynos-m1 -o - %s \| FileCheck %s --check-prefixes=CHECK,EXYNOS,EXYNOSM1
		; RUN: llc -mtriple=aarch64-linux-gnu -aarch64-enable-atomic-cfg-tidy=0 -disable-lsr -verify-machineinstrs -mcpu=exynos-m3 -o - %s \| FileCheck %s --check-prefixes=CHECK,EXYNOS,EXYNOSM3,NOSTRICTALIGN

; This file contains tests for the AArch64 load/store optimizer.		; This file contains tests for the AArch64 load/store optimizer.

%padding = type { i8, i8, i8, i8 }		%padding = type { i8, i8, i8, i8 }
%s.byte = type { i8, i8 }		%s.byte = type { i8, i8 }
%s.halfword = type { i16, i16 }		%s.halfword = type { i16, i16 }
%s.word = type { i32, i32 }		%s.word = type { i32, i32 }
%s.doubleword = type { i64, i32 }		%s.doubleword = type { i64, i64 }
%s.quadword = type { fp128, i32 }		%s.quadword = type { fp128, fp128 }
%s.float = type { float, i32 }		%s.float = type { float, i32 }
%s.double = type { double, i32 }		%s.double = type { double, i32 }
%struct.byte = type { %padding, %s.byte }		%struct.byte = type { %padding, %s.byte }
%struct.halfword = type { %padding, %s.halfword }		%struct.halfword = type { %padding, %s.halfword }
%struct.word = type { %padding, %s.word }		%struct.word = type { %padding, %s.word }
%struct.doubleword = type { %padding, %s.doubleword }		%struct.doubleword = type { %padding, %s.doubleword }
%struct.quadword = type { %padding, %s.quadword }		%struct.quadword = type { %padding, %s.quadword }
%struct.float = type { %padding, %s.float }		%struct.float = type { %padding, %s.float }
▲ Show 20 Lines • Show All 211 Lines • ▼ Show 20 Lines
; ...		; ...
; add x0, x0, #32		; add x0, x0, #32
; ->		; ->
; (ldp\|stp) w1, w2, [x0, #32]!		; (ldp\|stp) w1, w2, [x0, #32]!
;		;

define void @load-pair-pre-indexed-word(%struct.word* %ptr) nounwind {		define void @load-pair-pre-indexed-word(%struct.word* %ptr) nounwind {
; CHECK-LABEL: load-pair-pre-indexed-word		; CHECK-LABEL: load-pair-pre-indexed-word
; CHECK: ldp w{{[0-9]+}}, w{{[0-9]+}}, [x0, #32]!		; GENERIC: ldp w{{[0-9]+}}, w{{[0-9]+}}, [x0, #32]!
; CHECK-NOT: add x0, x0, #32		; GENERIC-NOT: add x0, x0, #32
		; EXYNOS: ldp w{{[0-9]+}}, w{{[0-9]+}}, [x0, #32]{{$}}
		; EXYNOS: add x0, x0, #32
entry:		entry:
%a = getelementptr inbounds %struct.word, %struct.word* %ptr, i64 0, i32 1, i32 0		%a = getelementptr inbounds %struct.word, %struct.word* %ptr, i64 0, i32 1, i32 0
%a1 = load i32, i32* %a, align 4		%a1 = load i32, i32* %a, align 4
%b = getelementptr inbounds %struct.word, %struct.word* %ptr, i64 0, i32 1, i32 1		%b = getelementptr inbounds %struct.word, %struct.word* %ptr, i64 0, i32 1, i32 1
%b1 = load i32, i32* %b, align 4		%b1 = load i32, i32* %b, align 4
%add = add i32 %a1, %b1		%add = add i32 %a1, %b1
br label %bar		br label %bar
bar:		bar:
Show All 13 Lines	entry:
store i32 %val, i32* %b, align 4		store i32 %val, i32* %b, align 4
br label %bar		br label %bar
bar:		bar:
%c = getelementptr inbounds %struct.word, %struct.word* %ptr, i64 0, i32 1		%c = getelementptr inbounds %struct.word, %struct.word* %ptr, i64 0, i32 1
tail call void @bar_word(%s.word* %c, i32 %val)		tail call void @bar_word(%s.word* %c, i32 %val)
ret void		ret void
}		}

		define void @load-pair-pre-indexed-doubleword(%struct.doubleword* %ptr) nounwind {
		; CHECK-LABEL: load-pair-pre-indexed-doubleword
		; GENERIC: ldp x{{[0-9]+}}, x{{[0-9]+}}, [x0, #32]!
		; GENERIC-NOT: add x0, x0, #32
		; EXYNOS: ldp x{{[0-9]+}}, x{{[0-9]+}}, [x0, #32]{{$}}
		; EXYNOS: add x0, x0, #32
		entry:
		%a = getelementptr inbounds %struct.doubleword, %struct.doubleword* %ptr, i64 0, i32 1, i32 0
		%a1 = load i64, i64* %a, align 8
		%b = getelementptr inbounds %struct.doubleword, %struct.doubleword* %ptr, i64 0, i32 1, i32 1
		%b1 = load i64, i64* %b, align 8
		%add = add i64 %a1, %b1
		br label %bar
		bar:
		%c = getelementptr inbounds %struct.doubleword, %struct.doubleword* %ptr, i64 0, i32 1
		tail call void @bar_doubleword(%s.doubleword* %c, i64 %add)
		ret void
		}

		define void @store-pair-pre-indexed-doubleword(%struct.doubleword* %ptr, i64 %val) nounwind {
		; CHECK-LABEL: store-pair-pre-indexed-doubleword
		; CHECK: stp x{{[0-9]+}}, x{{[0-9]+}}, [x0, #32]!
		; CHECK-NOT: add x0, x0, #32
		entry:
		%a = getelementptr inbounds %struct.doubleword, %struct.doubleword* %ptr, i64 0, i32 1, i32 0
		store i64 %val, i64* %a, align 8
		%b = getelementptr inbounds %struct.doubleword, %struct.doubleword* %ptr, i64 0, i32 1, i32 1
		store i64 %val, i64* %b, align 8
		br label %bar
		bar:
		%c = getelementptr inbounds %struct.doubleword, %struct.doubleword* %ptr, i64 0, i32 1
		tail call void @bar_doubleword(%s.doubleword* %c, i64 %val)
		ret void
		}

; Check the following transform:		; Check the following transform:
;		;
; add x8, x8, #16		; add x8, x8, #16
; ...		; ...
; ldr X, [x8]		; ldr X, [x8]
; ->		; ->
; ldr X, [x8, #16]!		; ldr X, [x8, #16]!
;		;
▲ Show 20 Lines • Show All 748 Lines • ▼ Show 20 Lines
; ...		; ...
; add x20, x20, #32		; add x20, x20, #32
; ->		; ->
; stp w0, [x20], #32		; stp w0, [x20], #32

define void @store-pair-post-indexed-word() nounwind {		define void @store-pair-post-indexed-word() nounwind {
; CHECK-LABEL: store-pair-post-indexed-word		; CHECK-LABEL: store-pair-post-indexed-word
; CHECK: stp w{{[0-9]+}}, w{{[0-9]+}}, [sp], #16		; CHECK: stp w{{[0-9]+}}, w{{[0-9]+}}, [sp], #16
; CHECK: ret
gberryUnsubmitted Not Done Reply Inline Actions Are these changes intentional? gberry: Are these changes intentional?
evandroAuthorUnsubmitted Not Done Reply Inline Actions Yes, though not necessary. Checking for the return is not germane to the feature being tested. evandro: Yes, though not necessary. Checking for the return is not germane to the feature being tested.
%src = alloca { i32, i32 }, align 8		%src = alloca { i32, i32 }, align 8
%dst = alloca { i32, i32 }, align 8		%dst = alloca { i32, i32 }, align 8

%src.realp = getelementptr inbounds { i32, i32 }, { i32, i32 }* %src, i32 0, i32 0		%src.realp = getelementptr inbounds { i32, i32 }, { i32, i32 }* %src, i32 0, i32 0
%src.real = load i32, i32* %src.realp		%src.real = load i32, i32* %src.realp
%src.imagp = getelementptr inbounds { i32, i32 }, { i32, i32 }* %src, i32 0, i32 1		%src.imagp = getelementptr inbounds { i32, i32 }, { i32, i32 }* %src, i32 0, i32 1
%src.imag = load i32, i32* %src.imagp		%src.imag = load i32, i32* %src.imagp

%dst.realp = getelementptr inbounds { i32, i32 }, { i32, i32 }* %dst, i32 0, i32 0		%dst.realp = getelementptr inbounds { i32, i32 }, { i32, i32 }* %dst, i32 0, i32 0
%dst.imagp = getelementptr inbounds { i32, i32 }, { i32, i32 }* %dst, i32 0, i32 1		%dst.imagp = getelementptr inbounds { i32, i32 }, { i32, i32 }* %dst, i32 0, i32 1
store i32 %src.real, i32* %dst.realp		store i32 %src.real, i32* %dst.realp
store i32 %src.imag, i32* %dst.imagp		store i32 %src.imag, i32* %dst.imagp
ret void		ret void
}		}

define void @store-pair-post-indexed-doubleword() nounwind {		define void @store-pair-post-indexed-doubleword() nounwind {
; CHECK-LABEL: store-pair-post-indexed-doubleword		; CHECK-LABEL: store-pair-post-indexed-doubleword
; CHECK: stp x{{[0-9]+}}, x{{[0-9]+}}, [sp], #32		; CHECK: stp x{{[0-9]+}}, x{{[0-9]+}}, [sp], #32
; CHECK: ret
%src = alloca { i64, i64 }, align 8		%src = alloca { i64, i64 }, align 8
%dst = alloca { i64, i64 }, align 8		%dst = alloca { i64, i64 }, align 8

%src.realp = getelementptr inbounds { i64, i64 }, { i64, i64 }* %src, i32 0, i32 0		%src.realp = getelementptr inbounds { i64, i64 }, { i64, i64 }* %src, i32 0, i32 0
%src.real = load i64, i64* %src.realp		%src.real = load i64, i64* %src.realp
%src.imagp = getelementptr inbounds { i64, i64 }, { i64, i64 }* %src, i32 0, i32 1		%src.imagp = getelementptr inbounds { i64, i64 }, { i64, i64 }* %src, i32 0, i32 1
%src.imag = load i64, i64* %src.imagp		%src.imag = load i64, i64* %src.imagp

%dst.realp = getelementptr inbounds { i64, i64 }, { i64, i64 }* %dst, i32 0, i32 0		%dst.realp = getelementptr inbounds { i64, i64 }, { i64, i64 }* %dst, i32 0, i32 0
%dst.imagp = getelementptr inbounds { i64, i64 }, { i64, i64 }* %dst, i32 0, i32 1		%dst.imagp = getelementptr inbounds { i64, i64 }, { i64, i64 }* %dst, i32 0, i32 1
store i64 %src.real, i64* %dst.realp		store i64 %src.real, i64* %dst.realp
store i64 %src.imag, i64* %dst.imagp		store i64 %src.imag, i64* %dst.imagp
ret void		ret void
}		}

define void @store-pair-post-indexed-float() nounwind {		define void @store-pair-post-indexed-float() nounwind {
; CHECK-LABEL: store-pair-post-indexed-float		; CHECK-LABEL: store-pair-post-indexed-float
; CHECK: stp s{{[0-9]+}}, s{{[0-9]+}}, [sp], #16		; CHECK: stp s{{[0-9]+}}, s{{[0-9]+}}, [sp], #16
; CHECK: ret
%src = alloca { float, float }, align 8		%src = alloca { float, float }, align 8
%dst = alloca { float, float }, align 8		%dst = alloca { float, float }, align 8

%src.realp = getelementptr inbounds { float, float }, { float, float }* %src, i32 0, i32 0		%src.realp = getelementptr inbounds { float, float }, { float, float }* %src, i32 0, i32 0
%src.real = load float, float* %src.realp		%src.real = load float, float* %src.realp
%src.imagp = getelementptr inbounds { float, float }, { float, float }* %src, i32 0, i32 1		%src.imagp = getelementptr inbounds { float, float }, { float, float }* %src, i32 0, i32 1
%src.imag = load float, float* %src.imagp		%src.imag = load float, float* %src.imagp

%dst.realp = getelementptr inbounds { float, float }, { float, float }* %dst, i32 0, i32 0		%dst.realp = getelementptr inbounds { float, float }, { float, float }* %dst, i32 0, i32 0
%dst.imagp = getelementptr inbounds { float, float }, { float, float }* %dst, i32 0, i32 1		%dst.imagp = getelementptr inbounds { float, float }, { float, float }* %dst, i32 0, i32 1
store float %src.real, float* %dst.realp		store float %src.real, float* %dst.realp
store float %src.imag, float* %dst.imagp		store float %src.imag, float* %dst.imagp
ret void		ret void
}		}

define void @store-pair-post-indexed-double() nounwind {		define void @store-pair-post-indexed-double() nounwind {
; CHECK-LABEL: store-pair-post-indexed-double		; CHECK-LABEL: store-pair-post-indexed-double
; CHECK: stp d{{[0-9]+}}, d{{[0-9]+}}, [sp], #32		; CHECK: stp d{{[0-9]+}}, d{{[0-9]+}}, [sp], #32
; CHECK: ret
%src = alloca { double, double }, align 8		%src = alloca { double, double }, align 8
%dst = alloca { double, double }, align 8		%dst = alloca { double, double }, align 8

%src.realp = getelementptr inbounds { double, double }, { double, double }* %src, i32 0, i32 0		%src.realp = getelementptr inbounds { double, double }, { double, double }* %src, i32 0, i32 0
%src.real = load double, double* %src.realp		%src.real = load double, double* %src.realp
%src.imagp = getelementptr inbounds { double, double }, { double, double }* %src, i32 0, i32 1		%src.imagp = getelementptr inbounds { double, double }, { double, double }* %src, i32 0, i32 1
%src.imag = load double, double* %src.imagp		%src.imag = load double, double* %src.imagp

%dst.realp = getelementptr inbounds { double, double }, { double, double }* %dst, i32 0, i32 0		%dst.realp = getelementptr inbounds { double, double }, { double, double }* %dst, i32 0, i32 0
%dst.imagp = getelementptr inbounds { double, double }, { double, double }* %dst, i32 0, i32 1		%dst.imagp = getelementptr inbounds { double, double }, { double, double }* %dst, i32 0, i32 1
store double %src.real, double* %dst.realp		store double %src.real, double* %dst.realp
store double %src.imag, double* %dst.imagp		store double %src.imag, double* %dst.imagp
ret void		ret void
}		}

		define void @store-pair-post-indexed-quadword() nounwind {
		; CHECK-LABEL: store-pair-post-indexed-quadword
		; GENERIC: stp q{{[0-9]+}}, q{{[0-9]+}}, [sp], #64
		; EXYNOSM1: str q{{[0-9]+}}, [sp]
		; EXYNOSM1: str q{{[0-9]+}}, [sp, #16]
		; EXYNOSM3: stp q{{[0-9]+}}, q{{[0-9]+}}, [sp]{{$}}
		%src = alloca { fp128, fp128 }, align 8
		%dst = alloca { fp128, fp128 }, align 8

		%src.realp = getelementptr inbounds { fp128, fp128 }, { fp128, fp128 }* %src, i32 0, i32 0
		%src.real = load fp128, fp128* %src.realp
		%src.imagp = getelementptr inbounds { fp128, fp128 }, { fp128, fp128 }* %src, i32 0, i32 1
		%src.imag = load fp128, fp128* %src.imagp

		%dst.realp = getelementptr inbounds { fp128, fp128 }, { fp128, fp128 }* %dst, i32 0, i32 0
		%dst.imagp = getelementptr inbounds { fp128, fp128 }, { fp128, fp128 }* %dst, i32 0, i32 1
		store fp128 %src.real, fp128* %dst.realp
		store fp128 %src.imag, fp128* %dst.imagp
		ret void
		}

; Check the following transform:		; Check the following transform:
;		;
; (ldr\|str) X, [x20]		; (ldr\|str) X, [x20]
; ...		; ...
; sub x20, x20, #16		; sub x20, x20, #16
; ->		; ->
; (ldr\|str) X, [x20], #-16		; (ldr\|str) X, [x20], #-16
;		;
▲ Show 20 Lines • Show All 167 Lines • ▼ Show 20 Lines	for.body:
%cond = icmp sgt i64 %dec.i, 0		%cond = icmp sgt i64 %dec.i, 0
br i1 %cond, label %for.body, label %end		br i1 %cond, label %for.body, label %end
end:		end:
ret void		ret void
}		}

define void @post-indexed-paired-min-offset(i64* %a, i64* %b, i64 %count) nounwind {		define void @post-indexed-paired-min-offset(i64* %a, i64* %b, i64 %count) nounwind {
; CHECK-LABEL: post-indexed-paired-min-offset		; CHECK-LABEL: post-indexed-paired-min-offset
; CHECK: ldp x{{[0-9]+}}, x{{[0-9]+}}, [x{{[0-9]+}}], #-512		; GENERIC: ldp x{{[0-9]+}}, x{{[0-9]+}}, [x{{[0-9]+}}], #-512
		; EXYNOS: ldp x{{[0-9]+}}, x{{[0-9]+}}, [x{{[0-9]+}}]{{$}}
; CHECK: stp x{{[0-9]+}}, x{{[0-9]+}}, [x{{[0-9]+}}], #-512		; CHECK: stp x{{[0-9]+}}, x{{[0-9]+}}, [x{{[0-9]+}}], #-512
br label %for.body		br label %for.body
for.body:		for.body:
%phi1 = phi i64* [ %gep4, %for.body ], [ %b, %0 ]		%phi1 = phi i64* [ %gep4, %for.body ], [ %b, %0 ]
%phi2 = phi i64* [ %gep3, %for.body ], [ %a, %0 ]		%phi2 = phi i64* [ %gep3, %for.body ], [ %a, %0 ]
%i = phi i64 [ %dec.i, %for.body], [ %count, %0 ]		%i = phi i64 [ %dec.i, %for.body], [ %count, %0 ]
%gep1 = getelementptr i64, i64* %phi1, i64 1		%gep1 = getelementptr i64, i64* %phi1, i64 1
%load1 = load i64, i64* %gep1		%load1 = load i64, i64* %gep1
Show All 36 Lines	end:
ret void		ret void
}		}

; DAGCombiner::MergeConsecutiveStores merges this into a vector store,		; DAGCombiner::MergeConsecutiveStores merges this into a vector store,
; replaceZeroVectorStore should split the vector store back into		; replaceZeroVectorStore should split the vector store back into
; scalar stores which should get merged by AArch64LoadStoreOptimizer.		; scalar stores which should get merged by AArch64LoadStoreOptimizer.
define void @merge_zr32(i32* %p) {		define void @merge_zr32(i32* %p) {
; CHECK-LABEL: merge_zr32:		; CHECK-LABEL: merge_zr32:
; CHECK: // %entry		; NOSTRICTALIGN: str xzr, [x{{[0-9]+}}]
gberryUnsubmitted Not Done Reply Inline Actions Are all of these changes needed? gberry: Are all of these changes needed?
evandroAuthorUnsubmitted Not Done Reply Inline Actions Ditto. evandro: Ditto.
; NOSTRICTALIGN-NEXT: str xzr, [x{{[0-9]+}}]		; STRICTALIGN: stp wzr, wzr, [x{{[0-9]+}}]
; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}]
; CHECK-NEXT: ret
entry:		entry:
store i32 0, i32* %p		store i32 0, i32* %p
%p1 = getelementptr i32, i32* %p, i32 1		%p1 = getelementptr i32, i32* %p, i32 1
store i32 0, i32* %p1		store i32 0, i32* %p1
ret void		ret void
}		}

; Same as merge_zr32 but the merged stores should also get paried.		; Same as merge_zr32 but the merged stores should also get paried.
define void @merge_zr32_2(i32* %p) {		define void @merge_zr32_2(i32* %p) {
; CHECK-LABEL: merge_zr32_2:		; CHECK-LABEL: merge_zr32_2:
; CHECK: // %entry		; NOSTRICTALIGN: stp xzr, xzr, [x{{[0-9]+}}]
; NOSTRICTALIGN-NEXT: stp xzr, xzr, [x{{[0-9]+}}]		; STRICTALIGN: stp wzr, wzr, [x{{[0-9]+}}]
; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}]
; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}, #8]		; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}, #8]
; CHECK-NEXT: ret
entry:		entry:
store i32 0, i32* %p		store i32 0, i32* %p
%p1 = getelementptr i32, i32* %p, i32 1		%p1 = getelementptr i32, i32* %p, i32 1
store i32 0, i32* %p1		store i32 0, i32* %p1
%p2 = getelementptr i32, i32* %p, i64 2		%p2 = getelementptr i32, i32* %p, i64 2
store i32 0, i32* %p2		store i32 0, i32* %p2
%p3 = getelementptr i32, i32* %p, i64 3		%p3 = getelementptr i32, i32* %p, i64 3
store i32 0, i32* %p3		store i32 0, i32* %p3
ret void		ret void
}		}

; Like merge_zr32_2, but checking the largest allowed stp immediate offset.		; Like merge_zr32_2, but checking the largest allowed stp immediate offset.
define void @merge_zr32_2_offset(i32* %p) {		define void @merge_zr32_2_offset(i32* %p) {
; CHECK-LABEL: merge_zr32_2_offset:		; CHECK-LABEL: merge_zr32_2_offset:
; CHECK: // %entry		; NOSTRICTALIGN: stp xzr, xzr, [x{{[0-9]+}}, #504]
; NOSTRICTALIGN-NEXT: stp xzr, xzr, [x{{[0-9]+}}, #504]		; STRICTALIGN: str wzr, [x{{[0-9]+}}, #504]
; STRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #504]
; STRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #508]		; STRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #508]
; STRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #512]		; STRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #512]
; STRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #516]		; STRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #516]
; CHECK-NEXT: ret
entry:		entry:
%p0 = getelementptr i32, i32* %p, i32 126		%p0 = getelementptr i32, i32* %p, i32 126
store i32 0, i32* %p0		store i32 0, i32* %p0
%p1 = getelementptr i32, i32* %p, i32 127		%p1 = getelementptr i32, i32* %p, i32 127
store i32 0, i32* %p1		store i32 0, i32* %p1
%p2 = getelementptr i32, i32* %p, i64 128		%p2 = getelementptr i32, i32* %p, i64 128
store i32 0, i32* %p2		store i32 0, i32* %p2
%p3 = getelementptr i32, i32* %p, i64 129		%p3 = getelementptr i32, i32* %p, i64 129
store i32 0, i32* %p3		store i32 0, i32* %p3
ret void		ret void
}		}

; Like merge_zr32, but replaceZeroVectorStore should not split this		; Like merge_zr32, but replaceZeroVectorStore should not split this
; vector store since the address offset is too large for the stp		; vector store since the address offset is too large for the stp
; instruction.		; instruction.
define void @no_merge_zr32_2_offset(i32* %p) {		define void @no_merge_zr32_2_offset(i32* %p) {
; CHECK-LABEL: no_merge_zr32_2_offset:		; CHECK-LABEL: no_merge_zr32_2_offset:
; CHECK: // %entry		; NOSTRICTALIGN: movi v[[REG:[0-9]]].2d, #0000000000000000
; NOSTRICTALIGN-NEXT: movi v[[REG:[0-9]]].2d, #0000000000000000
; NOSTRICTALIGN-NEXT: str q[[REG]], [x{{[0-9]+}}, #4096]		; NOSTRICTALIGN-NEXT: str q[[REG]], [x{{[0-9]+}}, #4096]
; STRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #4096]		; STRICTALIGN: str wzr, [x{{[0-9]+}}, #4096]
; STRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #4100]		; STRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #4100]
; STRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #4104]		; STRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #4104]
; STRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #4108]		; STRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #4108]
; CHECK-NEXT: ret
entry:		entry:
%p0 = getelementptr i32, i32* %p, i32 1024		%p0 = getelementptr i32, i32* %p, i32 1024
store i32 0, i32* %p0		store i32 0, i32* %p0
%p1 = getelementptr i32, i32* %p, i32 1025		%p1 = getelementptr i32, i32* %p, i32 1025
store i32 0, i32* %p1		store i32 0, i32* %p1
%p2 = getelementptr i32, i32* %p, i64 1026		%p2 = getelementptr i32, i32* %p, i64 1026
store i32 0, i32* %p2		store i32 0, i32* %p2
%p3 = getelementptr i32, i32* %p, i64 1027		%p3 = getelementptr i32, i32* %p, i64 1027
store i32 0, i32* %p3		store i32 0, i32* %p3
ret void		ret void
}		}

; Like merge_zr32, but replaceZeroVectorStore should not split the		; Like merge_zr32, but replaceZeroVectorStore should not split the
; vector store since the zero constant vector has multiple uses, so we		; vector store since the zero constant vector has multiple uses, so we
; err on the side that allows for stp q instruction generation.		; err on the side that allows for stp q instruction generation.
define void @merge_zr32_3(i32* %p) {		define void @merge_zr32_3(i32* %p) {
; CHECK-LABEL: merge_zr32_3:		; CHECK-LABEL: merge_zr32_3:
; CHECK: // %entry		; NOSTRICTALIGN: movi v[[REG:[0-9]]].2d, #0000000000000000
; NOSTRICTALIGN-NEXT: movi v[[REG:[0-9]]].2d, #0000000000000000
; NOSTRICTALIGN-NEXT: stp q[[REG]], q[[REG]], [x{{[0-9]+}}]		; NOSTRICTALIGN-NEXT: stp q[[REG]], q[[REG]], [x{{[0-9]+}}]
; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}]		; STRICTALIGN: stp wzr, wzr, [x{{[0-9]+}}]
; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}, #8]		; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}, #8]
; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}, #16]		; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}, #16]
; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}, #24]		; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}, #24]
; CHECK-NEXT: ret
entry:		entry:
store i32 0, i32* %p		store i32 0, i32* %p
%p1 = getelementptr i32, i32* %p, i32 1		%p1 = getelementptr i32, i32* %p, i32 1
store i32 0, i32* %p1		store i32 0, i32* %p1
%p2 = getelementptr i32, i32* %p, i64 2		%p2 = getelementptr i32, i32* %p, i64 2
store i32 0, i32* %p2		store i32 0, i32* %p2
%p3 = getelementptr i32, i32* %p, i64 3		%p3 = getelementptr i32, i32* %p, i64 3
store i32 0, i32* %p3		store i32 0, i32* %p3
%p4 = getelementptr i32, i32* %p, i64 4		%p4 = getelementptr i32, i32* %p, i64 4
store i32 0, i32* %p4		store i32 0, i32* %p4
%p5 = getelementptr i32, i32* %p, i64 5		%p5 = getelementptr i32, i32* %p, i64 5
store i32 0, i32* %p5		store i32 0, i32* %p5
%p6 = getelementptr i32, i32* %p, i64 6		%p6 = getelementptr i32, i32* %p, i64 6
store i32 0, i32* %p6		store i32 0, i32* %p6
%p7 = getelementptr i32, i32* %p, i64 7		%p7 = getelementptr i32, i32* %p, i64 7
store i32 0, i32* %p7		store i32 0, i32* %p7
ret void		ret void
}		}

; Like merge_zr32, but with 2-vector type.		; Like merge_zr32, but with 2-vector type.
define void @merge_zr32_2vec(<2 x i32>* %p) {		define void @merge_zr32_2vec(<2 x i32>* %p) {
; CHECK-LABEL: merge_zr32_2vec:		; CHECK-LABEL: merge_zr32_2vec:
; CHECK: // %entry		; NOSTRICTALIGN: str xzr, [x{{[0-9]+}}]
; NOSTRICTALIGN-NEXT: str xzr, [x{{[0-9]+}}]		; STRICTALIGN: stp wzr, wzr, [x{{[0-9]+}}]
; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}]
; CHECK-NEXT: ret
entry:		entry:
store <2 x i32> zeroinitializer, <2 x i32>* %p		store <2 x i32> zeroinitializer, <2 x i32>* %p
ret void		ret void
}		}

; Like merge_zr32, but with 3-vector type.		; Like merge_zr32, but with 3-vector type.
define void @merge_zr32_3vec(<3 x i32>* %p) {		define void @merge_zr32_3vec(<3 x i32>* %p) {
; CHECK-LABEL: merge_zr32_3vec:		; CHECK-LABEL: merge_zr32_3vec:
; CHECK: // %entry		; NOSTRICTALIGN: str xzr, [x{{[0-9]+}}]
; NOSTRICTALIGN-NEXT: str xzr, [x{{[0-9]+}}]
; NOSTRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #8]		; NOSTRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #8]
; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}]		; STRICTALIGN: stp wzr, wzr, [x{{[0-9]+}}]
; STRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #8]		; STRICTALIGN-NEXT: str wzr, [x{{[0-9]+}}, #8]
; CHECK-NEXT: ret
entry:		entry:
store <3 x i32> zeroinitializer, <3 x i32>* %p		store <3 x i32> zeroinitializer, <3 x i32>* %p
ret void		ret void
}		}

; Like merge_zr32, but with 4-vector type.		; Like merge_zr32, but with 4-vector type.
define void @merge_zr32_4vec(<4 x i32>* %p) {		define void @merge_zr32_4vec(<4 x i32>* %p) {
; CHECK-LABEL: merge_zr32_4vec:		; CHECK-LABEL: merge_zr32_4vec:
; CHECK: // %entry		; NOSTRICTALIGN: stp xzr, xzr, [x{{[0-9]+}}]
; NOSTRICTALIGN-NEXT: stp xzr, xzr, [x{{[0-9]+}}]		; STRICTALIGN: stp wzr, wzr, [x{{[0-9]+}}]
; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}]
; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}, #8]		; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}, #8]
; CHECK-NEXT: ret
entry:		entry:
store <4 x i32> zeroinitializer, <4 x i32>* %p		store <4 x i32> zeroinitializer, <4 x i32>* %p
ret void		ret void
}		}

; Like merge_zr32, but with 2-vector float type.		; Like merge_zr32, but with 2-vector float type.
define void @merge_zr32_2vecf(<2 x float>* %p) {		define void @merge_zr32_2vecf(<2 x float>* %p) {
; CHECK-LABEL: merge_zr32_2vecf:		; CHECK-LABEL: merge_zr32_2vecf:
; CHECK: // %entry		; NOSTRICTALIGN: str xzr, [x{{[0-9]+}}]
; NOSTRICTALIGN-NEXT: str xzr, [x{{[0-9]+}}]		; STRICTALIGN: stp wzr, wzr, [x{{[0-9]+}}]
; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}]
; CHECK-NEXT: ret
entry:		entry:
store <2 x float> zeroinitializer, <2 x float>* %p		store <2 x float> zeroinitializer, <2 x float>* %p
ret void		ret void
}		}

; Like merge_zr32, but with 4-vector float type.		; Like merge_zr32, but with 4-vector float type.
define void @merge_zr32_4vecf(<4 x float>* %p) {		define void @merge_zr32_4vecf(<4 x float>* %p) {
; CHECK-LABEL: merge_zr32_4vecf:		; CHECK-LABEL: merge_zr32_4vecf:
; CHECK: // %entry		; NOSTRICTALIGN: stp xzr, xzr, [x{{[0-9]+}}]
; NOSTRICTALIGN-NEXT: stp xzr, xzr, [x{{[0-9]+}}]		; STRICTALIGN: stp wzr, wzr, [x{{[0-9]+}}]
; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}]
; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}, #8]		; STRICTALIGN-NEXT: stp wzr, wzr, [x{{[0-9]+}}, #8]
; CHECK-NEXT: ret
entry:		entry:
store <4 x float> zeroinitializer, <4 x float>* %p		store <4 x float> zeroinitializer, <4 x float>* %p
ret void		ret void
}		}

; Similar to merge_zr32, but for 64-bit values.		; Similar to merge_zr32, but for 64-bit values.
define void @merge_zr64(i64* %p) {		define void @merge_zr64(i64* %p) {
; CHECK-LABEL: merge_zr64:		; CHECK-LABEL: merge_zr64:
; CHECK: // %entry		; CHECK: stp xzr, xzr, [x{{[0-9]+}}]
; CHECK-NEXT: stp xzr, xzr, [x{{[0-9]+}}]
; CHECK-NEXT: ret		; CHECK-NEXT: ret
entry:		entry:
store i64 0, i64* %p		store i64 0, i64* %p
%p1 = getelementptr i64, i64* %p, i64 1		%p1 = getelementptr i64, i64* %p, i64 1
store i64 0, i64* %p1		store i64 0, i64* %p1
ret void		ret void
}		}

; Similar to merge_zr32, but for 64-bit values and with unaligned stores.		; Similar to merge_zr32, but for 64-bit values and with unaligned stores.
define void @merge_zr64_unalign(<2 x i64>* %p) {		define void @merge_zr64_unalign(<2 x i64>* %p) {
; CHECK-LABEL: merge_zr64_unalign:		; CHECK-LABEL: merge_zr64_unalign:
; CHECK: // %entry		; NOSTRICTALIGN: stp xzr, xzr, [x{{[0-9]+}}]
; NOSTRICTALIGN-NEXT: stp xzr, xzr, [x{{[0-9]+}}]
; STRICTALIGN: strb		; STRICTALIGN: strb
; STRICTALIGN: strb		; STRICTALIGN: strb
; STRICTALIGN: strb		; STRICTALIGN: strb
; STRICTALIGN: strb		; STRICTALIGN: strb
; STRICTALIGN: strb		; STRICTALIGN: strb
; STRICTALIGN: strb		; STRICTALIGN: strb
; STRICTALIGN: strb		; STRICTALIGN: strb
; STRICTALIGN: strb		; STRICTALIGN: strb
; STRICTALIGN: strb		; STRICTALIGN: strb
; STRICTALIGN: strb		; STRICTALIGN: strb
; STRICTALIGN: strb		; STRICTALIGN: strb
; STRICTALIGN: strb		; STRICTALIGN: strb
; STRICTALIGN: strb		; STRICTALIGN: strb
; STRICTALIGN: strb		; STRICTALIGN: strb
; STRICTALIGN: strb		; STRICTALIGN: strb
; STRICTALIGN: strb		; STRICTALIGN: strb
; CHECK-NEXT: ret
entry:		entry:
store <2 x i64> zeroinitializer, <2 x i64>* %p, align 1		store <2 x i64> zeroinitializer, <2 x i64>* %p, align 1
ret void		ret void
}		}

; Similar to merge_zr32_3, replaceZeroVectorStore should not split the		; Similar to merge_zr32_3, replaceZeroVectorStore should not split the
; vector store since the zero constant vector has multiple uses.		; vector store since the zero constant vector has multiple uses.
define void @merge_zr64_2(i64* %p) {		define void @merge_zr64_2(i64* %p) {
; CHECK-LABEL: merge_zr64_2:		; CHECK-LABEL: merge_zr64_2:
; CHECK: // %entry		; NOSTRICTALIGN: movi v[[REG:[0-9]]].2d, #0000000000000000
; NOSTRICTALIGN-NEXT: movi v[[REG:[0-9]]].2d, #0000000000000000
; NOSTRICTALIGN-NEXT: stp q[[REG]], q[[REG]], [x{{[0-9]+}}]		; NOSTRICTALIGN-NEXT: stp q[[REG]], q[[REG]], [x{{[0-9]+}}]
; STRICTALIGN-NEXT: stp xzr, xzr, [x{{[0-9]+}}]		; STRICTALIGN: stp xzr, xzr, [x{{[0-9]+}}]
; STRICTALIGN-NEXT: stp xzr, xzr, [x{{[0-9]+}}, #16]		; STRICTALIGN-NEXT: stp xzr, xzr, [x{{[0-9]+}}, #16]
; CHECK-NEXT: ret
entry:		entry:
store i64 0, i64* %p		store i64 0, i64* %p
%p1 = getelementptr i64, i64* %p, i64 1		%p1 = getelementptr i64, i64* %p, i64 1
store i64 0, i64* %p1		store i64 0, i64* %p1
%p2 = getelementptr i64, i64* %p, i64 2		%p2 = getelementptr i64, i64* %p, i64 2
store i64 0, i64* %p2		store i64 0, i64* %p2
%p3 = getelementptr i64, i64* %p, i64 3		%p3 = getelementptr i64, i64* %p, i64 3
store i64 0, i64* %p3		store i64 0, i64* %p3
ret void		ret void
}		}

; Like merge_zr64, but with 2-vector double type.		; Like merge_zr64, but with 2-vector double type.
define void @merge_zr64_2vecd(<2 x double>* %p) {		define void @merge_zr64_2vecd(<2 x double>* %p) {
; CHECK-LABEL: merge_zr64_2vecd:		; CHECK-LABEL: merge_zr64_2vecd:
; CHECK: // %entry		; CHECK: stp xzr, xzr, [x{{[0-9]+}}]
; CHECK-NEXT: stp xzr, xzr, [x{{[0-9]+}}]
; CHECK-NEXT: ret
entry:		entry:
store <2 x double> zeroinitializer, <2 x double>* %p		store <2 x double> zeroinitializer, <2 x double>* %p
ret void		ret void
}		}

; Like merge_zr64, but with 3-vector i64 type.		; Like merge_zr64, but with 3-vector i64 type.
define void @merge_zr64_3vec(<3 x i64>* %p) {		define void @merge_zr64_3vec(<3 x i64>* %p) {
; CHECK-LABEL: merge_zr64_3vec:		; CHECK-LABEL: merge_zr64_3vec:
; CHECK: // %entry		; CHECK: stp xzr, xzr, [x{{[0-9]+}}]
; CHECK-NEXT: stp xzr, xzr, [x{{[0-9]+}}]
; CHECK-NEXT: str xzr, [x{{[0-9]+}}, #16]		; CHECK-NEXT: str xzr, [x{{[0-9]+}}, #16]
; CHECK-NEXT: ret
entry:		entry:
store <3 x i64> zeroinitializer, <3 x i64>* %p		store <3 x i64> zeroinitializer, <3 x i64>* %p
ret void		ret void
}		}

; Like merge_zr64_2, but with 4-vector double type.		; Like merge_zr64_2, but with 4-vector double type.
define void @merge_zr64_4vecd(<4 x double>* %p) {		define void @merge_zr64_4vecd(<4 x double>* %p) {
; CHECK-LABEL: merge_zr64_4vecd:		; CHECK-LABEL: merge_zr64_4vecd:
; CHECK: // %entry		; CHECK: movi v[[REG:[0-9]]].2d, #0000000000000000
; CHECK-NEXT: movi v[[REG:[0-9]]].2d, #0000000000000000		; GENERIC-NEXT: stp q[[REG]], q[[REG]], [x{{[0-9]+}}]
; CHECK-NEXT: stp q[[REG]], q[[REG]], [x{{[0-9]+}}]		; EXYNOSM1-NEXT: str q[[REG]], [x{{[0-9]+}}, #16]
; CHECK-NEXT: ret		; EXYNOSM1-NEXT: str q[[REG]], [x{{[0-9]+}}]
		; EXYNOSM3-NEXT: stp q[[REG]], q[[REG]], [x{{[0-9]+}}]
entry:		entry:
store <4 x double> zeroinitializer, <4 x double>* %p		store <4 x double> zeroinitializer, <4 x double>* %p
ret void		ret void
}		}

; Verify that non-consecutive merges do not generate q0		; Verify that non-consecutive merges do not generate q0
define void @merge_multiple_128bit_stores(i64* %p) {		define void @merge_multiple_128bit_stores(i64* %p) {
; CHECK-LABEL: merge_multiple_128bit_stores		; CHECK-LABEL: merge_multiple_128bit_stores
; CHECK: // %entry		; NOSTRICTALIGN: movi v[[REG:[0-9]]].2d, #0000000000000000
; NOSTRICTALIGN-NEXT: movi v[[REG:[0-9]]].2d, #0000000000000000
; NOSTRICTALIGN-NEXT: str q0, [x0]		; NOSTRICTALIGN-NEXT: str q0, [x0]
; NOSTRICTALIGN-NEXT: stur q0, [x0, #24]		; NOSTRICTALIGN-NEXT: stur q0, [x0, #24]
; NOSTRICTALIGN-NEXT: str q0, [x0, #48]		; NOSTRICTALIGN-NEXT: str q0, [x0, #48]
; STRICTALIGN-NEXT: stp xzr, xzr, [x0]		; STRICTALIGN: stp xzr, xzr, [x0]
; STRICTALIGN-NEXT: stp xzr, xzr, [x0, #24]		; STRICTALIGN-NEXT: stp xzr, xzr, [x0, #24]
; STRICTALIGN-NEXT: stp xzr, xzr, [x0, #48]		; STRICTALIGN-NEXT: stp xzr, xzr, [x0, #48]
; CHECK-NEXT: ret
entry:		entry:
store i64 0, i64* %p		store i64 0, i64* %p
%p1 = getelementptr i64, i64* %p, i64 1		%p1 = getelementptr i64, i64* %p, i64 1
store i64 0, i64* %p1		store i64 0, i64* %p1
%p3 = getelementptr i64, i64* %p, i64 3		%p3 = getelementptr i64, i64* %p, i64 3
store i64 0, i64* %p3		store i64 0, i64* %p3
%p4 = getelementptr i64, i64* %p, i64 4		%p4 = getelementptr i64, i64* %p, i64 4
store i64 0, i64* %p4		store i64 0, i64* %p4
%p6 = getelementptr i64, i64* %p, i64 6		%p6 = getelementptr i64, i64* %p, i64 6
store i64 0, i64* %p6		store i64 0, i64* %p6
%p7 = getelementptr i64, i64* %p, i64 7		%p7 = getelementptr i64, i64* %p, i64 7
store i64 0, i64* %p7		store i64 0, i64* %p7
ret void		ret void
}		}

; Verify that large stores generate stp q		; Verify that large stores generate stp q
define void @merge_multiple_128bit_stores_consec(i64* %p) {		define void @merge_multiple_128bit_stores_consec(i64* %p) {
; CHECK-LABEL: merge_multiple_128bit_stores_consec		; CHECK-LABEL: merge_multiple_128bit_stores_consec
; CHECK: // %entry		; NOSTRICTALIGN: movi v[[REG:[0-9]]].2d, #0000000000000000
; NOSTRICTALIGN-NEXT: movi v[[REG:[0-9]]].2d, #0000000000000000
; NOSTRICTALIGN-NEXT: stp q[[REG]], q[[REG]], [x{{[0-9]+}}]		; NOSTRICTALIGN-NEXT: stp q[[REG]], q[[REG]], [x{{[0-9]+}}]
; NOSTRICTALIGN-NEXT: stp q[[REG]], q[[REG]], [x{{[0-9]+}}, #32]		; NOSTRICTALIGN-NEXT: stp q[[REG]], q[[REG]], [x{{[0-9]+}}, #32]
; STRICTALIGN-NEXT: stp xzr, xzr, [x0]		; STRICTALIGN: stp xzr, xzr, [x0]
; STRICTALIGN-NEXT: stp xzr, xzr, [x0, #16]		; STRICTALIGN-NEXT: stp xzr, xzr, [x0, #16]
; STRICTALIGN-NEXT: stp xzr, xzr, [x0, #32]		; STRICTALIGN-NEXT: stp xzr, xzr, [x0, #32]
; STRICTALIGN-NEXT: stp xzr, xzr, [x0, #48]		; STRICTALIGN-NEXT: stp xzr, xzr, [x0, #48]
; CHECK-NEXT: ret
entry:		entry:
store i64 0, i64* %p		store i64 0, i64* %p
%p1 = getelementptr i64, i64* %p, i64 1		%p1 = getelementptr i64, i64* %p, i64 1
store i64 0, i64* %p1		store i64 0, i64* %p1
%p2 = getelementptr i64, i64* %p, i64 2		%p2 = getelementptr i64, i64* %p, i64 2
store i64 0, i64* %p2		store i64 0, i64* %p2
%p3 = getelementptr i64, i64* %p, i64 3		%p3 = getelementptr i64, i64* %p, i64 3
store i64 0, i64* %p3		store i64 0, i64* %p3
%p4 = getelementptr i64, i64* %p, i64 4		%p4 = getelementptr i64, i64* %p, i64 4
store i64 0, i64* %p4		store i64 0, i64* %p4
%p5 = getelementptr i64, i64* %p, i64 5		%p5 = getelementptr i64, i64* %p, i64 5
store i64 0, i64* %p5		store i64 0, i64* %p5
%p6 = getelementptr i64, i64* %p, i64 6		%p6 = getelementptr i64, i64* %p, i64 6
store i64 0, i64* %p6		store i64 0, i64* %p6
%p7 = getelementptr i64, i64* %p, i64 7		%p7 = getelementptr i64, i64* %p, i64 7
store i64 0, i64* %p7		store i64 0, i64* %p7
ret void		ret void
}		}

; Check for bug 34674 where invalid add of xzr was being generated.		; Check for bug 34674 where invalid add of xzr was being generated.
; CHECK-LABEL: bug34674:		; CHECK-LABEL: bug34674:
; CHECK: // %entry		; CHECK: mov [[ZREG:x[0-9]+]], {{#0\|xzr}}
; CHECK-NEXT: mov [[ZREG:x[0-9]+]], xzr
; CHECK-DAG: stp xzr, xzr, [x0]		; CHECK-DAG: stp xzr, xzr, [x0]
; CHECK-DAG: add x{{[0-9]+}}, [[ZREG]], #1		; CHECK-DAG: add x{{[0-9]+}}, [[ZREG]], #1
define i64 @bug34674(<2 x i64>* %p) {		define i64 @bug34674(<2 x i64>* %p) {
entry:		entry:
store <2 x i64> zeroinitializer, <2 x i64>* %p		store <2 x i64> zeroinitializer, <2 x i64>* %p
%p2 = bitcast <2 x i64>* %p to i64*		%p2 = bitcast <2 x i64>* %p to i64*
%ld = load i64, i64* %p2		%ld = load i64, i64* %p2
%add = add i64 %ld, 1		%add = add i64 %ld, 1
ret i64 %add		ret i64 %add
}		}

llvm/test/CodeGen/AArch64/machine-outliner-remarks.ll

Show First 20 Lines • Show All 89 Lines • ▼ Show 20 Lines	define void @bar() #0 !dbg !27 {
%4 = alloca i32, align 4		%4 = alloca i32, align 4
store i32 0, i32* %1, align 4		store i32 0, i32* %1, align 4
store i32 1, i32* %2, align 4, !dbg !33		store i32 1, i32* %2, align 4, !dbg !33
store i32 2, i32* %3, align 4		store i32 2, i32* %3, align 4
store i32 3, i32* %4, align 4, !dbg !35		store i32 3, i32* %4, align 4, !dbg !35
ret void		ret void
}		}

attributes #0 = { noredzone nounwind ssp uwtable "no-frame-pointer-elim"="false" "target-cpu"="cyclone" }		attributes #0 = { optsize minsize noredzone nounwind ssp uwtable "no-frame-pointer-elim"="false" "target-cpu"="cyclone" }
gberryUnsubmitted Not Done Reply Inline Actions Are these changes needed? gberry: Are these changes needed?
evandroAuthorUnsubmitted Not Done Reply Inline Actions Yes, since the changes in this patch affect the resulting code for this test, unless it's optimized for size. evandro: Yes, since the changes in this patch affect the resulting code for this test, unless it's…

!llvm.dbg.cu = !{!0}		!llvm.dbg.cu = !{!0}
!llvm.module.flags = !{!3, !4, !5, !6}		!llvm.module.flags = !{!3, !4, !5, !6}
!llvm.ident = !{!7}		!llvm.ident = !{!7}

!0 = distinct !DICompileUnit(language: DW_LANG_C99, file: !1, isOptimized: false, runtimeVersion: 0, emissionKind: FullDebug, enums: !2)		!0 = distinct !DICompileUnit(language: DW_LANG_C99, file: !1, isOptimized: false, runtimeVersion: 0, emissionKind: FullDebug, enums: !2)
!1 = !DIFile(filename: "machine-outliner-remarks.ll", directory: "/tmp")		!1 = !DIFile(filename: "machine-outliner-remarks.ll", directory: "/tmp")
!2 = !{}		!2 = !{}
Show All 17 Lines

llvm/test/CodeGen/AArch64/machine-outliner.ll

	Show First 20 Lines • Show All 55 Lines • ▼ Show 20 Lines
	; CHECK: orr w8, wzr, #0x1			; CHECK: orr w8, wzr, #0x1
	; CHECK-NEXT: stp w8, wzr, [sp, #8]			; CHECK-NEXT: stp w8, wzr, [sp, #8]
	; CHECK-NEXT: orr w8, wzr, #0x2			; CHECK-NEXT: orr w8, wzr, #0x2
	; CHECK-NEXT: str w8, [sp, #4]			; CHECK-NEXT: str w8, [sp, #4]
	; CHECK-NEXT: orr w8, wzr, #0x3			; CHECK-NEXT: orr w8, wzr, #0x3
	; CHECK-NEXT: str w8, [sp], #16			; CHECK-NEXT: str w8, [sp], #16
	; CHECK-NEXT: ret			; CHECK-NEXT: ret

	attributes #0 = { noredzone nounwind ssp uwtable "no-frame-pointer-elim"="false" "target-cpu"="cyclone" }			attributes #0 = { optsize minsize noredzone nounwind ssp uwtable "no-frame-pointer-elim"="false" "target-cpu"="cyclone" }

This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Query the target when folding loads and storesNeeds ReviewPublic

Details

Diff Detail

Event Timeline

Revision Contents

Diff 133518

llvm/lib/Target/AArch64/AArch64LoadStoreOptimizer.cpp

llvm/lib/Target/AArch64/AArch64Subtarget.h

llvm/lib/Target/AArch64/AArch64Subtarget.cpp

llvm/test/CodeGen/AArch64/ldst-opt.ll

llvm/test/CodeGen/AArch64/machine-outliner-remarks.ll

llvm/test/CodeGen/AArch64/machine-outliner.ll

[AArch64] Query the target when folding loads and stores
Needs ReviewPublic