
[AArch64] add SSA Load Store optimization pass
AbandonedPublic

Authored by JongwonLee on Apr 8 2016, 2:34 AM.

Details

Reviewers
jmolloy
junbuml
Summary

Find two consecutive 32-bit loads and two consecutive 32-bit stores that store the values of those loads, and transform the load/store pairs into a single 64-bit load and a single 64-bit store.

When the wide load/store is unscaled (ldur/stur), the offset does not need to change.
 e.g.,
   %vreg2 = LDURWi %vreg0, -76;
   %vreg3 = LDURWi %vreg0, -72;
   STURWi %vreg2, %vreg1, -44;
   STURWi %vreg3, %vreg1, -40;
   ; becomes
   %vreg2 = LDURXi %vreg0, -76;
   STURXi %vreg2, %vreg1, -44;

When the wide load/store is scaled (ldr/str), the offset must be half of the original value.
 e.g.,
   %vreg2 = LDRWui %vreg0, 4;
   %vreg3 = LDRWui %vreg0, 5;
   STRWui %vreg2, %vreg1, 2;
   STRWui %vreg3, %vreg1, 3;
   ; becomes
   %vreg2 = LDRXui %vreg0, 2;
   STRXui %vreg2, %vreg1, 1;

When the original load/store is scaled (ldr/str) and has an odd offset value, it can still be widened to an unscaled access if (-256 <= unscaled offset value < 256) is satisfied.
cf.) unscaled offset value = scaled offset value * memory scale size
 e.g.,
   %vreg2 = LDRWui %vreg0, 13;
   %vreg3 = LDRWui %vreg0, 14;
   STRWui %vreg2, %vreg1, 37;
   STRWui %vreg3, %vreg1, 38;
   ; becomes
   %vreg2 = LDURXi %vreg0, 52; 52 = 13 * 4
   STURXi %vreg2, %vreg1, 148; 148 = 37 * 4

Diff Detail

Event Timeline

JongwonLee added inline comments.Apr 11 2016, 3:40 AM
lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
2

Thanks. It should be AArch64SSALoadStoreOptimizer.cpp.

72

TII's isUnscaledLdSt() just returns a boolean according to the opcode. getMemScale() is actually called depending on the result of isUnscaledLdSt(). This usage pattern is also found in AArch64LoadStoreOptimizer.cpp.

169

I don't quite follow your point. Could you show me a code sample of what you want?

287

I agree with your opinion. Thanks. I think Loads/StoresWillBeDeleted is appropriate.

408

I borrowed this format from AArch64LoadStoreOptimizer.cpp. I think there will be more optimization opportunities at this level, so an explanation for each case is worthwhile even though this patch starts with just one optimization case.
Maybe we will have ..

2) description for second optimization
...
3) description for third optimization
...
447

It will be changed to LoadsWillBeDeleted.

test/CodeGen/AArch64/ldst-opt.ll
1125

This change is to prevent the optimization of this patch from being applied. This test is for another optimization.

Some good changes, some comments, but I still wonder why:

  1. You haven't done this in the already existing LdStOpt pass.
  2. You enable this pass by default now, instead of let people play with it for a while.

cheers,
--renato

lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
72

To me, it seems like isUnscaledLdSt should be just:

return (getMemScale() == 1);
169
int FirstMIOffsetStride = getMemScale(FirstMI);
bool FirstMIIsUnscaled = (FirstMIOffsetStride == 1);

or, if you do the transformation I mentioned above:

int FirstMIOffsetStride = TII->getMemScale(FirstMI);
bool FirstMIIsUnscaled = TII->isUnscaledLdSt(FirstMI);

Inliners could common up both calls to getMemScale.

408

Good point.

test/CodeGen/AArch64/ldst-opt.ll
1125

Ah! Makes sense.

mcrosier added a comment.EditedApr 11 2016, 9:48 AM

Some good changes, some comments, but I still wonder why:

  1. You haven't done this in the already existing LdStOpt pass.

I imagine one benefit of merging the 32-bit loads/stores before register allocation is that it saves a register, right?

ldr w2, [x0]
ldr w3, [x0, #4]
str w2, [x1]
str w3, [x1, #4]

becomes

ldr x2, [x0]
str x2, [x1]

saving x3.

This comment was removed by JongwonLee.
test/CodeGen/AArch64/ssa-ldst-opt.ll
10–14

Thanks. I missed this case. In the test code, TBAA metadata is added. In the source code, a routine for checking aliasing is added.

  1. You haven't done this in the already existing LdStOpt pass.

I imagine one benefit of merging the 32-bit loads/stores before register allocation is that it saves a register, right?

Right, good point.

JongwonLee added inline comments.Apr 12 2016, 3:53 AM
lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
72

But there are unscaled ld/st instructions whose mem scale is not 1.

169

Thanks for the example. But the mem scale is not always the same as the offset stride: the stride of a scaled ld/st is 1, while the stride of an unscaled one is the mem scale.

381

Is this format wrong? What does clang-format require?

lib/Target/AArch64/AArch64TargetMachine.cpp
114

I fixed it to be off by default.

test/CodeGen/AArch64/ssa-ldst-opt.ll
2

'aarch64-ssa-load-store-opt' is disabled by default.
An explanation of this test has been added.

JongwonLee updated this revision to Diff 53378.Apr 12 2016, 3:54 AM
JongwonLee edited edge metadata.

I'm happy with the changes, with the formatting fixes pointed out by @mcrosier. I'll leave the approval to @junbuml, though.

lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
73

Do'h, ignore me.

170

Right.

382

clang-format is a tool you run on the source to format it to a given standard. The default standard is LLVM's, so new code always adheres to the standard without much effort.

test/CodeGen/AArch64/ssa-ldst-opt.ll
3

Thanks!

Hi Jongwon
Thank you for the update with the additional alias check. However, since you merge the second load/store up into the first load/store, I believe you need to check whether any instruction between the first and second load/store may alias with the second one, not just the first store against the second load.

I'm also curious how this pass was motivated. Did you see any performance gain with this change?

Please, see my inline comments for minor issues.

lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
22–30

Please remove header files not used in this file. For example :

llvm/Support/raw_ostream.h 
llvm/IR/Constants.h
llvm/CodeGen/MachineTraceMetrics.h
llvm/CodeGen/TargetSchedule.h
81

It appears that you handle STURWi, STRWui, LDRWui, and LDURWi in this patch, so I don't think you need to keep all the other cases.

183–185

Is there a programmatic way to express the meaning of the constants you use here, instead of using them directly?

191

It would be good to add a comment noting that this function also inserts elements into the SmallVector in access order.

196

It would be good to make sure we are handling the instructions expected by this function, by adding assert() or some checks.

220–229

Looks like you can simplify these if statements. For example, "FirstMIBaseReg == SecondMIBaseReg" could be checked earlier, and if (FirstMIOffset + FirstMIOffsetStride == SecondMIOffset) and if (FirstMIOffset < SecondMIOffset) are redundant.

240

It will be good to make sure that MI and MergeMI are instructions expected in this function.

243

unsigned Reg = RegOp.getReg();

262

Can't you just do:
InsertionPoint = MI;

270

llvm_unreachable("Unexpected MI's opcode.");

273–276

NewOpc = ShouldBeUnscaled ? AArch64::LDURXi : AArch64::LDRXui;

331

Should this be an assert(), because you specifically handle AArch64::STURWi and AArch64::STRWui?

350

Same here. I think it should be an assert(), because you specifically handle AArch64::LDRWui and AArch64::LDURWi.

357–358

ConsecutiveStores and ConsecutiveLoads would be better.

360

You may want to do MBBI++ here instead of in line 326, or use a do/while loop.

370

Maybe assert() here as you specifically detect STURWi and STRWui.

414–421

Do we need to check this again? Or maybe assert() ?

425

Don't you also need to check whether any instruction between the first and second load aliases with the second load? The same check should be performed for stores.

455–456

Please change :

//    loads/stores
//    to 64-bit load/store.

into

//    loads/stores to 64-bit load/store.
481–482

Please use the full 80-column line width.

494–495

Can you change it to a range-based loop?

524

Please remove TRI if not used here.

JongwonLee marked 16 inline comments as done.Apr 14 2016, 3:58 AM

Hi Jongwon
Thank you for the update with the additional alias check. However, since you merge the second load/store up into the first load/store, I believe you need to check whether any instruction between the first and second load/store may alias with the second one, not just the first store against the second load.

I'm also curious how this pass was motivated. Did you see any performance gain with this change?

Please, see my inline comments for minor issues.

Hi junbum,
Thanks for the comments.

I think the alias check is needed for (first store, second load) and (second store, first load). I don't think we need to check any instructions other than these 4 (2 loads, 2 stores), because the first store is the only user of the first load and the second store is the only user of the second load.

(1) reg1 = load [mem1]
(2) reg2 = load [mem1 +4]
(3) store reg1, [mem2]
(4) store reg2, [mem2 + 4]

(3) is the only use instruction of reg1
(4) is the only use instruction of reg2

This work was motivated by some test cases. Even though it doesn't produce a performance gain in all cases, it has a positive effect in some. I'll share the data when ready.

lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
183–185

Fixed the code as below.

#define MAX_UNSCALED_OFFSET 255
#define MIN_UNSCALED_OFFSET 256
...
return UnscaledOffset <= MAX_UNSCALED_OFFSET && UnscaledOffset >= MIN_UNSCALED_OFFSET;
...

Is that OK?

331

Some test cases show that this condition is not satisfied when the opcode is AArch64::STURWi or AArch64::STRWui.

350

Some test cases show that this condition is not satisfied when the opcode is AArch64::LDRWui or AArch64::LDURWi.

370

Some test cases show that this condition is not satisfied when the opcode is AArch64::STURWi or AArch64::STRWui.

414–421

We need to check this. For example,

reg2 = load [mem1]
reg1 = load [mem1 +4]
store reg1, [mem2]
store reg2, [mem2 +4]

If we don't check, we will get the following code.

reg2 = wide-load [mem1]
wide-store reg1, [mem2]

The resulting code would use reg1 without a definition.

425

I think the alias check is needed for (first store, second load) and (second store, first load). I don't think we need to check any instructions other than these 4 (2 loads, 2 stores), because the first store is the only user of the first load and the second store is the only user of the second load.

(1) reg1 = load [mem1]
(2) reg2 = load [mem1 +4]
(3) store reg1, [mem2]
(4) store reg2, [mem2 + 4]

(3) is the only use instruction of reg1
(4) is the only use instruction of reg2
JongwonLee updated this revision to Diff 53684.Apr 14 2016, 3:58 AM
junbuml added inline comments.Apr 14 2016, 8:09 AM
lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
34

-256

112–115

return AA->alias(LocA, LocB);

415–422

I see your point. In that case, why don't you change it like:

reg2 = wide-load [mem1]
wide-store reg2, [mem2]
426

In order to merge the second load/store into the first, we need to make sure there are no memory operations aliasing with the second one. For example, in the code below there is a store between the loads. In this case we cannot move the second load up to the first without knowing that mem1 and mem3 do not point to the same address.

reg1 = load [mem1]
store reg3, [mem3 + 4]
reg2 = load [mem1 +4]

store reg1, [mem2]
store reg2, [mem2 + 4]
JongwonLee marked an inline comment as done.Apr 14 2016, 7:38 PM
JongwonLee added inline comments.
lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
415–422
reg2 = load [mem1]
reg1 = load [mem1 +4]

The above two loads can be the below one load.

reg2 = wide-load [mem1]
store reg1, [mem2]
store reg2, [mem2 +4]

But, the above two stores cannot be the below one store.

wide-store reg2, [mem2]

The correct form of wide-store should have reg1 as its source operand.

wide-store reg1, [mem2]
426

Thanks. You are right. I fixed the code to consider all aliases between the two loads and between the two stores.

JongwonLee updated this revision to Diff 53831.Apr 14 2016, 7:38 PM

Hi Jongwon,

Sorry it's taken a little long to revisit. Please see my inline comments.

lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
118

This function basically assumes that MIa and MIb are in the same basic block. But isn't it possible for MIa and MIb to be in different basic blocks?

124–137

I believe you could do something like this :

for (auto &MBBI : make_range(MIa->getIterator(), MIb->getIterator()))

145

When you check aliasing between the two stores, you should also check whether any load aliases with the second store, since you always move the second up to the first.

407–425

I don't think you need to have these alias checks specifically because you check alias between two loads and two stores below.

498

It seems that you don't handle volatile accesses. You could use isCandidateToMergeOrPair().

jmolloy requested changes to this revision.Apr 25 2016, 5:45 AM
jmolloy edited edge metadata.

Hi,

I'm confused about why you're doing this optimization here instead of much earlier in the compiler. This is basically memcpy idiom recognition - taking multiple 32-bit loads/stores and converting them into a llvm.memcpy intrinsic for perfect lowering should do the same job, and would work for all backends.

Have you investigated doing this much earlier (IR, pre-ISel?)

James

lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
2

This comment is mangled and has gone onto a new line. Please truncate it like all other header comments.

33

We use integer constants, not #defines.

This revision now requires changes to proceed.Apr 25 2016, 5:45 AM
JongwonLee marked 4 inline comments as done.Apr 27 2016, 3:04 AM

Hi,

I'm confused about why you're doing this optimization here instead of much earlier in the compiler. This is basically memcpy idiom recognition - taking multiple 32-bit loads/stores and converting them into a llvm.memcpy intrinsic for perfect lowering should do the same job, and would work for all backends.

Have you investigated doing this much earlier (IR, pre-ISel?)

James

Hi James.
This patch considers only AArch64, not other backends. Actually, the earlier-level work that can merge 32-bit loads/stores is IR-level SLP vectorization. Current SLP vectorization does not support 64-bit-wide packing, so I tried to extend its range. Please see the patches (http://reviews.llvm.org/D18237, http://reviews.llvm.org/D19151).

lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
118

It is possible for the load/store instructions to be in different basic blocks, but that case is not considered in this patch.

145

Thanks. I will fix this.

JongwonLee updated this revision to Diff 55179.Apr 27 2016, 3:31 AM
JongwonLee edited edge metadata.

Hi,

This patch considers only AArch64, not other backends. Actually, the earlier-level work that can merge 32-bit loads/stores is IR-level SLP vectorization. Current SLP vectorization does not support 64-bit-wide packing, so I tried to extend its range. Please see the patches (http://reviews.llvm.org/D18237, http://reviews.llvm.org/D19151).

OK, but this functionality is useful for other backends too. It isn't ideal in the long term to put generic functionality needlessly in target-specific areas.

If you're implementing this in the SLP vectorizer, why do you need to do it here as well?

Cheers,

James

OK, but this functionality is useful for other backends too. It isn't ideal in the long term to put generic functionality needlessly in target-specific areas.

If you're implementing this in the SLP vectorizer, why do you need to do it here as well?

64-bit SLP vectorization support and this patch overlap but are not identical; each has its own optimization opportunities, and this patch can supplement SLP vectorization. For example, when loop unrolling runs after SLP vectorization, we lose the chance to pack 32-bit loads/stores unless SLP vectorization runs again. This patch can catch opportunities missed due to the order of IR-level optimizations.

This patch was motivated by a performance improvement in a commercial benchmark.
However, it behaves differently on SPEC and the LLVM test-suite.
On SPEC, compile time increases by about 10%; on the LLVM test-suite, compile time decreases by about 0.02%. Execution time shows no regression in either suite (0.04% and 0.37% improvement, respectively). The data are averages of three runs.

Hi,

So why aren't you doing this in codegen prepare or dag combine?

Cheers,

James

Hi,

So why aren't you doing this in codegen prepare or dag combine?

Cheers,

James

Hi,
I haven't tried this work anywhere other than the backend.
As I mentioned before, it only targets AArch64, not other backends.
Are there any advantages to doing this work in CodeGenPrepare or DAG combine?

Hi,

So why aren't you doing this in codegen prepare or dag combine?

Cheers,

James

Hi,
I haven't tried this work anywhere other than the backend.
As I mentioned before, it only targets AArch64, not other backends.
Are there any advantages to doing this work in CodeGenPrepare or DAG combine?

Hi Jongwon,

Sure, there are advantages. Modifying IR is substantially easier and less prone to error than modifying machine instructions. This is also a generic optimization fixing a problem that likely affects more than just the AArch64 target, therefore the right thing to do is to implement it in such a way that it will benefit other targets (unless that causes a very high cost).

In this case it would seem quicker and easier, with less technical debt later, to implement this higher up in the compiler.

James

In this case it would seem quicker and easier, with less technical debt later, to implement this higher up in the compiler.

+1

mcrosier resigned from this revision.May 5 2016, 10:28 AM
mcrosier removed a reviewer: mcrosier.

Hi,

So why aren't you doing this in codegen prepare or dag combine?

Cheers,

James

Hi,
I haven't tried this work anywhere other than the backend.
As I mentioned before, it only targets AArch64, not other backends.
Are there any advantages to doing this work in CodeGenPrepare or DAG combine?

Hi Jongwon,

Sure, there are advantages. Modifying IR is substantially easier and less prone to error than modifying machine instructions. This is also a generic optimization fixing a problem that likely affects more than just the AArch64 target, therefore the right thing to do is to implement it in such a way that it will benefit other targets (unless that causes a very high cost).

In this case it would seem quicker and easier, with less technical debt later, to implement this higher up in the compiler.

James

Hi James
Thank you for the advice.
I'll try to do this at a higher level in the compiler.

Jongwon

JongwonLee abandoned this revision.May 10 2016, 1:50 AM