
[AArch64] add SSA Load Store optimization pass
AbandonedPublic

Authored by JongwonLee on Apr 8 2016, 2:34 AM.

Details

Reviewers
jmolloy
junbuml
Summary

Find two consecutive 32-bit loads and two consecutive 32-bit stores that store the values of those loads, and transform the load/store pairs into a single 64-bit load and a single 64-bit store.

When the wide load/store is unscaled (ldur/stur), the offset does not need to change.
 e.g.,
   %vreg2 = LDURWi %vreg0, -76;
   %vreg3 = LDURWi %vreg0, -72;
   STURWi %vreg2, %vreg1, -44;
   STURWi %vreg3, %vreg1, -40;
   ; becomes
   %vreg2 = LDURXi %vreg0, -76;
   STURXi %vreg2, %vreg1, -44;

When the wide load/store is scaled (ldr/str), the offset must be half of the original value.
 e.g.,
   %vreg2 = LDRWui %vreg0, 4;
   %vreg3 = LDRWui %vreg0, 5;
   STRWui %vreg2, %vreg1, 2;
   STRWui %vreg3, %vreg1, 3;
   ; becomes
   %vreg2 = LDRXui %vreg0, 2;
   STRXui %vreg2, %vreg1, 1;

When the original load/store is scaled (ldr/str) and has an odd offset value, it can still be widened to an unscaled access if (-256 <= unscaled offset value < 256) is satisfied.
cf.) unscaled offset value = scaled offset value * memory scale size
 e.g.,
   %vreg2 = LDRWui %vreg0, 13;
   %vreg3 = LDRWui %vreg0, 14;
   STRWui %vreg2, %vreg1, 37;
   STRWui %vreg3, %vreg1, 38;
   ; becomes
   %vreg2 = LDURXi %vreg0, 52; 52 = 13 * 4
   STURXi %vreg2, %vreg1, 148; 148 = 37 * 4

Diff Detail

Event Timeline

JongwonLee added inline comments.Apr 11 2016, 3:40 AM
lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
2

Thanks. It should be AArch64SSALoadStoreOptimizer.cpp.

72

TII's isUnscaledLdSt() just returns a boolean according to the opcode. getMemScale() is actually called depending on the result of isUnscaledLdSt(). This usage pattern is also found in AArch64LoadStoreOptimizer.cpp.

169

I don't quite follow your point. Could you show me a code sample of what you want?

287

I agree with your opinion. Thanks. I think Loads/StoresWillBeDeleted is appropriate.

408

I borrowed this format from AArch64LoadStoreOptimizer.cpp. I think there will be more optimization opportunities at this level, so an explanation for each case is worthwhile even though this patch starts with just one optimization case.
Maybe we will have ..

2) description for second optimization
...
3) description for third optimization
...
447

It will be changed to LoadsWillBeDeleted.

test/CodeGen/AArch64/ldst-opt.ll
1125

This change is to prevent the optimization of this patch from being applied. This test is for another optimization.

Some good changes, some comments, but I still wonder why:

  1. You haven't done this in the already existing LdStOpt pass.
  2. You enable this pass by default now, instead of let people play with it for a while.

cheers,
--renato

lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
72

To me, it seems like isUnscaledLdSt should be just:

return (getMemScale() == 1);
169
int FirstMIOffsetStride = getMemScale(FirstMI);
bool FirstMIIsUnscaled = (FirstMIOffsetStride == 1);

or, if you do the transformation I mentioned above:

int FirstMIOffsetStride = TII->getMemScale(FirstMI);
bool FirstMIIsUnscaled = TII->isUnscaledLdSt(FirstMI);

Inliners could common up both calls to getMemScale.

408

Good point.

test/CodeGen/AArch64/ldst-opt.ll
1125

Ah! Makes sense.

mcrosier added a comment.EditedApr 11 2016, 9:48 AM

Some good changes, some comments, but I still wonder why:

  1. You haven't done this in the already existing LdStOpt pass.

I imagine one benefit of merging the 32-bit loads/stores before register allocation is that it saves a register, right?

ldr w2, [x0]
ldr w3, [x0, #4]
str w2, [x1]
str w3, [x1, #4]

becomes

ldr x2, [x0]
str x2, [x1]

saving x3.

This comment was removed by JongwonLee.
test/CodeGen/AArch64/ssa-ldst-opt.ll
10–14

Thanks. I missed this case. In the test code, TBAA metadata is added. In the source code, a routine for checking aliasing is added.

  1. You haven't done this in the already existing LdStOpt pass.

I imagine one benefit of merging the 32-bit loads/stores before register allocation is that it saves a register, right?

Right, good point.

JongwonLee added inline comments.Apr 12 2016, 3:53 AM
lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
72

But there are unscaled ld/st instructions whose mem scale is not 1.

169

Thanks for the example. But the mem scale is not always the same as the offset stride: the stride of a scaled ld/st is 1, while the stride of an unscaled one is the mem scale.

381

Is this format wrong? What does clang-format require?

lib/Target/AArch64/AArch64TargetMachine.cpp
114

I fixed it to be off by default.

test/CodeGen/AArch64/ssa-ldst-opt.ll
2

'aarch64-ssa-load-store-opt' is disabled by default.
An explanation of this test has been added.

JongwonLee updated this revision to Diff 53378.Apr 12 2016, 3:54 AM
JongwonLee edited edge metadata.

I'm happy with the changes, with the formatting fixes pointed out by @mcrosier. I'll leave the approval to @junbuml, though.

lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
73

Do'h, ignore me.

170

Right.

382

clang-format is a tool you run on the source to format it to a given standard. The default standard is LLVM's, so new code always adheres to the standard without much effort.

test/CodeGen/AArch64/ssa-ldst-opt.ll
3

Thanks!

Hi Jongwon
Thank you for the update with the additional alias check. However, since you merge the second load/store up into the first load/store, I believe you need to check whether any instruction between the first and second load/store may alias with the second one, not just the first store against the second load.

I'm also curious how this pass was motivated. Did you see any performance gain with this change?

Please, see my inline comments for minor issues.

lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
22–30

Please remove header files not used in this file. For example :

llvm/Support/raw_ostream.h 
llvm/IR/Constants.h
llvm/CodeGen/MachineTraceMetrics.h
llvm/CodeGen/TargetSchedule.h
81

It appears that you handle STURWi, STRWui, LDRWui, and LDURWi in this patch, so I don't think you need to keep all the other cases.

183–185

Is there a programmatic way to express the meaning of the constants you use here, instead of using them directly?

191

It would be good to add a comment noting that this function also inserts elements into the SmallVector in access order.

196

It would be good to make sure we are handling the instructions expected by this function, by adding assert() or some checks.

220–229

Looks like you can simplify these if statements. For example, "FirstMIBaseReg == SecondMIBaseReg" could be checked earlier, and if (FirstMIOffset + FirstMIOffsetStride == SecondMIOffset) and if (FirstMIOffset < SecondMIOffset) are redundant.

240

It will be good to make sure that MI and MergeMI are instructions expected in this function.

243

unsigned Reg = RegOp.getReg();

262

Can't you just do:
InsertionPoint = MI;

270

llvm_unreachable("Unexpected MI's opcode.");

273–276

NewOpc = ShouldBeUnscaled ? AArch64::LDURXi : AArch64::LDRXui;

331

Should this be an assert(), because you specifically handle AArch64::STURWi and AArch64::STRWui?

350

Same here. I think it should be an assert(), because you specifically handle AArch64::LDRWui and AArch64::LDURWi.

357–358

ConsecutiveStores and ConsecutiveLoads would be better.

360

You may want to do MBBI++ here instead of in line 326, or use a do/while loop.

370

Maybe assert() here as you specifically detect STURWi and STRWui.

414–421

Do we need to check this again? Or maybe assert() ?

425

Don't you also need to check whether any instruction between the first and second load aliases with the second load? The same check should be performed for stores.

455–456

Please change :

//    loads/stores
//    to 64-bit load/store.

into

//    loads/stores to 64-bit load/store.
481–482

Please use the full 80-column line width.

494–495

Can you change it to a range-based loop?

524

Please remove TRI if not used here.

JongwonLee marked 16 inline comments as done.Apr 14 2016, 3:58 AM

Hi Jongwon
Thank you for the update with the additional alias check. However, since you merge the second load/store up into the first load/store, I believe you need to check whether any instruction between the first and second load/store may alias with the second one, not just the first store against the second load.

I'm also curious how this pass was motivated. Did you see any performance gain with this change?

Please, see my inline comments for minor issues.

Hi junbum,
Thanks for the comments.

I think the alias check is needed for (first store, second load) and (second store, first load). I don't think we need to check any instructions other than these 4 (2 loads, 2 stores), because the first store is the only user of the first load and the second store is the only user of the second load.

(1) reg1 = load [mem1]
(2) reg2 = load [mem1 +4]
(3) store reg1, [mem2]
(4) store reg2, [mem2 + 4]

(3) is the only use instruction of reg1
(4) is the only use instruction of reg2

This work was motivated by some test cases. Even though it doesn't produce a performance gain in all cases, it has a positive effect in some. I'll share the data when ready.

lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
183–185

Fixed the code as below.

#define MAX_UNSCALED_OFFSET 255
#define MIN_UNSCALED_OFFSET 256
...
return UnscaledOffset <= MAX_UNSCALED_OFFSET && UnscaledOffset >= MIN_UNSCALED_OFFSET;
...

Is that OK?

331

Some test cases show that this condition is not satisfied when the opcode is AArch64::STURWi or AArch64::STRWui.

350

Some test cases show that this condition is not satisfied when the opcode is AArch64::LDRWui or AArch64::LDURWi.

370

Some test cases show that this condition is not satisfied when the opcode is AArch64::STURWi or AArch64::STRWui.

414–421

We need to check this. For example,

reg2 = load [mem1]
reg1 = load [mem1 +4]
store reg1, [mem2]
store reg2, [mem2 +4]

If we don't check, we will get the following code.

reg2 = wide-load [mem1]
wide-store reg1, [mem2]

The resulting code would use reg1 without a definition.

425

I think the alias check is needed for (first store, second load) and (second store, first load). I don't think we need to check any instructions other than these 4 (2 loads, 2 stores), because the first store is the only user of the first load and the second store is the only user of the second load.

(1) reg1 = load [mem1]
(2) reg2 = load [mem1 +4]
(3) store reg1, [mem2]
(4) store reg2, [mem2 + 4]

(3) is the only use instruction of reg1
(4) is the only use instruction of reg2
JongwonLee updated this revision to Diff 53684.Apr 14 2016, 3:58 AM
junbuml added inline comments.Apr 14 2016, 8:09 AM
lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
34

-256

112–115

return AA->alias(LocA, LocB);

415–422

I see your point. In that case, why don't you change it like:

reg2 = wide-load [mem1]
wide-store reg2, [mem2]
426

In order to merge the second load/store into the first, we need to make sure there are no memory operations aliasing with the second one. For example, in the code below there is a store between the loads. In this case we cannot move the second load up to the first without knowing that mem1 and mem3 do not point to the same address.

reg1 = load [mem1]
store reg3, [mem3 + 4]
reg2 = load [mem1 +4]

store reg1, [mem2]
store reg2, [mem2 + 4]
JongwonLee marked an inline comment as done.Apr 14 2016, 7:38 PM
JongwonLee added inline comments.
lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
415–422
reg2 = load [mem1]
reg1 = load [mem1 +4]

The above two loads can be the below one load.

reg2 = wide-load [mem1]
store reg1, [mem2]
store reg2, [mem2 +4]

But, the above two stores cannot be the below one store.

wide-store reg2, [mem2]

The correct form of wide-store should have reg1 as its source operand.

wide-store reg1, [mem2]
426

Thanks. You are right. I fixed the code to consider all aliases between the two loads and between the two stores.

JongwonLee updated this revision to Diff 53831.Apr 14 2016, 7:38 PM

Hi Jongwon,

Sorry it's taken a little long to revisit. Please see my inline comments.

lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
118

This function basically assumes that MIa and MIb are in the same basic block. But isn't it possible for MIa and MIb to be in different basic blocks?

124–137

I believe you could do something like this :

for (auto &MBBI : make_range(MIa->getIterator(), MIb->getIterator()))

145

When you check aliasing between the two stores, you should also check whether any load aliases with the second store, since you always move the second up to the first.

407–425

I don't think you need to have these alias checks specifically because you check alias between two loads and two stores below.

498

It seems that you don't handle volatile accesses. You could use isCandidateToMergeOrPair().

jmolloy requested changes to this revision.Apr 25 2016, 5:45 AM
jmolloy edited edge metadata.

Hi,

I'm confused about why you're doing this optimization here instead of much earlier in the compiler. This is basically memcpy idiom recognition - taking multiple 32-bit loads/stores and converting them into a llvm.memcpy intrinsic for perfect lowering should do the same job, and would work for all backends.

Have you investigated doing this much earlier (IR, pre-ISel?)

James

lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
2

This comment is mangled and has gone onto a new line. Please truncate it like all other header comments.

33

We use integer constants, not #defines.

This revision now requires changes to proceed.Apr 25 2016, 5:45 AM
JongwonLee marked 4 inline comments as done.Apr 27 2016, 3:04 AM

Hi,

I'm confused about why you're doing this optimization here instead of much earlier in the compiler. This is basically memcpy idiom recognition - taking multiple 32-bit loads/stores and converting them into a llvm.memcpy intrinsic for perfect lowering should do the same job, and would work for all backends.

Have you investigated doing this much earlier (IR, pre-ISel?)

James

Hi James.
This patch considers only AArch64, not other backends. Actually, the earlier-level work that can merge 32-bit loads/stores is IR-level SLP vectorization. Current SLP vectorization does not support 64-bit-wide packing, so I tried to extend its range. Please see the patches (http://reviews.llvm.org/D18237, http://reviews.llvm.org/D19151).

lib/Target/AArch64/AArch64SSALoadStoreOptimizer.cpp
118

It is possible for the load/store instructions to be in different basic blocks, but that case is not considered in this patch.

145

Thanks. I will fix this.

JongwonLee updated this revision to Diff 55179.Apr 27 2016, 3:31 AM
JongwonLee edited edge metadata.

Hi,

This patch considers only AArch64, not other backends. Actually, the earlier-level work that can merge 32-bit loads/stores is IR-level SLP vectorization. Current SLP vectorization does not support 64-bit-wide packing, so I tried to extend its range. Please see the patches (http://reviews.llvm.org/D18237, http://reviews.llvm.org/D19151).

OK, but this functionality is useful for other backends too. It isn't ideal in the long term to put generic functionality needlessly in target-specific areas.

If you're implementing this in the SLP vectorizer, why do you need to do it here as well?

Cheers,

James

OK, but this functionality is useful for other backends too. It isn't ideal in the long term to put generic functionality needlessly in target-specific areas.

If you're implementing this in the SLP vectorizer, why do you need to do it here as well?

64-bit SLP vectorization support and this patch overlap but are not identical; each has its own optimization opportunities, and this patch can supplement SLP vectorization. For example, when loop unrolling runs after SLP vectorization, we lose the chance to pack 32-bit loads/stores unless SLP vectorization runs again. This patch can catch opportunities missed due to the order of IR-level optimizations.

This patch was motivated by a performance improvement in a commercial benchmark.
However, it behaves differently on SPEC and the LLVM test-suite.
On SPEC, compile time increases by about 10%; on the LLVM test-suite, compile time decreases by about 0.02%. Execution time shows no regression in either suite (0.04% and 0.37% improvement, respectively). The data are averages of three runs.

Hi,

So why aren't you doing this in codegen prepare or dag combine?

Cheers,

James

Hi,

So why aren't you doing this in codegen prepare or dag combine?

Cheers,

James

Hi,
I haven't tried this work anywhere other than the backend.
As I mentioned before, it only targets AArch64, not other backends.
Are there any advantages to doing this work in CodeGenPrepare or DAG combine?

Hi,

So why aren't you doing this in codegen prepare or dag combine?

Cheers,

James

Hi,
I haven't tried this work anywhere other than the backend.
As I mentioned before, it only targets AArch64, not other backends.
Are there any advantages to doing this work in CodeGenPrepare or DAG combine?

Hi Jongwon,

Sure, there are advantages. Modifying IR is substantially easier and less prone to error than modifying machine instructions. This is also a generic optimization fixing a problem that likely affects more than just the AArch64 target, therefore the right thing to do is to implement it in such a way that it will benefit other targets (unless that causes a very high cost).

In this case it would seem quicker and easier, with less technical debt later, to implement this higher up in the compiler.

James

In this case it would seem quicker and easier, with less technical debt later, to implement this higher up in the compiler.

+1

mcrosier resigned from this revision.May 5 2016, 10:28 AM
mcrosier removed a reviewer: mcrosier.

Hi,

So why aren't you doing this in codegen prepare or dag combine?

Cheers,

James

Hi,
I haven't tried this work anywhere other than the backend.
As I mentioned before, it only targets AArch64, not other backends.
Are there any advantages to doing this work in CodeGenPrepare or DAG combine?

Hi Jongwon,

Sure, there are advantages. Modifying IR is substantially easier and less prone to error than modifying machine instructions. This is also a generic optimization fixing a problem that likely affects more than just the AArch64 target, therefore the right thing to do is to implement it in such a way that it will benefit other targets (unless that causes a very high cost).

In this case it would seem quicker and easier, with less technical debt later, to implement this higher up in the compiler.

James

Hi James
Thank you for the advice.
I'll try to do this at a higher level in the compiler.

Jongwon

JongwonLee abandoned this revision.May 10 2016, 1:50 AM