This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Improve and enable the SeparateConstOffsetFromGEP for AArch64 backend.
ClosedPublic

Authored by HaoLiu on Oct 19 2014, 11:39 PM.

Details

Summary

Hi Tim, Jingyue and other reviewers,

This patch is based on http://reviews.llvm.org/D5863, which fixes some problems in the SeparateConstOffsetFromGEP pass. So please apply that patch first if you want to try this one.

We find that LLVM cannot CSE GEPs (getelementptrs) well. A GEP can be very complex: it can have many mixed constant and variable indices, and the variable indices may themselves contain constants. Such complex GEPs are usually kept until CodeGen, but CodeGen can only see one basic block at a time. So for GEPs across basic blocks (e.g., two GEPs used in two different basic blocks, or one GEP used in two basic blocks), CSE opportunities may exist and be missed.

There is currently a pass called SeparateConstOffsetFromGEP, which can separate a constant out of a variable index and split a complex GEP into two GEPs. But this is not enough to solve the cross-basic-block GEP problem mentioned above.

So I improved this pass. It now separates constants out of indices for both sequential and struct types. Most importantly, it also transforms complex GEPs into a "ptrtoint + arithmetic + inttoptr" form, so that CSE opportunities can be found across basic blocks. It can also benefit the address sinking logic in CodeGenPrepare for those complex GEPs that cannot be matched by an addressing mode, as part of the address calculation can still be sunk; the address sinking logic can then produce better addressing modes. The EarlyCSE pass is run after this pass to do CSE. The LICM pass is also run to do loop-invariant code motion in case any of the address calculations are invariant.

If we don't find CSE opportunities after the arithmetic transformation, there is still no obvious performance regression, because CodeGen always performs this transformation anyway; we are just moving it several passes ahead of CodeGen.

I tested the performance on a Cortex-A57 AArch64 target with SPEC CPU2000 and SPEC CPU2006. There are no obvious regressions, and there are improvements on the following benchmarks:
spec2006.473.astar 4.7%
spec2006.444.namd 3.0%
spec2006.445.gobmk 2.5%

For the benchmarks that don't show an obvious improvement, we can still see from the assembly code that the address calculations and addressing modes are better.

For other targets like NVPTX, I cannot test this patch. I think it can also benefit performance there; at the least it should cause no regression.

Review please.

Thanks,
-Hao

Diff Detail

Event Timeline

HaoLiu retitled this revision from to [AArch64] Improve and enable the SeparateConstOffsetFromGEP for AArch64 backend..
HaoLiu updated this object.
HaoLiu edited the test plan for this revision. (Show Details)
HaoLiu added reviewers: t.p.northover, jingyue.
HaoLiu added a subscriber: Unknown Object (MLST).

And most important is that it will also transform complex GEPs into a "ptrtoint + arithmetic + inttoptr" form, so that it is able to find CSE opportunities across basic blocks.

You'll need to disable this transformation when AA is enabled in CodeGen. You can do this just like CGP does (essentially by calling TM->getSubtarget<TargetSubtargetInfo>().useAA()). CGP splits the GEPs into simpler GEPs when AA is in use, and uses inttoptr otherwise. This all relates to the way that we use BasicAA to override TBAA and catch cases of basic type punning -- but BasicAA does not analyze inttoptr, and so can no longer fulfill this need after new inttoptrs are introduced. You could try using simpler GEPs like CGP does when AA is in use, or just disable this part of the transformation for now. Either is fine with me.

In D5864#5, @hfinkel wrote:

You'll need to disable this transformation when AA is enabled in CodeGen. [...] You could try using simpler GEPs like CGP does when AA is in use, or just disable this part of the transformation for now. Either is fine with me.

Hi Hal,

Thanks for the explanation of useAA(). We do the transformation because merely replacing a complex GEP with simpler GEPs, as CGP does, has little impact. The transformation is not done in CGP because we want to make use of the constant extraction in the SeparateConstOffsetFromGEP pass, and we also want to run other passes such as EarlyCSE afterwards to remove common subexpressions. For performance reasons, I prefer to enable the transformation in the AArch64 backend.
But unfortunately, the AArch64 backend enables AA for both Cortex-A53 and Cortex-A57:

bool useAA() const override { return isCortexA53() || isCortexA57(); }

This was introduced recently in http://reviews.llvm.org/D5103. I find that most backends, including X86 and ARM, don't enable AA. The AArch64 backend also disables AA when the CPU is Cyclone. As Sanjin also noted in D5103 that enabling AA for Cortex-A57 produced no performance changes, I think we can simply remove Cortex-A57 from useAA(), so that we benefit from the transformation without regression.

For Cortex-A53, I haven't tested how much benefit comes from enabling AA in CodeGen. Maybe we can test the following configurations and find the best solution:

(1) disable AA, enable Transformation.
(2) enable AA, disable Transformation.
(3) enable AA and Transformation.

Anyway, such testing takes a lot of time, so I think we'd better only consider the A57 for now. I've attached a new patch (diff ID 15234) that follows your suggestion and removes the A57 from useAA().

Review please.

Thanks,
-Hao

Hi James,

I think there are two main reasons this approach is outside of CGP:

(1) I want to make use of the constant extraction logic in the SeparateConstOffsetFromGEP pass, which can find more constants.

(2) CGP only sinks GEPs; it doesn't do CSE. So I think it's better to do this separately in the SeparateConstOffsetFromGEP pass before CGP. This won't affect the address sinking logic in CGP.

Thanks,

-Hao

From: mankeyrabbit@gmail.com [mailto:mankeyrabbit@gmail.com] On Behalf Of James Molloy
Sent: Wednesday, October 22, 2014 4:01 AM
To: reviews+D5864+public+2de5fd757ea4c578@reviews.llvm.org
Cc: Hao Liu; t.p.northover@gmail.com; jingyue@google.com; llvm-commits@cs.uiuc.edu
Subject: Re: [PATCH] [AArch64] Improve and enable the SeparateConstOffsetFromGEP for AArch64 backend.

Hi Hao,

It's great to finally see this go upstream!

When you started working on this, the approach was to modify the GEP lowering logic in CGP. Hal's comment has reminded me that your approach has moved away from that, so could you please explain why you're doing this outside of CGP, and why your solution does not duplicate the CGP lowering code?

How do they interact?

Cheers,

James

----- Original Message -----

From: "Hao Liu" <Hao.Liu@arm.com>
To: "Hao Liu" <Hao.Liu@arm.com>, "t p northover" <t.p.northover@gmail.com>, jingyue@google.com
Cc: hfinkel@anl.gov, "amara emerson" <amara.emerson@arm.com>, llvm-commits@cs.uiuc.edu
Sent: Wednesday, October 22, 2014 2:07:12 AM
Subject: Re: [PATCH] [AArch64] Improve and enable the SeparateConstOffsetFromGEP for AArch64 backend.


Hi Hao,

This is not quite what I had in mind... when I said that we should disable the transformation, I did not mean that you should disable the entire pass, but rather, that you should only disable the part of SeparateConstOffsetFromGEP that breaks apart GEPs into ptrtoint/inttoptr when useAA is true like CGP does. Does SeparateConstOffsetFromGEP not convey benefits without this part of the transformation? Can you try implementing the same thing by breaking the GEPs into simpler ones like CGP does when useAA is true?

As a general note, while I understand why CGP breaks GEPs into ptrtoint/inttoptr, I'd like to move the generic CodeGen passes away from doing that. If you look at CGP, implementing this in terms of GEPs is pretty straightforward. When testing on x86 using -addr-sink-using-gep (even with useAA false), performance generally improves over using the ptrtoint/inttoptr (although there are a couple of regressions I need to investigate when I have some time). As a quasi-separate issue, I'm curious to know what using -addr-sink-using-gep does on ARM/AArch64. Regardless, I'm excited to see this work because I'd like to make SeparateConstOffsetFromGEP available to more backends -- including PowerPC that uses AA more aggressively -- and so I don't want the pass itself to introduce ptrtoint/inttoptr (at least not when useAA is true).

Thanks again,
Hal

Review please.

Thanks,
-Hao

http://reviews.llvm.org/D5864

Sorry, the wrong patch was attached; it should check !useAA(). I've added another patch to fix this problem.

This is not quite what I had in mind... when I said that we should disable the transformation, I did not mean that you should disable the entire pass, but rather, that you should only disable the part of SeparateConstOffsetFromGEP that breaks apart GEPs into ptrtoint/inttoptr when useAA is true like CGP does. Does SeparateConstOffsetFromGEP not convey benefits without this part of the transformation? Can you try implementing the same thing by breaking the GEPs into simpler ones like CGP does when useAA is true?

Hi Hal,

I think I understand your point now. I've attached a new patch. It does the arithmetic transformation only when useAA() returns false; otherwise it transforms a complex GEP into several simpler GEPs, like CGP does.
Correctness has been tested. The performance tests have not finished yet, but the assembly code looks similar to the old patch's for Cortex-A57. If the result is better than the old patch's, I'll remove the logic that transforms to ptrtoint/inttoptr.

Review please.

Thanks,
-Hao

This generally looks good to me. It would be good to have someone who is an expert in this pass look it over as well.

the performance test has not finished yet. But the assembly code looks similar to the old patch for cortex-a57. If the result is better than the old patch, I'll remove the logic about transformation to ptrtoint/inttoptr.

I am curious how this turns out; please let us know.

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
86

Even we can extract constant from GEPs -> Even if we can extract constants from GEPs

88

algorithm

984

I think that if the original GEP has the "inbounds" property, then the shl, mul here (and the same below plus the add in the ptrtoint/inttoptr code) can have the nsw/nuw flags.

HaoLiu updated this revision to Diff 15380.Oct 24 2014, 1:15 AM

Hi Hal,

A new patch has been attached according to your comments.
BTW, for the inbounds issue, CGP just ignores it when creating the new GEP/MUL/ADD, so I think it may be a minor problem in CGP as well.

In D5864#17, @hfinkel wrote:

This generally looks good to me. It would be good to have someone expert in this pass to also look it over.

the performance test has not finished yet. But the assembly code looks similar to the old patch for cortex-a57. If the result is better than the old patch, I'll remove the logic about transformation to ptrtoint/inttoptr.

I am curious how this turns out; please let us know.

I think this is because ptrtoint/inttoptr arithmetic is similar to a GEP. According to http://llvm.org/docs/GetElementPtr.html#how-is-gep-different-from-ptrtoint-arithmetic-and-inttoptr, the main difference is that a GEP carries additional pointer aliasing rules. Also, since the AA used in CodeGen doesn't change much, the assembly code is similar. It's just my guess; the two versions still differ considerably in instruction order.

Thanks,
-Hao

----- Original Message -----

From: "Hao Liu" <Hao.Liu@arm.com>
To: "Hao Liu" <Hao.Liu@arm.com>, "t p northover" <t.p.northover@gmail.com>, jingyue@google.com,
ssijaric@codeaurora.org, mcrosier@codeaurora.org
Cc: hfinkel@anl.gov, "amara emerson" <amara.emerson@arm.com>, llvm-commits@cs.uiuc.edu
Sent: Friday, October 24, 2014 3:25:14 AM
Subject: Re: [PATCH] [AArch64] Improve and enable the SeparateConstOffsetFromGEP for AArch64 backend.

Hi Hal,

A new patch has been attached according to your comments.
BTW. For the inbounds issue, the CGP just ignores it when creating
new GEP/MUL/ADD, so I think maybe it is a minor problem in CGP.

FWIW, we only recently started propagating NSW/NUW into the SelectionDAG layer. We should add a FIXME (or just fix this) in CGP too.

In D5864#17, @hfinkel wrote:

This generally looks good to me. It would be good to have someone
expert in this pass to also look it over.

the performance test has not finished yet. But the assembly code
looks similar to the old patch for cortex-a57. If the result is
better than the old patch, I'll remove the logic about
transformation to ptrtoint/inttoptr.

I am curious how this turns out; please let us know.

I think this is because ptrtoint/inttoptr is similar to GEP.
According to
http://llvm.org/docs/GetElementPtr.html#how-is-gep-different-from-ptrtoint-arithmetic-and-inttoptr,
the most difference is that GEP carries additional pointer aliasing
rules. Also as the AA used in CodeGen doesn't change to much thing,
so the assembly code is similar. It's just my guess. Two version
still has much difference in instruction order.

Yep; I recommend testing the difference independently of enabling AA. (In CGP, as I imagine you noticed, I also have a separate flag for this purpose).

-Hal

Thanks,
-Hao

http://reviews.llvm.org/D5864

jingyue edited edge metadata.Oct 24 2014, 1:00 PM

I just noticed I was put on this. Sorry for the delay, and I'll review it today.

Jingyue

HaoLiu updated this revision to Diff 15475.Oct 27 2014, 1:28 AM
HaoLiu edited edge metadata.

The performance results show that disabling useAA for Cortex-A57 gives more improvement. The improvements over the latest trunk with this patch on Cortex-A57 are as follows:
Version 1, useAA() enabled for Cortex-A57:

spec.cpu2006.ref.444_namd	-4.56%
spec.cpu2006.ref.473_astar	-3.79%
spec.cpu2006.ref.445_gobmk	-2.91%

Version 2, useAA() disabled for Cortex-A57:

spec.cpu2006.ref.473_astar	-4.84%
spec.cpu2006.ref.444_namd	-4.39%
spec.cpu2006.ref.445_gobmk	-2.80%

Both versions show an obvious benefit, but version 2's result on astar is much better than version 1's. So I've attached a patch removing Cortex-A57 from useAA(). useAA() is beneficial for in-order targets, but it shows no obvious improvement on out-of-order CPUs; currently Cyclone and x86 don't use it either. So I think we don't need useAA for Cortex-A57.

Hi James,

From: mankeyrabbit@gmail.com [mailto:mankeyrabbit@gmail.com] On Behalf Of James Molloy
Sent: Monday, October 27, 2014 11:38 PM
To: reviews+D5864+public+2de5fd757ea4c578@reviews.llvm.org
Cc: Hao Liu; Tim Northover; ssijaric@codeaurora.org; Chad Rosier; Jingyue Wu; LLVM Commits
Subject: Re: [PATCH] [AArch64] Improve and enable the SeparateConstOffsetFromGEP for AArch64 backend.

Hi Hao,

I'm not sure about this. Firstly I have concerns about the statistical significance of the difference in results - we've seen swings of +/- 1% on A57, and here we're talking about a 0.5% difference. Are you satisfied that the difference is genuine?

[Hao Liu] I'm also curious about the result. I suspected the two versions shouldn't differ much, but I think the difference is genuine. The astar benchmark is quite stable and has never shown variance near +/- 1%.

Secondly, I don't see a good reason why useAA should be worse than !useAA. To my mind, if there is a regression when useAA is enabled, it should be one of two things:

  1. A legitimate regression, in which case let's fix useAA!
  2. Random code permutation outside of the compiler's model. If this is the case we shouldn't be adjusting the compiler to compensate.

useAA has the potential to improve codegen. We're not really using it at the moment, but it's good that it's available to use in the future. If we rip it out, we should have a good reason for it, I think!

Can I suggest that you make the initial commit, which improves performance, then have a look at why !useAA is better and fix that incrementally?

[Hao Liu] OK, I’d like to do that. Thanks.

Cheers,

James

A new patch adds several lines reviewed in D5863 and does some code reformatting to make it more readable.

Thanks,
-Hao

My general concern is that the part extracting a constant struct field index seems wrong. Unlike arrays, where every element shares the same type, structure fields can have different types. Therefore, changing a field index may change the intermediate type in the pointer arithmetic, producing incorrect IR.

For example, consider

struct S {
  int a[4]; // sizeof(int) = 4
  double b[8]; // sizeof(double) = 8
};

%s = alloca struct %S
%p = getelementptr %s, 0, 1, i ; &s.b[i] = s + 4*sizeof(int) + i*sizeof(double)

If you extract the 1, the gep becomes:

%p = getelementptr %s, 0, 0, i; &s.a[i] = s + i*sizeof(int)

and the constant offset is 4 * sizeof(int) = 16.

As you probably see already,

s + 4*sizeof(int) + i*sizeof(double) != s + 4*sizeof(int) + i*sizeof(int)

Does this make sense to you? What is a good way to fix this?

I also have a bunch of nit picks inlined.

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
88

Nice explanation! However, in the final version, you may want to merge these improvements into the old comments. You can mention the improvements in the summary of the revision.

185

//, or null if we cannot extract a constant offset.

319

typo:
the the given => the given
a constant offsets => a constant offset
wheter => whether
depends on => depending on
use => uses
several simpler GEPs or arithmetic operations

331

a better name would be transformToArithmetics because you are using not only adds but also muls

364

Add a comment on what SimplifyGEP means.

642
if (ConstantOffset == 0)
  return NewIdx;
// ...
return Extractor.rebuild...
878

getZExtValue() returns uint64_t.

1019

The comment here is confusing. It doesn't seem to explain what the following code does.

1021

Why name it *Other*Const?

Hi Jingyue,

Does this make sense to you? What is a good way to fix this?

Yes, your concern is reasonable. So to extract an index of a struct type, we should not keep the original GEP form with 3 indices.
For your case "%p = getelementptr %s, 0, 1, i", we can transform it into either of two simpler forms:
1st Form: 2 simpler GEPs

%i_offset = mul %i, sizeof(double)
%s_i8 = bitcast %s to i8*
%s_i = getelementptr i8* %s_i8, %i_offset
%s_i_c = getelementptr i8* %s_i, 4*sizeof(int)
%result = bitcast %s_i_c to double*

Bitcasts can be ignored; the original GEP with 3 indices is transformed into 1 MUL/SHL and 2 simpler GEPs with 1 index each. If we find a constant within the index "%i", it can also be folded into the last GEP.

2nd Form: ptrtoint+arithmetic+inttoptr

%ptrint = ptrtoint %s to i64
%i_offset = mul %i, sizeof(double)
%s_i = add %ptrint, %i_offset
%s_i_c = add %s_i, 4*sizeof(int)
%result = inttoptr %s_i_c to double*

This is similar: ptrtoint/inttoptr can be ignored, and we transform the original GEP with 3 indices into 1 MUL/SHL and 2 ADDs.

The main benefits are:
(1) We can always extract the indices of struct types.
(2) We can easily do CSE on the simpler GEPs and MUL/SHL/ADDs.

This transformation is similar to the address sinking logic in the CGP pass, which also transforms a GEP into one of these forms depending on useAA(). For correctness and performance, several benchmarks have been tested on the AArch64 backend. I cannot test NVPTX, but I think test cases with complex address calculations can also benefit.

Thanks,
-Hao

HaoLiu updated this revision to Diff 15536.Oct 28 2014, 7:59 PM

Add a new patch refactored according to Jingyue's comments.
Review please.

Thanks,
-Hao

Regarding the correctness issue with struct field indices, I think the solutions you proposed make sense. I look forward to your fix in the next diff. We should choose which solution to go with depending on useAA.

The logic looks good in general except the field index issue. My inline comments are mostly on refactoring.

lib/Target/AArch64/AArch64TargetMachine.cpp
208

I'd explain why running GEP reassociation can expose more opportunities for LICM. I guess only the split part (splitting a GEP with multiple indices into a set of smaller GEPs) buys you such opportunities. Right?

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
82

Remove superfluous blank lines.

83

// Another improvement enabled by the SimplifyGEP flag is ...

93

I don't understand reason 2. Can you explain more? Are you saying, given a[i][j1] and a[i][j2], you can reuse the common part &a[i]?

365

What would be a good name? You are not really *simplifying* but to some extent *complicating* GEPs :)

Does LowerGEP sound good to you? You are *lowering* a GEP into multiple uglygeps (btw, uglygep is a known concept in LLVM) or arithmetic operations which are closer to machine code.

Make sure the comments are consistent after renaming.

857

Similarly, this function should be merged into accumulateByteOffset.

889

Can we merge the two transformToXXX functions? Their logic is pretty similar.

988

Looks quite similar to splitGEP. Can we massage this into splitGEP, and guard the part that lowers GEP with if (SimplifyGEP)?

1016

OK, now I see what you're doing here. According to our discussion in D5863, this looks like a temporary work around for the CFGSimplification issue, and will eventually go away. I don't think it's a good idea to hack this pass to work around that issue. I'd rather run instsimplify before SeparateConstOffset in the AArch64 backend, and add a FIXME there saying once the CFGSimplification issue is fixed we needn't run instsimplify any more.

1050

After extractConstantAndSimplifyGEP is merged into splitGEP, the code here can be simplified.

HaoLiu updated this revision to Diff 15560.Oct 30 2014, 2:10 AM

Attach a new patch refactored according to Jingyue's comments. Review please.

Thanks,
-Hao

Regarding the correctness issue with struct field indices, I think the solutions you proposed make sense. Look forward to your fix in the next diff.

Hi Jingyue,

I'm just a bit confused about the "fix" you mean. Did you mean the solution for extracting a constant from a structure index? That solution is already in the diff.
I'm only unsure about the "a = 1+2" problem, and I've replied to your comment. The other problems I've fixed in the attached diff.

Thanks,
-Hao

lib/Target/AArch64/AArch64TargetMachine.cpp
208

Yes. The whole GEP may not be invariant, but after splitting, we can move the invariant part and leave the variant part in the loop.

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
93

Yes, GEPs like "getelementptr %ptr, i, j1" and "getelementptr %ptr, i, j2" cannot be eliminated. But once they are transformed into simpler forms, "i" and "j1"/"j2" are separated, and we can find the common subexpressions involving "i".

365

That makes sense.

857

Makes sense. It can be controlled by LowerGEP.

889

I think we'd better not merge them. Their logic looks similar, but they do quite different things. If we merged them, we would need to add checks like "if (TransformToSimplerGEP)" in many places (from the first for loop to the end of the function). That would make the code quite hard for other people to understand.

1016

I just don't think this is a temporary workaround. We cannot assume there is no situation like "a = 1+2". People can create a test case with "a = 1 + 2" and feed it to this pass; that is, people can run this pass without any other passes. So if we don't accumulate all non-zero indices here, we'll get an incorrect result.

We want to "remove the constant offset in each GEP index" and "split this GEP index into a variadic part and a constant offset" in this for loop. So the later transformation part doesn't consider non-zero constant indices; it assumes every index is either variable or zero. Note that I only check "if (!isa<ConstantInt>(GEP->getOperand(I)))". So if we leave behind an index of "constant 1", it will be ignored and we'll generate an incorrect result. That's why I want to make sure we have accumulated all the non-zero constant indices.

I mean the issue we spotted previously with the following example:

struct S {
  int a[4]; // sizeof(int) = 4
  double b[8]; // sizeof(double) = 8
};

%s = alloca struct %S
%p = getelementptr %s, 0, 1, i ; &s.b[i] = s + 4*sizeof(int) + i*sizeof(double)

I thought we agreed that your diff had a correctness issue with this example, and I didn't see any change to fix that issue in your latest diff. Am I missing something?

This comment was removed by HaoLiu.
In D5864#35, @jingyue wrote:

I thought we agreed that your diff had a correctness issue with this example, and I didn't see any change to fix that issue in your latest diff. Am I missing something?

Sorry, my words misled you. When I said "we can transform it into either of two simpler forms", I meant that this patch had already done so.
The correctness issue was already solved in the first patch. See transformToSimplerGEPs() for details; it does exactly what I explained. Transformation to simpler GEPs is similar to transformation to arithmetic operations; it just replaces the ADDs with GEPs. So your concern about the correctness issue has already been resolved. I also have two test cases covering structure indices.

Thanks,
-Hao

I see what's going on: you only extract constant offsets from array indices, and leave struct indices to those transformXXX functions. Thanks for the clarification! I think that's worth a comment though, and I will point that out in my inline comments.

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
82

The word "simpler" is overloaded. I'd explicitly say something like:

// to lower a GEP with multiple indices into multiple GEPs with a single index or arithmetic operations ...

Is this more accurate?

102

The sentence "another simpler form of simpler GEPs is similar" is a little awkward. The word "simple" is very overloaded. What do you mean exactly by "simpler"? Consider using multiple sentences and/or an example to explain what this lowering to GEPs exactly mean.

125

This => Lowering GEPs

126

The word "optimize" is overloaded. Is hoist/sink a better word?

131

Such transformation can split it => Lowering GEPs can split such GEPs

184

You may want to say Find looks for constant offsets in both array indices and field indices.

Because the signature of Extract is changed, you need to remove the comment "The meaning of the arguments and the return value ..." and explicitly explain what its arguments and return value mean.

313

I don't like the word "transform" and "simpler". To be more accurate, I'd use lower instead of transform. What do you exactly mean by "simpler"? Are you lowering a GEP with multiple indices to multiple GEPs with a single index? If that's the case, I'd say that explicitly in order to be more accurate.

319

ditto

323

will find => finds
will only find => only finds
else => ; otherwise,

324

none zero => non-zero

345

/// Whether to transform

713

If you buy that transformXXX should run after setIsInBounds(false), you needn't check inbound in this function.

717

These are essentially byte offsets instead of indices, because they are scaled by element sizes. Do we have a better name? ByteOffsetOfIndex?

717

Actually, is this vector even necessary? Can you build the resultant uglygep along the way?

719

vairable => variable

721

Why checking whether it's a constant? Shouldn't you check whether it is an array index? My understanding is, the offsets of structure indices (which are guaranteed to be constant) and the constant offsets of array indices are already counted in accumulativeByteOffsets, and now you only need to rebuild the variadic part of the array indices.

726

{ at the end

730

} else {

744

The meaning of "simpler" here is unclear.

779

ditto. See Line 721

784

ditto. curly brackets.

797

{

801

}

827

Why do we need to check LowerGEP here? My understanding is, even when LowerGEP is set, +AccumulativeByteOffset will still become part of the lowered GEPs or arithmetic (see how transformToSimplerGEPs creates a GEP with offset AccumulativeByteOffset, and how transformToArithmetics creates an add with offset AccumulativeByteOffset). Shouldn't we still check whether this offset can be nicely folded into an addressing mode?

836

// ... in each sequential index

Emphasizing that you only remove constant offsets from sequential indices is important, because the behavior here is slightly different from accumulateByteOffset, which looks into both sequential and struct indices.

You also need to explain why removing constant offsets from struct field indices here is dangerous, and mention that the transformXXX functions will take care of that.

863

Running these transformXXX functions before GEP->setIsInBounds(false) seems wrong to me, because transformXXX may treat out-of-bounds GEPs as inbounds and add nsw flags incorrectly.

866

"such form" is too obscure

889

The more I look at these two functions, the more strongly I believe we should merge them. In particular, after you get rid of ResultIndices, you probably only need to check LowerGEP once in the merged function. No?

1016

I'm confused. Check my comment on the line

if (!isa<ConstantInt>(GEP->getOperand(I)))

and if my comment there makes sense I'll come back here.

HaoLiu updated this revision to Diff 16027.EditedNov 10 2014, 10:28 PM
In D5864#39, @jingyue wrote:

I see what's going on: you only extract constant offsets from array indices, and leave struct indices to those transformXXX functions. Thanks for the clarification! I think that's worth some comments though, and I will point that out in my inline comments.

Hi Jingyue,

Thanks for your detailed comments, which are really very helpful. I've refactored the patch and attached.
Review please.

Thanks,
-Hao

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
721

Yes, you are right. I'll add comments to explain this.

827

I think that even when a pointer plus a constant index cannot be matched by an addressing mode, it is still worth lowering a GEP with multiple indices, because we can do optimizations on the variable indices. Also, if the constant index is extracted from multiple indices, the result looks better than the original GEP.
If I remove this check, some test cases fail. So it is kept when LowerGEP is disabled.

jingyue added inline comments.Nov 11 2014, 10:56 AM
lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
22

// We cannot eliminate the partially common subexpression getelementptr [10 x %struct]* %ptr, i64 %i

184

I don't see the first part of this comment addressed. I'd like to emphasize "both array indices and field indices".

720

Not critical, but I slightly prefer getZExtValue() because a field index is non-negative and getElementOffset takes unsigned.

721

I wasn't just proposing a comment change. I meant the source code should read something like

if (isa<SequentialType>(GTI))

You are building new GEPs/arithmetics for the remainder of the array indices, and checking whether the indexed type is a SequentialType matches your intention much better.

I also believe checking for SequentialType instead of ConstantInt solves your previous question on handling "a = 1 + 2". I'll go back to that later.

827

That makes some sense. Out of curiosity, which test would fail?

853

// we always don't => we don't remove struct field indices here

"always don't" is not the right grammar. Besides, you do extract field indices later, so I'd weaken the word "never".

854

diable => LowerGEP is disabled

855

enable LowerGEP => LowerGEP is enabled

856

I wrote lowerToXXX in my reviews just for simplicity. However, in the source code, we should still use the full names.

889

I don't see this comment addressed.

jingyue requested changes to this revision.Nov 11 2014, 10:56 AM
jingyue edited edge metadata.
This revision now requires changes to proceed.Nov 11 2014, 10:56 AM
Gerolf added a subscriber: Gerolf.Nov 11 2014, 10:59 AM

Hi Hao,

could you share your current performance data + options you use? With your patch I see small gains on astar + xalancbmk, but a ~4% loss on libquantum. The data is based on the 11/07 trunk on top of r221549, O3 + LTO, cyclone, ref input.
In addition to cortex, could you run on x86 also?

Thanks
Gerolf

In D5864#45, @Gerolf wrote:

Hi Hao,

could you share your current performance data + options you use? With your patch I see small gains on astar + xalancbmk, but a ~4% loss on libquantum. The data is based on the 11/07 trunk on top of r221549, O3 + LTO, cyclone, ref input.
In addition to cortex, could you run on x86 also?

Thanks
Gerolf

Hi Gerolf,

I use "-O3 -ffast-math -mcpu=cortex-a57" and ref input. My last run was on the 10/24 trunk on top of r220484 (to ignore the impact of trunk changes, I always run and compare against this specific revision). I've run the performance test about 4 times on this revision. To filter out noise and variance, I removed the benchmarks that don't appear in every performance run. With that, I find no obvious regressions and stable improvements on the following 3 benchmarks (these are the results from one run; other runs have similar results):

SPEC2006.473.astar          -4.84%
SPEC2006.444.namd           -4.39%
SPEC2006.445.gobmk          -2.80%

The percentage number is calculated by: 100% - current_time / reference_time.

It's quite strange that your results differ so much from mine on Cyclone. libquantum is a quite stable benchmark on our test infrastructure, and I have never seen its performance change with my patch. So I had assumed my patch doesn't affect its performance.

I cannot test Cyclone, so I ran on x86 with today's trunk, with and without my patch (I enabled the SeparateConstOffsetFromGEP pass in the x86 backend, similarly to the AArch64 backend). I've run libquantum about 8 times. This benchmark is not stable on my x86 machine: the results vary by +/-1% around the average, with or without my patch. But the average results are similar either way, so I think my patch doesn't affect the performance of libquantum on x86. I've also started a performance run over all the int/fp benchmarks in SPEC 2006, but I haven't got the results yet, as a full run takes quite a long time. With luck, I'll get the results tomorrow and show them to you.
So I wonder whether libquantum is stable on Cyclone. If it is not stable, maybe you can run it several times to filter out the variance. Or, if it is a real regression, we can then investigate it.

Do you have more details about the results? I'm interested in how much improvement you see on astar. I'm also interested in why there is no improvement on namd and gobmk; they are supposed to get some benefit from this patch.

Thanks,
-Hao

Hi Gerolf,

Why do you want to see the results on x86? SeparateConstOffsetFromGEP is
only used by NVPTX and AArch64 so far. I don't think it will affect the
compilation time or the compiled code on x86.

Jingyue

HaoLiu edited edge metadata.

Hi Jingyue,

I've attached a new patch according to the review comments.
Review please.

Thanks,
-Hao

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
184

I think "both array indices and field indices" means all indices. We don't need to explain this common behavior; it is quite obvious. I think we need comments only if we look at just one kind of index ("array indices" or "field indices"), or do something special.
Anyway, the new comment will be like the comment for Extract().

720

Aha, previously I used getZExtValue(). You commented that I should change it to getSExtValue(). I thought it was not critical, so I modified it.
Now it seems I should roll back...
: )

721

If we check for SequentialType, we also need to check whether the operand is a ConstantInt. Then the code will look like:

if (isa<SequentialType>(GTI))
  if (!isa<ConstantInt>(GEP->getOperand(I)))
     ...
  else
     AccumulativeByteOffset += Const;

This looks more complicated.

About "a = 1 + 2", I think handling it when extracting constants from indices is better. Then we don't need to add code to both lowerToXXX functions. Also, if we handled it in the lowerToXXX functions, the logic of the original splitGEP (i.e., when LowerGEP is disabled) still could not handle this problem.

827

Oh, interesting. I assumed some of your test cases in test/Transforms/SeparateConstOffsetFromGEP/NVPTX would fail.
But I tried removing this check and ran "make check", and there are no test failures. So it seems there is no such test case...

889

I still have doubts about your suggestion. Diff 16027 gets rid of ResultIndices, but the two functions are still quite different, and it is difficult to combine them.
Lowering to arithmetic operations starts by creating a ptrtoint, then creates ADD/MUL/SHL instructions, and ends by creating an ADD and an inttoptr.
Lowering to GEPs starts by creating a bitcast, then creates MUL/SHL/GEP instructions, and ends by creating a GEP and a bitcast.
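For reference, the shapes of the two lowerings can be sketched in C++ (the function names and signatures below are illustrative only, not the pass's actual API); both compute the same address, differing only in whether they stay in pointer form or round-trip through an integer:

```cpp
#include <cassert>
#include <cstdint>

// Variadic part of: getelementptr [32 x float]* %p, i64 %i, i64 %j
// plus the already-computed constant AccumulativeByteOffset.

// Mirrors lowering to single-index GEPs: bitcast to i8*, one
// single-index GEP per variable index, a final GEP adding the
// constant offset, then bitcast back to the result type.
float *viaSingleIndexGEPs(float (*p)[32], int64_t i, int64_t j,
                          int64_t accumulativeByteOffset) {
  char *ptr = reinterpret_cast<char *>(p);
  ptr += i * static_cast<int64_t>(sizeof(float[32]));
  ptr += j * static_cast<int64_t>(sizeof(float));
  ptr += accumulativeByteOffset;
  return reinterpret_cast<float *>(ptr);
}

// Mirrors lowering to arithmetics: ptrtoint, a MUL/ADD chain, a
// final ADD of the constant offset, then inttoptr.
float *viaArithmetics(float (*p)[32], int64_t i, int64_t j,
                      int64_t accumulativeByteOffset) {
  uintptr_t addr = reinterpret_cast<uintptr_t>(p);
  addr += static_cast<uintptr_t>(i * sizeof(float[32]));
  addr += static_cast<uintptr_t>(j * sizeof(float));
  addr += static_cast<uintptr_t>(accumulativeByteOffset);
  return reinterpret_cast<float *>(addr);
}
```

In IR terms, the first sketch corresponds to i8* GEPs plus bitcasts and the second to a ptrtoint/mul/add/inttoptr chain, which is why merging the two code paths seems plausible.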

1016

I think this is the right place to do this. As the comments say, we want to "Remove the constant offset in each GEP index", so we shouldn't leave constant offsets behind in the indices here.
Also, if we fixed this in the following lowerToXXX functions, the logic of the original splitGEP (i.e., when LowerGEP is disabled) still could not handle this problem.

A general suggestion: I guess you upload your diff using Phab's web interface. If so, I suggest you upload patches with full context (http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface) from now on or use arc which uploads full context by default. It's a little difficult to review a diff with only 3-line contexts and Phab seems to have a little trouble computing the diff between two diffs with limited context (at least the line numbers don't match in many places).

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
184

That makes sense.

720

No, I don't think I suggested you change it to getSExtValue(). When I said "getSExtValue() returns uint64_t", I expected you to write

uint64_t Field = ...->getZExtValue();

because I slightly prefer delaying the truncation, which loses precision. But anyway, I'm fine with "unsigned".

721

I'm confused. How does AccumulativeByteOffset in your snippet come up here? I think AccumulativeByteOffset is computed before calling transformToSimplerGEPs. Why do you bother changing that again?

In case I haven't made this clear, I never expect Find/Extract always extracts all constant offsets. For example, given a = 1 + 2, I'm totally fine if the Find function finds only 2, and in that case the remaining code should treat 1 just as a variable remainder. The GEP after extraction may look like "gep base, i1, 1, i2" which may not be the optimal form, but again I am fine with that because it's sound. If we want to get the optimal form, I am looking for either a more comprehensive way (not just handling tricky cases like adding two constants; your initial diff might serve as a good start but the code was too complicated) or reassociating each index before this pass so that all constant offsets in the index are merged (which I suppose is the approach we want to go with).

With that, do we still need the nested "if (!isa<ConstantInt>)" check? Do you have an example where simply checking isa<SequentialType> isn't enough? I think a concrete example may help make things clearer.

827

That was my suspicion. This check on the addressing mode is a very weak check: I believe both NVPTX and AArch64 support reg+<a potentially very large offset>, so I would be surprised that having/not having this check would cause tests to fail.

You explained pretty well why it's still worth lowering a GEP even if addressing-mode matching fails. I'd like to see that in the source code comments. Thanks!

889

Anyway, I'll probably do that by myself after this diff is in.

HaoLiu updated this revision to Diff 16190.Nov 13 2014, 6:58 PM
HaoLiu edited edge metadata.

Hi Jingyue,

Nice suggestion. It really helps a lot for the review.

I've attached a new patch with full context. Review please.

Thanks,
-Hao

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
720

Oh, I double-checked that, and I had indeed misunderstood your meaning. Last time I used
"unsigned Field = xxx->getZExtValue()", not "uint64_t Field = xxx->getZExtValue()".
Sorry for the noise.
Anyway, diff 16135 has already fixed it.

721

Yeah, I understand your meaning. We just treat the "1" as a remainder. It's sound, because the logic of Find/Extract is to handle only one constant and return. That's reasonable.
Previously I focused on making it optimal, so I wanted to extract all the constants. Anyway, this is just a corner case and may be fixed in the future. I'm fine with that.
BTW, there is already a test case, @test_const_add, checking that we generate a correct result for "a = 1 + 2".

889

OK, thanks.

jingyue accepted this revision.Nov 14 2014, 9:14 AM
jingyue edited edge metadata.

LGTM with some minor fixes. You may want to double check with ARM folks on performance. I remember Gerolf said he couldn't get the same performance numbers as you did.

Thanks again for all the hard work!

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
148

Lower => Lowering

150

lower => we lower

154

lower => we lower

155

better => a better

333

single index => a single index

334

The grammar of the sentence "It has been extracted a constant offset" is incorrect.

/// Lower a GEP with multiple indices into multiple GEPs with a single index.
/// Function splitGEP already split the original GEP into a variadic part and a
/// constant offset (i.e., AccumulativeByteOffset). This function lowers the
/// variadic part into a set of GEPs with a single index and applies
/// AccumulativeByteOffset to it. 
/// GEP The variadic part of the original GEP
/// AccumulativeByteOffset ...

I'd even like to change the name of the argument GEP to Variadic, because GEP appears so many times throughout this file and usually refers to the original GEP.

343

ditto

348

we can find => we successfully find

My original comment wasn't clear enough, and I think "we successfully find" is better.

370

single index => a single index

850

The grammar of "So we don't such check" is wrong. => Therefore, we don't check for addressing modes in that case.

865

The following xxx and yyy => lowerToSingleIndexGEPs or lowerToArithmetics will later

This revision is now accepted and ready to land.Nov 14 2014, 9:14 AM
In D5864#45, @Gerolf wrote:

Hi Hao,

could you share your current performance data + options you use? With your patch I see small gains on astar + xalancbmk, but a ~4% loss on libquantum. The data is based on the 11/07 trunk on top of r221549, O3 + LTO, cyclone, ref input.
In addition to cortex, could you run on x86 also?

Hi Gerolf,

I ran SPEC2006 4 times on my x86 machine, but the results fluctuate too much, so I cannot tell whether enabling this pass helps or not. I think it is because I don't have much experience with testing on x86.

For the AArch64 backend, I recently ran twice, on r222083@trunk and r222017@trunk, on Cortex-A57. The results are similar to the previous results, but there is an approximately 1%-2% regression on 471.omnetpp. That benchmark always fluctuates, so I can't measure it accurately. Anyway, we have many more improvements than regressions.

Thanks,
-Hao

Did you try pinning the benchmark on one CPU? Scheduling can significantly
disturb the benchmark results. Also, if your CPU has any boosting feature
(similar to Intel's turbo boost), try disabling that. One of our folks,
Mark, found that can make benchmarking sensitive too.

Jingyue

In D5864#58, @jingyue wrote:

Did you try pinning the benchmark on one CPU? Scheduling can significantly
disturb the benchmark results. Also, if your CPU has any boosting feature
(similar to Intel's turbo boost), try disabling that. One of our folks,
Mark, found that can make benchmarking sensitive too.

Jingyue

I think SPEC2006 benchmarks always run on one CPU, so I didn't pin to one CPU. As for the boosting feature, I tested on an old x86 machine, so I'm not sure about its features. Anyway, I'm not familiar with x86 performance testing, so I think it's better to let other experts have a try.

Thanks,
-Hao

In D5864#54, @jingyue wrote:

LGTM with some minor fixes. You may want to double check with ARM folks on performance. I remember Gerolf said he couldn't get the same performance numbers as you did.

Thanks again for all the hard work!

Hi Jingyue,

Thanks very much for your hard work on the review, which really helped me a lot.
I've committed the code in http://llvm.org/viewvc/llvm-project?view=revision&revision=222328 and http://llvm.org/viewvc/llvm-project?view=revision&revision=222331.

BTW, for the performance testing, I also tested other revisions with some slight modifications, such as:
(1) Always lower GEPs if there is more than one index, regardless of whether we found a constant.
(2) Call the SeparateConstOffsetFromGEP pass before calling TargetPassConfig::addIRPasses() in AArch64TargetMachine.cpp.

The performance results differ: some benchmarks are better, some are worse. So I think we may need further tuning in the future for better performance. Anyway, that's future work.

Thanks,
-Hao

SPEC2006 benchmarks are single-threaded, but this single thread can be
scheduled by the operating system on different CPUs (cores, hyperthreads,
etc.) in different time slices. I used to encounter fluctuation too and
fixed that using the taskset command. Not sure if that can help you too.

Also, I don't think x86 matters because this pass is only enabled in
AArch64 and NVPTX for now. I do feel a little worried on the 1%-2%
regression you observed on AArch64. It would be great if you can get a more
stable result.

Jingyue

Hi Hao,

thanks for all your efforts landing this. We’ll check the performance impact on cyclone. Did lto have any impact on your results?

Cheers
Gerolf

In D5864#61, @jingyue wrote:

SPEC2006 benchmarks are single-threaded, but this single thread can be
scheduled by the operating system on different CPUs (cores, hyperthreads,
etc.) in different time slices. I used to encounter fluctuation too and
fixed that using the taskset command. Not sure if that can help you too.

Also, I don't think x86 matters because this pass is only enabled in
AArch64 and NVPTX for now. I do feel a little worried on the 1%-2%
regression you observed on AArch64. It would be great if you can get a more
stable result.

Jingyue

Hi Jingyue,

When I run benchmarks on the AArch64 machine, I use taskset, because there are two kinds of cores (A53 and A57). I thought all the x86 cores on my machine were the same, so I didn't use taskset. Thanks for sharing; it will be helpful for our future performance tests.

For the regressions, I'll keep on watching the performance results.

Thanks,
-Hao

In D5864#62, @Gerolf wrote:

Hi Hao,

thanks for all your efforts landing this. We’ll check the performance impact on cyclone. Did lto have any impact on your results?

Cheers
Gerolf

Hi Gerolf

I don't know the LTO impact. Actually, I have never tested with LTO and don't know how to test it.
If you think LTO will be affected, I can have a try.

Thanks,
-Hao

Alright, we'll do an LTO A/B run then. I thought you might have checked libquantum, where I saw a regression with an earlier version of the patch, but let's sync up once we have the latest data.

Cheers
Gerolf

HaoLiu closed this revision.May 13 2015, 7:52 PM