
[AArch64] Improve and enable the SeparateConstOffsetFromGEP for AArch64 backend.
Closed, Public

Authored by HaoLiu on Oct 19 2014, 11:39 PM.

Details

Summary

Hi Tim, Jingyue and other reviewers,

This patch is based on http://reviews.llvm.org/D5863, which fixes some problems in the SeparateConstOffsetFromGEP pass, so please apply that patch first if you want to try it.

We find that LLVM does not handle CSE well for GEPs (getelementptrs). A GEP can be very complex: it can mix many constant and variable indices, and the variable indices may themselves contain constants. Such complex GEPs are usually left for CodeGen, but CodeGen only sees one basic block at a time, so CSE opportunities for GEPs that span basic blocks (e.g., two GEPs used in two different basic blocks, or one GEP used in two basic blocks) are missed.
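For example (an illustrative sketch, not code taken from the patch), two GEPs in different basic blocks can share all of their leading indices and differ only in the last one, so CSE on the whole GEP values finds nothing to reuse:

; bb1:
%p1 = getelementptr [32 x [32 x float]]* %a, i64 0, i64 %i, i64 3
; bb2:
%p2 = getelementptr [32 x [32 x float]]* %a, i64 0, i64 %i, i64 5
; The partial computation "%a + %i * 128" (one 32-float row is 128 bytes) is
; common to both addresses, but it is hidden inside the GEPs and is never
; materialized as a reusable value.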

Currently there is a pass called SeparateConstOffsetFromGEP, which can separate constants from variable indices and split a complex GEP into two GEPs. But this is not enough to solve the cross-basic-block problem described above.

So I improve this pass. It now separates constants from indices for both sequential and struct types. Most importantly, it also transforms complex GEPs into a "ptrtoint + arithmetic + inttoptr" form, so that CSE opportunities across basic blocks can be found. This also benefits the address-sinking logic in CodeGenPrepare for complex GEPs that cannot be matched by an addressing mode, because part of the address calculation can still be sunk, which results in a better addressing mode. The EarlyCSE pass is run after this pass to perform the CSE, and the LICM pass is run to hoist any address calculations that turn out to be loop invariant.
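As a rough sketch of what this lowering produces (illustrative only; the exact instruction sequence emitted by the pass may differ), %p1 from the example above becomes explicit integer arithmetic, so the shared part of the address computation turns into a value that EarlyCSE or LICM can reuse:

%base = ptrtoint [32 x [32 x float]]* %a to i64
%row  = shl i64 %i, 7            ; %i * 128, the part shared with %p2
%tmp  = add i64 %base, %row
%addr = add i64 %tmp, 12         ; constant part: 3 * sizeof(float)
%p1   = inttoptr i64 %addr to float*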

If no CSE opportunities are found after the arithmetic transformation, there is still no obvious performance regression, because CodeGen always performs the same transformation anyway; we merely move it several passes ahead of CodeGen.

I tested performance on Cortex-A57 (AArch64) with SPEC CPU2000 and SPEC CPU2006. There are no obvious regressions, and there are improvements on the following benchmarks:
spec2006.473.astar 4.7%
spec2006.444.namd 3.0%
spec2006.445.gobmk 2.5%

For the benchmarks without an obvious improvement, the assembly still shows better address calculations and addressing modes.

I cannot test this patch on other targets such as NVPTX, but I believe it can also benefit their performance, or at least cause no regression.

Review please.

Thanks,
-Hao

Diff Detail

Event Timeline

This comment was removed by HaoLiu.
In D5864#35, @jingyue wrote:

I mean the issue we spotted previously with the following example:

struct S {
  int a[4]; // sizeof(int) = 4
  double b[8]; // sizeof(double) = 8
};

%s = alloca struct %S
%p = getelementptr %s, 0, 1, i ; &s.b[i] = s + 4*sizeof(int) + i*sizeof(double)

I thought we agreed that your diff had a correctness issue with this example, and I didn't see any change to fix that issue in your latest diff. Am I missing something?

Sorry, my words misled you. When I said "we can transform it to either of two simpler forms", I meant that this patch already does that.
The correctness issue was already solved in the first patch. See transformToSimplerGEPs() for details; it does exactly what I explained. The transformation to simpler GEPs is similar to the transformation to arithmetic operations; it simply replaces the ADDs with GEPs. So your concern about the correctness issue has already been addressed. I also have two test cases covering structure indices.
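As an illustration of what I mean (a sketch, not the exact output of transformToSimplerGEPs), the &s.b[i] address from the example above could be rebuilt as single-index GEPs on an i8 pointer, with the constant struct-field offset kept intact in its own GEP:

%0 = bitcast %S* %s to i8*
%1 = shl i64 %i, 3                    ; %i * sizeof(double)
%2 = getelementptr i8* %0, i64 %1     ; variadic part, one index per GEP
%3 = getelementptr i8* %2, i64 16     ; constant offset of field b (4 * sizeof(int))
%p = bitcast i8* %3 to double*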

Thanks,
-Hao

I see what's going on: you only extract constant offsets from array indices, and leave struct indices in those transformXXX functions. Thanks for clarification! I think that's worth some comment though, and I will point that out in my inline comments.

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
82

The word "simpler" is overloaded. I'd explicitly say something like:

// to lower a GEP with multiple indices into multiple GEPs with a single index or arithmetic operations ...

Is this more accurate?

102

The sentence "another simpler form of simpler GEPs is similar" is a little awkward. The word "simple" is very overloaded. What do you mean exactly by "simpler"? Consider using multiple sentences and/or an example to explain what this lowering to GEPs exactly mean.

125

This => Lowering GEPs

126

The word "optimize" is overloaded. Is hoist/sink a better word?

131

Such transformation can split it => Lowering GEPs can split such GEPs

184

You may want to say Find looks for constant offsets in both array indices and field indices.

Because the signature of Extract is changed, you need to remove the comment "The meaning of the arguments and the return value ..." and explicitly explain what its arguments and return value mean.

313

I don't like the words "transform" and "simpler". To be more accurate, I'd use lower instead of transform. What do you exactly mean by "simpler"? Are you lowering a GEP with multiple indices to multiple GEPs with a single index? If that's the case, I'd say so explicitly in order to be more accurate.

319

ditto

323

will find => finds
will only find => only finds
else => ; otherwise,

324

none zero => non-zero

345

/// Whether to transform

713

If you buy that transformXXX should run after setIsInBounds(false), you needn't check inbounds in this function.

717

These are essentially byte offsets instead of indices, because they are scaled by element sizes. Do we have a better name? ByteOffsetOfIndex?

717

Actually, is this vector even necessary? Can you build the resultant uglygep along the way?

719

vairable => variable

721

Why check whether it's a constant? Shouldn't you check whether it is an array index? My understanding is that the offsets of structure indices (which are guaranteed to be constant) and the constant offsets of array indices are already counted in AccumulativeByteOffset, and now you only need to rebuild the variadic part of the array indices.

726

{ at the end

730

} else {

744

The meaning of "simpler" here is unclear.

779

ditto. See Line 721

784

ditto. curly brackets.

797

{

801

}

827

Why do we need to check LowerGEP here? My understanding is that even when LowerGEP is set, +AccumulativeByteOffset will still become part of the lowered GEPs or arithmetic (see how transformToSimplerGEPs creates a GEP with offset AccumulativeByteOffset, and how transformToArithmetics creates an add with offset AccumulativeByteOffset). Shouldn't we still check whether this offset can be nicely folded into an addressing mode?

836

// ... in each sequential index

Emphasizing that you only remove constant offsets from sequential indices is important, because the behavior here is slightly different from accumulateByteOffset, which looks into both sequential and struct indices.

You also need to explain why removing constant offsets from struct field indices here is dangerous, and mention that the transformXXX functions will take care of that.
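(As a sketch of the danger, not code from the patch: a struct field index is part of the GEP's type computation, so unlike an array index it cannot be split into a variable remainder plus an extracted constant.)

; Field index 1 both adds 16 bytes and selects the [8 x double] member:
%p = getelementptr %S* %s, i64 0, i32 1, i64 %i    ; result type double*
; Rewriting the field index (e.g. to i32 0) in order to "extract" its offset
; would make the GEP index a different field and change the result type:
;   getelementptr %S* %s, i64 0, i32 0, i64 %i     ; now i32*, wrong
; So struct-field offsets are folded into AccumulativeByteOffset and re-applied
; by the lowering functions instead.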

863

Running these transformXXX functions before GEP->setIsInBounds(false) seems wrong to me, because transformXXX may treat out-of-bounds GEPs as inbounds and incorrectly add nsw flags.

866

"such form" is too obscure

889

The more I look at these two functions, the more strongly I believe we should merge them. In particular, after you get rid of ResultIndices, you probably only need to check LowerGEP once in the merged function. No?

1016

I'm confused. Check my comment on the line

if (!isa<ConstantInt>(GEP->getOperand(I)))

and if my comment there makes sense I'll come back here.

HaoLiu updated this revision to Diff 16027 (edited). Nov 10 2014, 10:28 PM
In D5864#39, @jingyue wrote:

I see what's going on: you only extract constant offsets from array indices, and leave struct indices in those transformXXX functions. Thanks for clarification! I think that's worth some comment though, and I will point that out in my inline comments.

Hi Jingyue,

Thanks for your detailed comments; they are really very helpful. I've refactored the patch and attached it.
Review please.

Thanks,
-Hao

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
721

Yes, you are right. I'll add comments to explain this.

827

I think that even when a pointer plus a constant offset cannot be matched by an addressing mode, it is still worth lowering a GEP with multiple indices, because we can then optimize the variable indices. Also, when the constant offset is extracted from multiple indices, the result looks better than the original GEP.
If I remove this check, some test cases fail, so it is kept when LowerGEP is disabled.

jingyue added inline comments. Nov 11 2014, 10:56 AM
lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
22

// We cannot eliminate the partially common subexpression getelementptr [10 x %struct]* %ptr, i64 %i

184

I don't see the first part of this comment addressed. I'd like to emphasize "both array indices and field indices".

720

Not critical, but I slightly prefer getZExtValue() because a field index is non-negative and getElementOffset takes unsigned.

721

I wasn't just proposing a comment change. I meant the source code should read something like

if (isa<SequentialType>(GTI))

You are building new GEPs/arithmetics for the remainder of the array indices, and checking whether the indexed type is a SequentialType matches your intention much better.

I also believe checking for SequentialType instead of ConstantInt solves your previous question on handling "a = 1 + 2". I'll go back to that later.

827

That makes some sense. Out of curiosity, which test would fail?

853

// we always don't => we don't remove struct field indices here

"always don't" is not the right grammar. Besides, you do extract field indices later, so I'd weaken the word "never".

854

diable => LowerGEP is disabled

855

enable LowerGEP => LowerGEP is enabled

856

I wrote lowerToXXX in my reviews just for simplicity. However, in the source code, we should still use the full names.

889

I don't see this comment addressed.

jingyue requested changes to this revision. Nov 11 2014, 10:56 AM
jingyue edited edge metadata.
This revision now requires changes to proceed. Nov 11 2014, 10:56 AM
Gerolf added a subscriber: Gerolf. Nov 11 2014, 10:59 AM

Hi Hao,

could you share your current performance data + options you use? With your patch I see small gains on astar + xalancbmk, but a ~4% loss on libquantum. The data is based on the 11/07 trunk on top of r221549, O3 + LTO, cyclone, ref input.
In addition to cortex, could you run on x86 also?

Thanks
Gerolf

In D5864#45, @Gerolf wrote:

Hi Hao,

could you share your current performance data + options you use? With your patch I see small gains on astar + xalancbmk, but a ~4% loss on libquantum. The data is based on the 11/07 trunk on top of r221549, O3 + LTO, cyclone, ref input.
In addition to cortex, could you run on x86 also?

Thanks
Gerolf

Hi Gerolf,

I use "-O3 -ffast-math -mcpu=cortex-a57" and ref input. My last run was on the 10/24 trunk on top of r220484 (To ignore the impact of trunk changes, I always run and compare to the specific revision). I've ran the performance test for about 4 times on this specific revision. To ignore noise and variances, I remove those benchmark who doesn't exist in every performance runs. Then I find there is no obvious regressions and have stable improvements on following 3 benchmarks (This is just results from one run, other runs have similar results):

SPEC2006.473.astar          -4.84%
SPEC2006.444.namd           -4.39%
SPEC2006.445.gobmk          -2.80%

The percentage number is calculated as: 100% - current_time / reference_time.

It's quite strange that your Cyclone results differ so much from mine. libquantum is a very stable benchmark on our test infrastructure, and I have never seen its performance change with my patch, so I have assumed my patch doesn't affect it.

I cannot test Cyclone, so I ran on x86 with today's trunk, with and without my patch (I enabled the SeparateConstOffsetFromGEP pass in the x86 backend in the same way as in the AArch64 backend). I ran libquantum about 8 times. It is not stable on my x86 machine: the results vary by +/-1% around the average both with and without my patch, but the averages are similar, so I don't think my patch affects libquantum's performance on x86. I have also started a performance run of all the int/fp benchmarks in SPEC 2006, but it takes quite a long time, so I don't have the results yet. With luck I'll have them tomorrow and will share them.
So I wonder whether libquantum is stable on Cyclone. If it isn't, maybe you can run it several times to average out the variance. Or, if it is a real regression, we can then investigate it.

Do you have more details about your results? I'm interested in how much astar improves, and also in why there is no improvement on namd and gobmk; they are supposed to get some benefit from this patch.

Thanks,
-Hao

Hi Gerolf,

Why do you want to see the results on x86? SeparateConstOffsetFromGEP is
only used by NVPTX and AArch64 so far. I don't think it will affect the
compilation time or the compiled code on x86.

Jingyue

HaoLiu edited edge metadata.

Hi Jingyue,

I've attached a new patch according to the review comments.
Review please.

Thanks,
-Hao

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
184

I think "both array indices and field indices" means all indices. We don't need to explain this common behavior. It is quite obvious. I think we need comments if we only find in one kind of "array indices" or "field indices", or do something special.
Anyway, the new comments will be like the comments for Extract()

720

Aha, previously I used getZExtValue(). You commented that I should change it to getSExtValue(); I thought it was not critical, so I modified it.
Now it seems I should roll it back...
: )

721

If we check SequentialType, we also need to check whether it is a ConstantInt. Then the code would look like:

if (isa<SequentialType>(GTI))
  if (!isa<ConstantInt>(GEP->getOperand(I)))
     ...
  else
     AccumulativeByteOffset += Const

This looks more complicated.

About the "a = 1 + 2", I think handling it when extracting constant in indices is better. Then we don't need to add code in both lowerToXXX functions. Also if we handle it in lowerToXXX functions, the logic for the original splitGEP (I.E. when lowerGEP is diabled) can still not handle this problem.

827

Oh, interesting. I assumed some of your test cases in test/Transforms/SeparateConstOffsetFromGEP/NVPTX would fail.
But I tried removing that check and ran "make check", and there were no test failures. So it seems there are no such test cases...

889

I still have doubts about this suggestion. Diff 16027 already gets rid of ResultIndices, but the two functions are still quite different and difficult to combine.
Lowering to arithmetic operations starts by creating a ptrtoint, then creates ADD/MUL/SHL instructions, and ends by creating an ADD and an inttoptr.
Lowering to GEPs starts by creating a bitcast, then creates MUL/SHL/GEP instructions, and ends by creating a GEP and a bitcast.

1016

I think this is the right place to do this. As the comment says, we want to "Remove the constant offset in each GEP index", rather than leave some constant offset behind in an index.
Also, if we fixed this in the following lowerToXXX functions, the original splitGEP logic (i.e., when LowerGEP is disabled) would still not handle this problem.

A general suggestion: I guess you upload your diff using Phab's web interface. If so, I suggest you upload patches with full context (http://llvm.org/docs/Phabricator.html#requesting-a-review-via-the-web-interface) from now on or use arc which uploads full context by default. It's a little difficult to review a diff with only 3-line contexts and Phab seems to have a little trouble computing the diff between two diffs with limited context (at least the line numbers don't match in many places).

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
184

That makes sense.

720

No, I don't think I suggested you change it to getSExtValue(). When I said "getSExtValue() returns uint64_t", I expected you to write

uint64_t Field = ...->getZExtValue();

because I slightly prefer delaying truncating which loses precision. But, anyway, I'm fine with "unsigned".

721

I'm confused. How does AccumulativeByteOffset in your snippet come up here? I think AccumulativeByteOffset is computed before calling transformToSimplerGEPs. Why do you bother changing that again?

In case I haven't made this clear, I never expect Find/Extract to extract all constant offsets in every case. For example, given a = 1 + 2, I'm totally fine if the Find function finds only 2, and in that case the remaining code should treat 1 just as a variable remainder. The GEP after extraction may look like "gep base, i1, 1, i2", which may not be the optimal form, but again I am fine with that because it's sound. If we want the optimal form, I am looking for either a more comprehensive approach (not just handling tricky cases like adding two constants; your initial diff might serve as a good start but the code was too complicated) or reassociating each index before this pass so that all constant offsets in the index are merged (which I suppose is the approach we want to go with).

With that, do we still need the nested "if (!isa<ConstantInt>)" check? Do you have an example where simply checking isa<SequentialType> isn't enough? I think a concrete example may help to make things clearer.

827

That was my suspicion. This check on the addressing mode is a very weak check: I believe both NVPTX and AArch64 support reg + <a potentially very large offset>, so I would be surprised if having or not having this check caused tests to fail.

You explained pretty well why it's still worthwhile to lower a GEP even if addressing-mode matching fails. I'd like to see that in the source code comments. Thanks!

889

Anyway, I'll probably do that by myself after this diff is in.

HaoLiu updated this revision to Diff 16190. Nov 13 2014, 6:58 PM
HaoLiu edited edge metadata.

Hi Jingyue,

Nice suggestion. It really helps the review a lot.

I've attached a new patch with full context. Review please.

Thanks,
-Hao

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
720

Oh, I double-checked that. I did indeed misunderstand your meaning. Last time I used
"unsigned Field = xxx->getZExtValue()", not "uint64_t Field = xxx->getZExtValue()".
Sorry for the noise.
Anyway, diff 16135 has already fixed it.

721

Yeah, I understand your meaning: we just leave the "1" as a remainder. It's sound, because the logic of Find/Extract is to handle only one constant and return. That's reasonable.
Previously I focused on making it optimal, so I wanted to extract all the constants. Anyway, this is just a corner case and may be fixed in the future. I'm fine with that.
BTW, there is already a test case, @test_const_add, checking that we generate a correct result for "a = 1 + 2".

889

OK, thanks.

jingyue accepted this revision. Nov 14 2014, 9:14 AM
jingyue edited edge metadata.

LGTM with some minor fixes. You may want to double check with ARM folks on performance. I remember Gerolf said he couldn't get the same performance numbers as you did.

Thanks again for all the hard work!

lib/Transforms/Scalar/SeparateConstOffsetFromGEP.cpp
148

Lower => Lowering

150

lower => we lower

154

lower => we lower

155

better => a better

333

single index => a single index

334

The grammar of the sentence "It has been extracted a constant offset" is incorrect.

/// Lower a GEP with multiple indices into multiple GEPs with a single index.
/// Function splitGEP already split the original GEP into a variadic part and a
/// constant offset (i.e., AccumulativeByteOffset). This function lowers the
/// variadic part into a set of GEPs with a single index and applies
/// AccumulativeByteOffset to it. 
/// GEP The variadic part of the original GEP
/// AccumulativeByteOffset ...

I'd even like to change the name of the argument GEP to Variadic, because GEP appears so many times throughout this file and usually refers to the original GEP.

343

ditto

348

we can find => we successfully find

My original comment wasn't clear enough, and I think "we successfully find" is better.

370

single index => a single index

850

The grammar of "So we don't such check" is wrong. => Therefore, we don't check for addressing modes in that case.

865

The following xxx and yyy => lowerToSingleIndexGEPs or lowerToArithmetics will later

This revision is now accepted and ready to land. Nov 14 2014, 9:14 AM
In D5864#45, @Gerolf wrote:

Hi Hao,

could you share your current performance data + options you use? With your patch I see small gains on astar + xalancbmk, but a ~4% loss on libquantum. The data is based on the 11/07 trunk on top of r221549, O3 + LTO, cyclone, ref input.
In addition to cortex, could you run on x86 also?

Hi Gerolf,

I ran SPEC2006 4 times on my x86 machine, but the results always fluctuate, so I cannot tell whether enabling this pass helps or not. I think it is because I don't have much experience with testing on x86.

For the AArch64 backend, I recently ran twice, on r222083 and r222017 of trunk, on Cortex-A57. The results are similar to the previous ones, but there is about a 1%-2% regression on 471.omnetpp. That benchmark always fluctuates, so I can't measure it accurately. Anyway, we have many more improvements than regressions.

Thanks,
-Hao

Did you try pinning the benchmark on one CPU? Scheduling can significantly
disturb the benchmark results. Also, if your CPU has any boosting feature
(similar to Intel's turbo boost), try disabling that. One of our folks,
Mark, found that can make benchmarking sensitive too.

Jingyue

In D5864#58, @jingyue wrote:

Did you try pinning the benchmark on one CPU? Scheduling can significantly
disturb the benchmark results. Also, if your CPU has any boosting feature
(similar to Intel's turbo boost), try disabling that. One of our folks,
Mark, found that can make benchmarking sensitive too.

Jingyue

I think SPEC2006 benchmarks always run on one CPU, so I didn't pin them to one CPU. As for the boosting feature, I tested on an old x86 machine, so I'm not sure what features it has. Anyway, I'm not familiar with x86 performance testing, so I think it's better to let other experts have a try.

Thanks,
-Hao

In D5864#54, @jingyue wrote:

LGTM with some minor fixes. You may want to double check with ARM folks on performance. I remember Gerolf said he couldn't get the same performance numbers as you did.

Thanks again for all the hard work!

Hi Jingyue,

Thanks very much for your hard work on the review, which really helped me a lot.
I've committed the code in http://llvm.org/viewvc/llvm-project?view=revision&revision=222328 and http://llvm.org/viewvc/llvm-project?view=revision&revision=222331.

BTW, for the performance testing, I also tested other variants with slight modifications, such as:
(1) Always lower GEPs that have more than one index, regardless of whether we found a constant.
(2) Call the SeparateConstOffsetFromGEP pass before calling TargetPassConfig::addIRPasses() in AArch64TargetMachine.cpp.

The performance results differ: some benchmarks get better, some get worse. So I think we may need further tuning in the future for better performance. Anyway, that's future work.

Thanks,
-Hao

SPEC2006 benchmarks are single-threaded, but this single thread can be
scheduled by the operating system on different CPUs (cores, hyperthreads,
etc.) in different time slices. I used to encounter fluctuation too and
fixed that using the taskset command. Not sure if that can help you too.

Also, I don't think x86 matters because this pass is only enabled in
AArch64 and NVPTX for now. I do feel a little worried on the 1%-2%
regression you observed on AArch64. It would be great if you can get a more
stable result.

Jingyue

Hi Hao,

thanks for all your efforts landing this. We’ll check the performance impact on cyclone. Did lto have any impact on your results?

Cheers
Gerolf

In D5864#61, @jingyue wrote:

SPEC2006 benchmarks are single-threaded, but this single thread can be
scheduled by the operating system on different CPUs (cores, hyperthreads,
etc.) in different time slices. I used to encounter fluctuation too and
fixed that using the taskset command. Not sure if that can help you too.

Also, I don't think x86 matters because this pass is only enabled in
AArch64 and NVPTX for now. I do feel a little worried on the 1%-2%
regression you observed on AArch64. It would be great if you can get a more
stable result.

Jingyue

Hi Jingyue,

When I run benchmarks on AArch64 machine, I use taskset because there are two kinds of cores (A53 and A57). I thought all x86 cores on my machine are the same, so I didn't use taskset. Thanks for your share. It is helpful for our future perf tests.

For the regressions, I'll keep on watching the performance results.

Thanks,
-Hao

In D5864#62, @Gerolf wrote:

Hi Hao,

thanks for all your efforts landing this. We’ll check the performance impact on cyclone. Did lto have any impact on your results?

Cheers
Gerolf

Hi Gerolf

I don't know the LTO impact; actually I have never tested with LTO and don't know how to test it.
If you think LTO will be affected, I can have a try.

Thanks,
-Hao

Alright, we'll do an LTO A/B run then. I thought you might have checked libquantum, where I saw a regression with an earlier version of the patch, but let's sync up once we have the latest data.

Cheers
Gerolf

HaoLiu closed this revision. May 13 2015, 7:52 PM