This is an archive of the discontinued LLVM Phabricator instance.

Unmerge GEPs to reduce register pressure on IndirectBr edges.
ClosedPublic

Authored by hjyamauchi on Aug 15 2017, 3:47 PM.

Details

Summary

GEP merging can sometimes increase the number of live values and register
pressure across control edges and cause performance problems particularly if the
increased register pressure results in spills.

This change implements GEP unmerging around an IndirectBr in certain cases to
mitigate the issue. This is in the CodeGenPrepare pass (after all the GEP
merging has happened.)

With this patch, the Python interpreter loop runs faster by ~5%.

Event Timeline

hjyamauchi created this revision.Aug 15 2017, 3:47 PM

Hi Hal, would you take a look at this change?

hfinkel edited edge metadata.Aug 21 2017, 6:45 PM

With this patch, the Python interpreter loop runs faster by ~5%.

On what platform?

Did you try always doing this (instead of just doing it over indirect branches)? You're possibly increasing the critical path by doing the computation this way, and if you have a processor with a good branch predictor maybe this shows up? But if you told me that it generally always helps, I'd not be too surprised either.

You should be careful not to create constants that aren't cheap to represent if you start with ones that are. Specifically, UGEPIIdx->getValue() - GEPIIdx->getValue() might be large (if one of those values is negative). TTI has getIntImmCost, and if both UIdx and Idx are cheap, but (Uidx - Idx) is expensive, we probably don't want to do this.

Sorry for a delay.

With this patch, the Python interpreter loop runs faster by ~5%.

On what platform?

x86-64 Haswell.

Did you try always doing this (instead of just doing it over indirect branches)? You're possibly increasing the critical path by doing the computation this way, and if you have a processor with a good branch predictor maybe this shows up? But if you told me that it generally always helps, I'd not be too surprised either.

Good points.

No, I didn't try always doing this. My thoughts follow:

As you point out, I think there may be a tradeoff between potential spills and a potentially longer critical path, and it's not 100% clear which way is better *in general* because that would depend on how the CPU works and it's not easy to tell whether this would actually save spills at this stage, though for the indirectbr in the python interpreter, it's a clear win due to fewer spills (on x86-64).

The benefits of restricting this to relatively rare indirectbr are that (a) it might limit the impact of a potentially longer critical path, if any, and (b) the impact on the compile time should be minimal because it's the first check we do there and the common case doesn't need to go through the subsequent more elaborate checks.

Maybe should query the target and do this only if the number of registers is low (or just x86(-64))?

I wish I could formulate this in a better way. I haven't found a better way so far. If you see a better way, please let me know.

You should be careful not to create constants that aren't cheap to represent if you start with ones that are. Specifically, UGEPIIdx->getValue() - GEPIIdx->getValue() might be large (if one of those values is negative). TTI has getIntImmCost, and if both UIdx and Idx are cheap, but (Uidx - Idx) is expensive, we probably don't want to do this.

Agreed. Will work on this.

Added getIntImmCost checks. Please take another look.

eastig added a subscriber: eastig.Aug 25 2017, 1:00 PM

As this patch can affect ARM targets I am doing some benchmarking.
I've got the LNT benchmarks results for AArch64 (Cortex-A57). There is no difference in performance. I'll have got more results soon.
It's interesting to see what benchmarks has been used to measure the improvements.

As this patch can affect ARM targets I am doing some benchmarking.
I've got the LNT benchmarks results for AArch64 (Cortex-A57). There is no difference in performance. I'll have got more results soon.
It's interesting to see what benchmarks has been used to measure the improvements.

Good to know.

The improvements I saw were measured with python programs (like the following) running on the python 2.7 runtime compiled with LLVM at r309573 on x86-64 Haswell.

for _ in xrange(1, 100000000):
  continue

This patch reduces register spills in the computed goto-based interpreter loop.

Noted the tradeoff between register pressures and critical path in the comment.

Also rebased.

This is looking good. A couple additional things...

lib/CodeGen/CodeGenPrepare.cpp
6224

I don't think this check is necessary. GEPIOp is constrained to be defined in SrcBlock, and it's SrcBlock that has the IndirectBr terminator, so *any* use of GEPIOp outside of SrcBlock keeps it live over the indirect edge. I don't see why we wouldn't unmerge regardless of the parent block here.

6264

You'll also need to make sure that this GEP is not marked as inbounds if GEPI was not.

if (!GEPI->isInBounds()) {
  UGEPI->setIsInBounds(false);
}

because otherwise the result of GEP could be not-in-bounds resulting in UB if that's used as the input to an inbounds UGEPI.

hjyamauchi updated this revision to Diff 114084.Sep 6 2017, 3:22 PM
hjyamauchi marked an inline comment as done.

One comment addressed and another needs clarification.

hjyamauchi added inline comments.Sep 6 2017, 3:23 PM
lib/CodeGen/CodeGenPrepare.cpp
6264

I'm not very familiar with how inbounds works.

Is an inbounds GEP UB if it takes a non-inbounds GEP as its operand (regardless of whether the index/offset is actually in bounds or not)?

For example,

Before:

%GEPIOp = ...
%GEPI = gep %GEPIOp 2
%UGEPI = gep inbounds %GEPIOp 1

After:

%GEPIOp = ...
%GEPI = gep %GEPIOp 2
%UGEPI = gep inbounds %GEPI -1

Suppose "gep %GEPIOp 2" is not in bounds and "gep inbounds %GEPIOp 1" is in bounds. Both aren't UB.

"gep inbounds %GEPI -1" is UB just because it takes (non-inbounds) "gep %GEPIOp 2" as an operand, even though the offset/index of "gep inbounds %GEPI -1" is actually in bounds?

hfinkel added inline comments.Sep 6 2017, 5:25 PM
lib/CodeGen/CodeGenPrepare.cpp
6264

Yes, the base pointer needs to be inbounds too. The LangRef says, "If the inbounds keyword is present, the result value of the getelementptr is a poison value if the base pointer is not an in bounds address of an allocated object, or if any of the addresses that would be formed by successive addition of the offsets implied by the indices to the base address with infinitely precise signed arithmetic are not an in bounds address of that allocated object." That's exactly why you need to account for the inbounds here.

hjyamauchi marked an inline comment as done.

Addressed a comment.

hfinkel accepted this revision.Sep 7 2017, 3:14 PM

LGTM

This revision is now accepted and ready to land.Sep 7 2017, 3:14 PM
hjyamauchi closed this revision.Sep 11 2017, 10:53 AM