This is an archive of the discontinued LLVM Phabricator instance.

LEA code size optimization pass (Part 1): Remove redundant address recalculations
ClosedPublic

Authored by aturetsk on Sep 30 2015, 8:46 AM.

Details

Summary

Add new x86 pass which replaces address calculations in load or store instructions with def register of existing LEA (must be in the same basic block), if the LEA calculates address that differs only by a displacement. Works only with -Os or -Oz.

Diff Detail

Repository
rL LLVM

Event Timeline

aturetsk updated this revision to Diff 36112.Sep 30 2015, 8:46 AM
aturetsk retitled this revision from to LEA code size optimization pass (Part 1): Remove redundant address recalculations.
aturetsk updated this object.
aturetsk added a reviewer: nadav.
aturetsk added a subscriber: llvm-commits.
aturetsk updated this object.Sep 30 2015, 9:04 AM

Hi Andrey,

Thanks for working on this.

What is the compile time impact of that pass?

Right now, the scope of the pass is pretty narrow. For instance, it is basic block scope only whereas it should be MachineFunction scope.
Anyhow, I am fine with it if we have a plan to improve it, therefore the question:
What is the long term plan for this pass?

Finally, the added test case seems too small to cover all the code added. I may of course be wrong, it seems we are missing some case. In particular, please make sure to cover the cases where:

  • The size of the displacement exceeds 4-bytes.
  • The matching LEA is after the definition.
  • The size of the displacement exceeds 1-bytes for a closer candidate.
  • The function doesn’t have the midsize/optsize attribute.

Cheers,
-Quentin

lib/Target/X86/X86OptimizeLEAs.cpp
42 ↗(On Diff #36112)

That’s a bit strange to have that private.

55 ↗(On Diff #36112)

Since you expect both instructions to be valid, put references.
This is also true for the rest of the class.

59 ↗(On Diff #36112)

Use SmallVectorImpl to not have to repeat the number of elements everywhere.

60 ↗(On Diff #36112)

Please explain all the fields in the comment.
Also, shouldn’t List be const here?

68 ↗(On Diff #36112)

What are the unsigned argument used for?
In other words, please document how they are used.

97 ↗(On Diff #36112)

use \p MI.

98 ↗(On Diff #36112)

*The* address… by *the* displacement…
I am not a native English speaker, but without the articles, I found the sentence strange.

100 ↗(On Diff #36112)

The register class of the definition of the LEA… is compatible?

104 ↗(On Diff #36112)

Why?
You didn’t define “best” for the LEA instruction, so I guess it may make sense, but I don’t see why now.

Also, after reading the code, looks like you bail on the first LEA that is past MI, so what is the strategy here?
I.e., what if the next LEA would have matched but not this one.

132 ↗(On Diff #36112)

You can also constrain the register class to the intersection of both classes.

172 ↗(On Diff #36112)

What are the advantages of the vector against a static const array?
Look into X86TransformInfo for instance.

175 ↗(On Diff #36112)

Even if both operand are identical, that does not mean they carry the same value.
This is true for SSA variable, but physical register are not SSA.
You must check that the operand is not a physical register.

220 ↗(On Diff #36112)

the number

aturetsk updated this revision to Diff 40929.Nov 23 2015, 7:28 AM

Fix most of the comments

Hi Quentin!

Thanks for the review.

My measurements show no significant compile-time impact of the pass in -Os mode.

There are plenty of things to improve in the patch. Here are my thoughts about it:

  1. I believe the most needed thing in the pass is a proper heuristic to decide which address calculations should be replaced and which should not. The current code size impact of the pass is -0.2% in -Os mode and -0.3% in -Oz mode (including part 2 of the patch) on Spec 2000. That's not much, I was hoping for more when I started the development. However I think a good heuristic can improve the results. Moreover, a heuristic can probably discover some performance opportunities of the pass. Last time I measured, I got about 0% geomean change in performance on Spec 2000, however the measurements showed that some tests have significant improvement and some have significant degradation from the pass. So if a heuristic could help to cut off some of bad cases, we would get a performance gain as well.
  1. MachineFunction scope. I implemented this as one of my experiments with the pass, but the changes didn't give any improvements (or even had a negative effect, I can't remember). I believe extended over-basic-block liveranges of LEA defs damage code. So to extend the scope we need to have some heuristic to understand when we want to use LEAs over basic blocks or not. I wasn't able to find one, Unfortunately I didn't save the patch, but extending the scope is an easy thing to implement - search for LEAs in basic blocks from DominatorTree in findLEAs and tweak calcInstrDist and chooseBestLEA to handle instructions from different basic blocks.
  1. Some time ago I discovered that at the moment when the LEA pass works some instructions calculating addresses are not yet LEAs (they will be converted later after RA, you can see convertToThreeAddress for details). I've created an experimental patch to handle these not-yet-LEAs as well, but it didn't give any improvements, so I dropped it. So this change requires some heuristic as well, otherwise we can't be sure whether we improve code or make it worse.
  1. The biggest part of code size improvement of the pass is from replacing %esp with other register in moves to/from stack (example: lea 256(%esp), %eax; mov $111, 260(%esp)). That's profitable because if a new displacement fits 1 byte (and the old one does not), we save few bytes from different encoding of the instruction. However at the moment when the LEA pass works we don't have the actual displacement, only frame index and some displacement relative to it, so we can't really judge whether we should replace %esp with another register or not. So another possible improvement is to keep %esp cases unchanged until after RA, and than make a decision to replace %esp according to actual displacements.

That's my vision of what could be done to the pass in the future. I'm most interested in having a deeper analysis to understand when the replacement of address calculation is profitable, but I have no ideas how to do that.
About tests, I will add some soon to increase the coverage.

lib/Target/X86/X86OptimizeLEAs.cpp
43 ↗(On Diff #40929)

Fixed.

56 ↗(On Diff #40929)

Fixed here and in the other places.

60 ↗(On Diff #40929)

Fixed.

61 ↗(On Diff #40929)

Extended the comment.
Put 'const' here and in other places where possible.

69 ↗(On Diff #40929)

Extended the comment.

98 ↗(On Diff #40929)

Should I put "\p" before every argument name mentioned in every comment?

99 ↗(On Diff #40929)

Fixed.

101 ↗(On Diff #40929)

Fixed.

105 ↗(On Diff #40929)

3) Displacement of the new memory operand should fit in 1 byte if possible.
4) The LEA should be as close to MI as possible, and prior to it if
// possible.

These two conditions define the best LEA. Choosing 1 byte displacement over the bigger one leads to saving few bytes through different instruction encoding. And we try to use the closest LEA to avoid significant increasing of LEA's def liverange.
And it just seems more right to me to give priority to the first LEA before MI than to the first LEA after MI regardless of the distance between LEAs and MI to avoid instruction moving.

133 ↗(On Diff #40929)

The case where this is needed is extremely rare. I used to have this in the initial version of the patch, but I wasn't able to write the test for it. So I just decided to drop it.

173 ↗(On Diff #40929)

Fixed.

176 ↗(On Diff #40929)

To avoid large 'if' expressions I created a separate function to perform the check: isIdenticalOp.

221 ↗(On Diff #40929)

Fixed.

aturetsk updated this revision to Diff 41310.Nov 27 2015, 8:02 AM

Added more tests.
"The size of the displacement exceeds 4-bytes." - this case is the only one remaining uncovered. I wasn't able to force the compiler to use such big address displacement (If address has a big displacement it's stored into a register and than used through it). Not sure if this is possible without changing compiler sources.

qcolombet added inline comments.Dec 1 2015, 11:52 AM
lib/Target/X86/X86OptimizeLEAs.cpp
98 ↗(On Diff #40929)

Those are only proceeded in the doxygen like comments (e.g., ///).
In these comments, I think we usually put \p before every argument name.

105 ↗(On Diff #40929)

Although I get what you are saying, it feels arbitrary to me to stop on the first LEA after the current instruction when this is the only candidate we found.
For now, this is fine, just make sure to add a FIXME.

133 ↗(On Diff #40929)

Maybe say that in the comment then.

aturetsk updated this revision to Diff 41762.Dec 3 2015, 8:58 AM

Fixed the patch according to the latest comments.

qcolombet accepted this revision.Dec 3 2015, 10:41 AM
qcolombet added a reviewer: qcolombet.

Hi Andrey,

LGTM modulo two things:

  • Merge the test cases in one file.
  • Run opt -instnamer on the IR to get rid of the %[0-9]+ variables.

Please commit with those changes.

Thanks,
-Quentin

This revision is now accepted and ready to land.Dec 3 2015, 10:41 AM
aturetsk updated this revision to Diff 41852.Dec 4 2015, 2:16 AM
aturetsk edited edge metadata.

Merged tests and ran 'opt -instnamer' on them.

This revision was automatically updated to reflect the committed changes.

Please see https://llvm.org/bugs/show_bug.cgi?id=25843: this pass can be
very slow, exponentially blowing the compile time for large functions.