Add new x86 pass which replaces address calculations in load or store instructions with def register of existing LEA (must be in the same basic block), if the LEA calculates address that differs only by a displacement. Works only with -Os or -Oz.
Details
Diff Detail
Event Timeline
Hi Andrey,
Thanks for working on this.
What is the compile time impact of that pass?
Right now, the scope of the pass is pretty narrow. For instance, it is basic block scope only whereas it should be MachineFunction scope.
Anyhow, I am fine with it if we have a plan to improve it, therefore the question:
What is the long term plan for this pass?
Finally, the added test case seems too small to cover all the code added. I may of course be wrong, it seems we are missing some case. In particular, please make sure to cover the cases where:
- The size of the displacement exceeds 4-bytes.
- The matching LEA is after the definition.
- The size of the displacement exceeds 1-bytes for a closer candidate.
- The function doesn’t have the midsize/optsize attribute.
Cheers,
-Quentin
lib/Target/X86/X86OptimizeLEAs.cpp | ||
---|---|---|
42 | That’s a bit strange to have that private. | |
55 | Since you expect both instructions to be valid, put references. | |
59 | Use SmallVectorImpl to not have to repeat the number of elements everywhere. | |
60 | Please explain all the fields in the comment. | |
68 | What are the unsigned argument used for? | |
97 | use \p MI. | |
98 | *The* address… by *the* displacement… | |
100 | The register class of the definition of the LEA… is compatible? | |
104 | Why? Also, after reading the code, looks like you bail on the first LEA that is past MI, so what is the strategy here? | |
132 | You can also constrain the register class to the intersection of both classes. | |
172 | What are the advantages of the vector against a static const array? | |
175 | Even if both operand are identical, that does not mean they carry the same value. | |
220 | the number |
Hi Quentin!
Thanks for the review.
My measurements show no significant compile-time impact of the pass in -Os mode.
There are plenty of things to improve in the patch. Here are my thoughts about it:
- I believe the most needed thing in the pass is a proper heuristic to decide which address calculations should be replaced and which should not. The current code size impact of the pass is -0.2% in -Os mode and -0.3% in -Oz mode (including part 2 of the patch) on Spec 2000. That's not much, I was hoping for more when I started the development. However I think a good heuristic can improve the results. Moreover, a heuristic can probably discover some performance opportunities of the pass. Last time I measured, I got about 0% geomean change in performance on Spec 2000, however the measurements showed that some tests have significant improvement and some have significant degradation from the pass. So if a heuristic could help to cut off some of bad cases, we would get a performance gain as well.
- MachineFunction scope. I implemented this as one of my experiments with the pass, but the changes didn't give any improvements (or even had a negative effect, I can't remember). I believe extended over-basic-block liveranges of LEA defs damage code. So to extend the scope we need to have some heuristic to understand when we want to use LEAs over basic blocks or not. I wasn't able to find one, Unfortunately I didn't save the patch, but extending the scope is an easy thing to implement - search for LEAs in basic blocks from DominatorTree in findLEAs and tweak calcInstrDist and chooseBestLEA to handle instructions from different basic blocks.
- Some time ago I discovered that at the moment when the LEA pass works some instructions calculating addresses are not yet LEAs (they will be converted later after RA, you can see convertToThreeAddress for details). I've created an experimental patch to handle these not-yet-LEAs as well, but it didn't give any improvements, so I dropped it. So this change requires some heuristic as well, otherwise we can't be sure whether we improve code or make it worse.
- The biggest part of code size improvement of the pass is from replacing %esp with other register in moves to/from stack (example: lea 256(%esp), %eax; mov $111, 260(%esp)). That's profitable because if a new displacement fits 1 byte (and the old one does not), we save few bytes from different encoding of the instruction. However at the moment when the LEA pass works we don't have the actual displacement, only frame index and some displacement relative to it, so we can't really judge whether we should replace %esp with another register or not. So another possible improvement is to keep %esp cases unchanged until after RA, and than make a decision to replace %esp according to actual displacements.
That's my vision of what could be done to the pass in the future. I'm most interested in having a deeper analysis to understand when the replacement of address calculation is profitable, but I have no ideas how to do that.
About tests, I will add some soon to increase the coverage.
lib/Target/X86/X86OptimizeLEAs.cpp | ||
---|---|---|
44 | Fixed. | |
57 | Fixed here and in the other places. | |
61 | Fixed. | |
62 | Extended the comment. | |
70 | Extended the comment. | |
99 | Should I put "\p" before every argument name mentioned in every comment? | |
100 | Fixed. | |
102 | Fixed. | |
106 |
These two conditions define the best LEA. Choosing 1 byte displacement over the bigger one leads to saving few bytes through different instruction encoding. And we try to use the closest LEA to avoid significant increasing of LEA's def liverange. | |
134 | The case where this is needed is extremely rare. I used to have this in the initial version of the patch, but I wasn't able to write the test for it. So I just decided to drop it. | |
174 | Fixed. | |
177 | To avoid large 'if' expressions I created a separate function to perform the check: isIdenticalOp. | |
222 | Fixed. |
Added more tests.
"The size of the displacement exceeds 4-bytes." - this case is the only one remaining uncovered. I wasn't able to force the compiler to use such big address displacement (If address has a big displacement it's stored into a register and than used through it). Not sure if this is possible without changing compiler sources.
lib/Target/X86/X86OptimizeLEAs.cpp | ||
---|---|---|
99 | Those are only proceeded in the doxygen like comments (e.g., ///). | |
106 | Although I get what you are saying, it feels arbitrary to me to stop on the first LEA after the current instruction when this is the only candidate we found. | |
134 | Maybe say that in the comment then. |
Hi Andrey,
LGTM modulo two things:
- Merge the test cases in one file.
- Run opt -instnamer on the IR to get rid of the %[0-9]+ variables.
Please commit with those changes.
Thanks,
-Quentin
Please see https://llvm.org/bugs/show_bug.cgi?id=25843: this pass can be
very slow, exponentially blowing the compile time for large functions.
That’s a bit strange to have that private.