This implements a more optimal algorithm for selecting a base constant in constant hoisting. It not only takes into account the number of uses of constants, but now also the resulting integer range of the offsets. Thus, the algorithm maximizes the number of uses within an integer range that will enable more efficient code generation. On ARM, for example, this will enable code size optimisations because less negative offsets will be created. Negative offsets/immediates are not supported by Thumb1 thus preventing more compact instruction encoding.
Details
Diff Detail
Event Timeline
Currently, this change impacts all targets. For those targets that don't implement the isImmediateInRangeForLoad API, do we expect this change to be an improvement in general? If not, we might consider having isImmediateInRangeForLoad return an optional bool, so only those targets that implement isImmediateInRangeForLoad are impacted.
lib/Transforms/Scalar/ConstantHoisting.cpp | ||
---|---|---|
296 | differend -> different | |
300 | Can this just be a static helper function? | |
302 | Does this need to be initialized to 'None'? | |
308 | No need for the extra brackets. | |
359 | No need for the extra brackets. | |
363 | No need for the extra brackets. | |
411 | I'd prefer the original 'ConstCand' over just 'i'. | |
559 | No need for the extra brackets. |
Looks like a nice improvement to ConstantHoisting, but I am a little worried about the limited scope and implementation for load optimization only.
Constants are not only used by load/store instructions, so using "isImmediateInRangeForLoad" is very misleading. Also it might negatively impact decisions we used to make for other instructions.
Did you run the test suite to measure the performance and compile time impact for X86 and ARM/AArch64?
include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
405 | Please add a comment describing the new TTI method. | |
lib/Transforms/Scalar/ConstantHoisting.cpp | ||
79 | CandidatesHaveUses -> candidatesHaveUses | |
80 | calcOffsetDiff -> calculateOffsetDiff | |
357 | What about the case when the constant is not used by a load? | |
test/Transforms/ConstantHoisting/X86/phi.ll | ||
23 ↗ | (On Diff #60170) | OLDHECK??? |
Thanks for reviewing!
Yes, this impacts all targets. For targets that don't implement the isImmediateInRangeForLoad API, the default value "true" is returned. This is of course to make sure that we don't exclude any constants that were considered before this change. But yes, this might cause codegen differences. Examples are the 2 regression tests that I had to change (masks.ll and phi.ll). There can be multiple solutions with the same "gain", and simply the first solution is picked. This has the side effect that offsets will be positive, which happens to be good, at least on ARM. I don't see if that would negatively impact other targets (because I don't know them well enough). And yes, I was struggling with the name "isImmediateInRangeForLoad". Initially I just had "isImmediateInRange", but then thought it might be too vague and changed it. But it might actually be better describing its usage, because it can be in range of anything, loads/stores etc.
I have been primarily focusing on correctness (obviously) and code size. This patch shows significant code size reductions on our motivating examples. The new regression tests const-addr-no-neg-offset.ll is a representative, minimal code example of that; significantly more 16-bit load/stores will be generated.
Tomorrow I will try to get performance numbers on the table for ARM and X86 and thus see if there is a performance penalty; I don't know the other architectures well enough to make a statement about this.
I am not worried about compile times. In my first straightforward O(N^2) prototype implementation this finishes in virtually no time for lists with hundreds of constants. With thousands it really started to take some time for number crunching. This more efficient implementation had no problems at all (but I don't have hard numbers for the test suite, will also do that tomorrow).
I haven't run actual numbers, but I did diff a few Spec2006 binaries with this patch applied. In general, I see more instructions on AArch64 (a target that doesn't implement the isImmediateInRangeForLoad hook).
For example, here's the diff for gcc:
Opcode static count diff summary:
-304 movk w # -258 sub w w # -37 mov w # -27 ldr x [x #] -27 add w w w -18 mov x x -16 sub w w # lsl # -9 b -6 ldp x x [x #] # -6 stp x x [x #]! -3 fadd s s s -2 sub x x # lsl # -2 str s [x #] -2 ldr s [x #] -1 fsub s s s -1 fmov s w -1 ldp s s [x #] -1 fmul s s s -1 stp s s [x #] -1 sub x x # -1 add x x x -1 scvtf s x 1 orr w w # 1 cbz/nz w 1 ldrb w [x #] 2 and w w # 3 orr w w w 3 cbz/nz x 4 str x [x #] 5 str w [x #] 5 ldr x [x #] # 5 str x [x #]! 6 add x x # lsl # 7 ldr w [x #] 9 bl 19 ldp x x [x #] 20 mov w w 22 stp x x [x #] 43 add x x # 54 add w w # lsl # 57 adrp x 783 add w w # ------------------------- 1050 added (excluding nops) 725 removed (excluding nops) 325 net (excluding nops) 1050 added 725 removed 325 net
In short, an additional 325 static instructions are introduced.
Here's the diff for Sphinx:
Opcode static count diff summary:
-22 mov x x -16 adrp x -5 mov w # -3 sub x x # -2 stp x x [x #] -1 ldrb w [x #]! -1 bl -1 orr w w # -1 strb w [x #]! -1 ldp x x [x #] -1 ldr w [x #]! -1 movk w # 1 ldrb w [x #] 1 ldr w [x #] 3 str x [x #] 4 b 7 add x x # lsl # 13 add x x # 31 ldr x [x #] ------------------------- 60 added (excluding nops) 55 removed (excluding nops) 5 net (excluding nops) 60 added 55 removed 5 net
I see many more loads, but I haven't investigated further. Interestingly enough for sjeng I see an opposite trend (more adds and fewer loads).
I'm going to see if I can implement the isImmediateInRangeForLoad for AArch64.
lib/Transforms/Scalar/ConstantHoisting.cpp | ||
---|---|---|
288 | Rather than wrap all these dbgs() print statements with the DEBUG macro, why not just wrap the call to printOffsetRange() with the DEBUG macro. IMO, this makes it more clear that this is just a debug dump in the context of the caller. |
I have some first performance results for 2 configurations: "cortex-a53, aarch32, –mthumb" and "cortex-a53, aarch64" and the results are fairly neutral. So some regression, some improvements; they really cancel each other out. I want to run a few more benchmarks, and also run the test suite on X86.
Next is also to get more numbers on code size.
Mhh. The fact that AArch64 doesn't even implement the TTI hook and shows different behavior is not great - especially since it seems to have increased code size.
In its current form this is still a very targeted optimization for Thumb1 load/store. Wouldn't an ARM specific MachineFunction pass be better for this?
Yes, I agree that in the results are somewhat disappointing; it looks like that it is beneficial for thumb but not for the other targets, at least not in it's current form. That suggests it should be an ARM/thumb specific optimisations; or the current implementation needs improvement. I will be looking further into this to see what's best/feasible.
Resigning to get this off my queue. Please add me back if you decide to proceed with this or a similar change.
Sorry for the delay (this work was interrupted by my holiday and some other work).
Compared to the previous patch, this are the changes:
- It still affects all targets, but the code triggers only at –Os
- We have traded readability/maintainability of the algorithm over efficiency. So it is now less efficient, but a simple check that sees if the list contains more than 100 items in the worklist makes the code fall back to the old algorithm. The algorithm is also slightly different as it sums up the costs of all uses and then subtracts values for offsets that are out of range; before we were not really tracking the costs of all uses.
- For X86 I have checked that the behaviour is the same as the old algorithm for the LNT suite.
- I don’t see any regressions for aarch32 thumb mode on LNT, coremark, dhrystone, eembc, but no improvements either. Our motivating example still sees the significant code size reduction though.
- I still do need an extra callback to get the costs for offsets. The reasons is that the existing getIntImmCosts functions checks which subtarget is being targeted. This doesn’t help because we want to know the costs of an immediate in a different ISA (i.e. thumb1).
include/llvm/Analysis/TargetTransformInfo.h | ||
---|---|---|
413 | Having this hook be context-free will hinder targets implementing it usefully. It should take the same arguments as getIntImmCost(), so that targets can give a contextual answer if they wish. | |
lib/Transforms/Scalar/ConstantHoisting.cpp | ||
288 | "different" | |
341 | You can use a range-for loop here. | |
342 | This should be hoisted out of the inner loop as it is it invariant. | |
347 | The formatting looks off here - have you used clang-format? |
This is looking fine to me now, but please wait for Mehdi or someone else to also look it over.
include/llvm/Analysis/TargetTransformInfoImpl.h | ||
---|---|---|
256 | The formatting here isn't LLVM style. The return sholud be on a new line, then the } on the next line. |
A few very superficial comments.
lib/Transforms/Scalar/ConstantHoisting.cpp | ||
---|---|---|
328 | Thanks very much for well documenting this! | |
384 | Any reason to add braces here? | |
386 | All this code could be sunk in maximizeConstantsInRange(). |
Please add a comment describing the new TTI method.