The ARM backend contains code that tries to optimize compares by replacing them with an existing instruction that sets the flags the same way. This allows it to replace a "cmp" with a "adds", generalizing the code that replaces "cmp" with "sub". It also heuristically disables sinking of instructions that could potentially be used to replace compares (currently only if they're next to each other).
Details
Diff Detail
Event Timeline
Hi Joel,
Is my impression correct that this should improve code size and potentially performance in general, not just for the overflow intrinsics?
If so, did you manage to collect any numbers of the impact of this patch on code size and/or performance on some benchmark?
One other high-level thought is that it isn't just sub and add instructions that can set flags and potentially remove the need for a compare instruction.
I'm afraid I don't have a feel for how often a compare can be eliminate by using subs/adds or other flag-setting instructions.
I wonder if the changes in this patch could be generalized to enable flag-setting instructions other than adds/subs to remove compare instructions too?
Not that I think that necessarily needs to be done in this patch - just wondering if this patch is an incremental step towards enabling even more removal of compare instructions.
Thanks!
Kristof
Hi Kristof,
Yes, this in theory could improve performance and code size in general, not just with the overflow intrinsics. I hadn't thought about that, though, so I hadn't looked into it before. I just ran a couple smallish benchmarks and saw no change (I did, however, find a bug in my patch, so it was useful!), so I doubt it makes a large difference, but further testing would be good. Unfortunately, I'm busy with something else at the moment, and I won't have too much time on this for at least a few days.
As for generalizing this patch, I believe you're right, although I don't actually know ARM very well. But, for example, I would think we could use an ANDS to replace a CMP EQ. My guess is that these would be relatively uncommon, but I could certainly be wrong, so it probably would be useful to look into this more.
lib/Target/ARM/ARMBaseInstrInfo.cpp | ||
---|---|---|
2569 | Did you mean t2ADDrr in second case? You have two conditions checking for ARM::ADDrr. Same wit ARM::ADDri |
lib/Target/ARM/ARMBaseInstrInfo.cpp | ||
---|---|---|
2569 | I'm not sure what you mean. This seems reasonable to me. I'm using the result and first operand of the add to match the two operands of the compare, so I don't care what the last operand of the add is. I thus don't care whether it's a register or an immediate, and it can also be either ARM or Thumb. So don't all four of these cases work? |
lib/Target/ARM/ARMBaseInstrInfo.cpp | ||
---|---|---|
2569 | OK, my bad. I missed out the "t2" on my screen. All good. |
The change itself looks good, I only have one question to clarify a part of the code, but otherwise seems fine.
I agree with Kristof that this could span across more than just overflow, but it can be done in later patches.
The benefits are clear, even from the few tests that changed, but it would be good to have a simple test-suite run in benchmark mode, just to make sure we're not creating any new pathological case on random programs.
cheers,
--renato
lib/Target/ARM/ARMBaseInstrInfo.cpp | ||
---|---|---|
2732 | Can you elaborate on this change? |
The benefits are clear, even from the few tests that changed, but it would be good to have a simple test-suite run in benchmark mode, just to make sure we're not creating any new pathological case on random programs.
Are there instructions for how to run lnt remotely? I can run it locally, but since I'm on an x86 machine that won't help test this.
lib/Target/ARM/ARMBaseInstrInfo.cpp | ||
---|---|---|
2732 | Sure. In at least one of my testcases (the first one in su-addsub-overflow.ll), I have a basic block with the following: %vreg0 = ADDrr %vreg2, %vreg3 Here, the compare is CmpInstr and the ADDrr is MI. But MI gets set to nullptr a little above this because SrcReg2 != 0. Without this line here, MI and SubAdd would both be nullptr, and so we'd do nothing. But I want to allow what was MI to be SubAdd, since if it's a sub or add it can potentially replace the compare. E was initialized to be MI, and the loop just below looks for SubAdd by walking backwards up to E, so I decrement E to allow it to consider MI. Does that clear it up? Should I add some of that to this comment? |
lib/Target/ARM/ARMBaseInstrInfo.cpp | ||
---|---|---|
2676–2680 | No, I'd missed that. Thanks for pointing it out. | |
2744–2748 | Not really. I believe the reason was that if the instruction I modifies CSPR, nothing before it can replace the compare, but it still can itself. Swapping these two checks allowed that case. However, when I undo this part of the change locally, my tests still pass. I'm pretty sure I needed this before, but I guess something changed and I don't need it any longer. So I can revert this piece if you'd like, although I still think the new way is better in theory. |
lib/Target/ARM/ARMBaseInstrInfo.cpp | ||
---|---|---|
2744–2748 | Thanks for the clarification, I'd leave it as it was unless it is easy to add a test that proves that this is certainly better for some cases but maybe was some interaction with my change in D35192 (while it was in trunk). | |
2902 | Sorry, I fail to see where are you using this new function. |
lib/Target/ARM/ARMBaseInstrInfo.cpp | ||
---|---|---|
2902 | Ah! Right. I hadn't noticed the override keyword above. Sorry for the fuss. | |
test/CodeGen/ARM/su-addsub-overflow.ll | ||
124 | I understand this is a "sanity check"-like test and your change does not include new code to avoid this problem (all the machinery was already there). Is this correct? |
Thanks for the comments. I'll update the patch in a minute.
test/CodeGen/ARM/su-addsub-overflow.ll | ||
---|---|---|
124 | Yes, mostly. I added this test in my second version of this patch after seeing a bug in the first version. I had initially made an incorrect change, and I reverted that part of it and added this test. (You can diff my first two versions of this patch against each other to see the details.) So this patch doesn't contain new code to avoid this; it just makes sure I don't make that mistake again. |
Are there any more comments, or does this look good?
lib/Target/ARM/ARMBaseInstrInfo.cpp | ||
---|---|---|
2905 | I initially decided to look only at the preceding instruction to avoid the potential quadratic behavior of scanning the whole block. I just now gathered some data by compiling a few random files with this and a version that walks as far back as it can to look for a compare. In a few cases that allows us to optimize slightly more compares, but not many, and in most cases there's no difference. Thus given the extra compile time overhead and the chance that it will prevent us from sinking instructions we want to sink, keeping this heuristic for now seems good to me. Does that seem reasonable? |
LGTM. Wait a couple of days before commiting just in case @efriedma has further comments.
Thanks!
Hi,
The patch caused regressions in LNT benchmarks on Cortex-A9:
SingleSource/Benchmarks/McGill/chomp: 12.79%
MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow: 9.00%
There are also ~10% regressions in our benchmarks on Cortex-M4 and Cortex-M33 due to the patch.
I'm analysing them.
Thanks,
Evgeny Astigeevich
In SingleSource/Benchmarks/McGill/chomp the patch causes generation of subs+cmp_with_0 instead of only a subs.
Thanks for the report. I think I can reproduce it on McGill. I'm looking into it now.
The example I found is in equal_data, which I assume is being inlined in make_list.
I have a (one-line) fix for that, and I'm trying to run the test suite to see what else it improves. However, I'm not seeing any changes in fourinarow, and I don't immediately see any obvious bad code in it. Do you happen to know where the regression in it was?
In function 'value,' blocks where:
Before
it eq orrseq.w r1, r3, r4 bne.n b936 <value+0x1c66>
After
itt eq orreq.w r1, r3, r4 cmpeq r1, #0 bne.n b946 <value+0x1c76>
Before
ittt eq ldreq.w r0, [sp, #1712] ; 0x6b0 ldreq.w r1, [sp, #1708] ; 0x6ac orrseq.w r1, r1, r0 bne.n c7fa <value+0x2b2a>
After
ittt eq ldreq.w r0, [sp, #1700] ; 0x6a4 orreq r0, r6 cmpeq r0, #0 bne.n c810 <value+0x2b40>
Full command:
/home/llvm-test/aarch32-rL322735-rC322729-rX322703/bin/clang -DNDEBUG -I/work/llvm-test-suite/MultiSource/Benchmarks/FreeBench/fourinarow -I/home/llvm-test/SANDBOX/test-2018-01-18_21-02-41/MultiSource/Benchmarks/FreeBench/fourinarow -O3 -DNDEBUG -mcpu=cortex-a9 -fomit-frame-pointer -mthumb -O3 -DNDEBUG -w -Werror=date-time -DVERSION=\"1.00\" -DCOMPDATE=\"today\" -DCFLAGS=\"\" -DHOSTNAME=\"thishost\" -o CMakeFiles/fourinarow.dir/fourinarow.c.o -c /work/llvm-test-suite/MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.c
Did you mean t2ADDrr in second case? You have two conditions checking for ARM::ADDrr. Same wit ARM::ADDri