This is an archive of the discontinued LLVM Phabricator instance.

[ARM] Optimize {s,u}{add,sub}.with.overflow.
ClosedPublic

Authored by jgalenson on Sep 28 2017, 12:39 PM.

Details

Summary

The ARM backend contains code that tries to optimize compares by replacing them with an existing instruction that sets the flags the same way. This allows it to replace a "cmp" with a "adds", generalizing the code that replaces "cmp" with "sub". It also heuristically disables sinking of instructions that could potentially be used to replace compares (currently only if they're next to each other).

Diff Detail

Event Timeline

jgalenson created this revision.Sep 28 2017, 12:39 PM

Hi Joel,

Is my impression correct that this should improve code size and potentially performance in general, not just for the overflow intrinsics?
If so, did you manage to collect any numbers of the impact of this patch on code size and/or performance on some benchmark?

One other high-level thought is that it isn't just sub and add instructions that can set flags and potentially remove the need for a compare instruction.
I'm afraid I don't have a feel for how often a compare can be eliminate by using subs/adds or other flag-setting instructions.
I wonder if the changes in this patch could be generalized to enable flag-setting instructions other than adds/subs to remove compare instructions too?
Not that I think that necessarily needs to be done in this patch - just wondering if this patch is an incremental step towards enabling even more removal of compare instructions.

Thanks!

Kristof

jgalenson updated this revision to Diff 117225.Sep 29 2017, 2:44 PM
jgalenson retitled this revision from Optimize {s|u}add.with.overflow on ARM. to Optimize {s,u}{add,sub}.with.overflow on ARM..
jgalenson added a reviewer: kristof.beyls.

Hi Kristof,

Yes, this in theory could improve performance and code size in general, not just with the overflow intrinsics. I hadn't thought about that, though, so I hadn't looked into it before. I just ran a couple smallish benchmarks and saw no change (I did, however, find a bug in my patch, so it was useful!), so I doubt it makes a large difference, but further testing would be good. Unfortunately, I'm busy with something else at the moment, and I won't have too much time on this for at least a few days.

As for generalizing this patch, I believe you're right, although I don't actually know ARM very well. But, for example, I would think we could use an ANDS to replace a CMP EQ. My guess is that these would be relatively uncommon, but I could certainly be wrong, so it probably would be useful to look into this more.

javed.absar added inline comments.Sep 30 2017, 12:31 AM
lib/Target/ARM/ARMBaseInstrInfo.cpp
2569

Did you mean t2ADDrr in second case? You have two conditions checking for ARM::ADDrr. Same wit ARM::ADDri

jgalenson added inline comments.Sep 30 2017, 9:06 AM
lib/Target/ARM/ARMBaseInstrInfo.cpp
2569

I'm not sure what you mean. This seems reasonable to me. I'm using the result and first operand of the add to match the two operands of the compare, so I don't care what the last operand of the add is. I thus don't care whether it's a register or an immediate, and it can also be either ARM or Thumb. So don't all four of these cases work?

jgalenson retitled this revision from Optimize {s,u}{add,sub}.with.overflow on ARM. to [ARM] Optimize {s,u}{add,sub}.with.overflow..Nov 9 2017, 9:31 AM
javed.absar added inline comments.Nov 9 2017, 12:39 PM
lib/Target/ARM/ARMBaseInstrInfo.cpp
2569

OK, my bad. I missed out the "t2" on my screen. All good.

rengolin edited edge metadata.Nov 10 2017, 10:25 PM

The change itself looks good, I only have one question to clarify a part of the code, but otherwise seems fine.

I agree with Kristof that this could span across more than just overflow, but it can be done in later patches.

The benefits are clear, even from the few tests that changed, but it would be good to have a simple test-suite run in benchmark mode, just to make sure we're not creating any new pathological case on random programs.

cheers,
--renato

lib/Target/ARM/ARMBaseInstrInfo.cpp
2732

Can you elaborate on this change?

The benefits are clear, even from the few tests that changed, but it would be good to have a simple test-suite run in benchmark mode, just to make sure we're not creating any new pathological case on random programs.

Are there instructions for how to run lnt remotely? I can run it locally, but since I'm on an x86 machine that won't help test this.

lib/Target/ARM/ARMBaseInstrInfo.cpp
2732

Sure.

In at least one of my testcases (the first one in su-addsub-overflow.ll), I have a basic block with the following:

%vreg0 = ADDrr %vreg2, %vreg3
CMPrr %vreg0, %vreg2

Here, the compare is CmpInstr and the ADDrr is MI. But MI gets set to nullptr a little above this because SrcReg2 != 0. Without this line here, MI and SubAdd would both be nullptr, and so we'd do nothing. But I want to allow what was MI to be SubAdd, since if it's a sub or add it can potentially replace the compare. E was initialized to be MI, and the loop just below looks for SubAdd by walking backwards up to E, so I decrement E to allow it to consider MI.

Does that clear it up? Should I add some of that to this comment?

jgalenson updated this revision to Diff 122742.Nov 13 2017, 3:55 PM

Just rebasing this on top of the latest changes.

rogfer01 added inline comments.Nov 14 2017, 12:08 AM
lib/Target/ARM/ARMBaseInstrInfo.cpp
2676–2680

Maybe I'm reading it wrong: should this comment say something like the other is a SUB or ADD instruction?

2744–2748

It is not clear to me why you had to swap this check with the one below? Is it related to the --E; in line 2731?

jgalenson added inline comments.Nov 14 2017, 9:35 AM
lib/Target/ARM/ARMBaseInstrInfo.cpp
2676–2680

No, I'd missed that. Thanks for pointing it out.

2744–2748

Not really. I believe the reason was that if the instruction I modifies CSPR, nothing before it can replace the compare, but it still can itself. Swapping these two checks allowed that case.

However, when I undo this part of the change locally, my tests still pass. I'm pretty sure I needed this before, but I guess something changed and I don't need it any longer. So I can revert this piece if you'd like, although I still think the new way is better in theory.

jgalenson updated this revision to Diff 122860.Nov 14 2017, 9:40 AM
jgalenson added a reviewer: rogfer01.

Update comment that I'd missed.

Are there any other comments?

rogfer01 added inline comments.Nov 29 2017, 10:14 AM
lib/Target/ARM/ARMBaseInstrInfo.cpp
2744–2748

Thanks for the clarification, I'd leave it as it was unless it is easy to add a test that proves that this is certainly better for some cases but maybe was some interaction with my change in D35192 (while it was in trunk).

2902

Sorry, I fail to see where are you using this new function.

jgalenson added inline comments.Nov 29 2017, 10:19 AM
lib/Target/ARM/ARMBaseInstrInfo.cpp
2744–2748

Sure, I'll revert this part of the change.

2902

I'm not calling it anywhere, but it's overriding a virtual method method in TargetInstrInfo.h that controls whether or not MachineSink should sink the given instruction.

I've now reverted the part of the change mentioned in the last couple comments.

rogfer01 edited edge metadata.Dec 7 2017, 1:00 AM

This looks sensible to me. Extend the comment in that --E; above at line 2732. If @efriedma or @rengolin do not have further comments I believe this is ready to land.

lib/Target/ARM/ARMBaseInstrInfo.cpp
2902

Ah! Right. I hadn't noticed the override keyword above. Sorry for the fuss.

test/CodeGen/ARM/su-addsub-overflow.ll
124

I understand this is a "sanity check"-like test and your change does not include new code to avoid this problem (all the machinery was already there). Is this correct?

Thanks for the comments. I'll update the patch in a minute.

test/CodeGen/ARM/su-addsub-overflow.ll
124

Yes, mostly. I added this test in my second version of this patch after seeing a bug in the first version. I had initially made an incorrect change, and I reverted that part of it and added this test. (You can diff my first two versions of this patch against each other to see the details.) So this patch doesn't contain new code to avoid this; it just makes sure I don't make that mistake again.

jgalenson updated this revision to Diff 125985.Dec 7 2017, 9:53 AM

Update the comment above the E--.

Now that D35192 has re-landed I need to update this patch.

Note that I needed to bring back the swapped conditions that I had before to get the tests to pass. It looks like that was in fact related to D35192.

Are there any more comments, or does this look good?

lib/Target/ARM/ARMBaseInstrInfo.cpp
2905

I initially decided to look only at the preceding instruction to avoid the potential quadratic behavior of scanning the whole block. I just now gathered some data by compiling a few random files with this and a version that walks as far back as it can to look for a compare. In a few cases that allows us to optimize slightly more compares, but not many, and in most cases there's no difference. Thus given the extra compile time overhead and the chance that it will prevent us from sinking instructions we want to sink, keeping this heuristic for now seems good to me. Does that seem reasonable?

Friendly new year's ping. Are there any other comments?

rogfer01 accepted this revision.Jan 12 2018, 3:50 AM

LGTM. Wait a couple of days before commiting just in case @efriedma has further comments.

Thanks!

This revision is now accepted and ready to land.Jan 12 2018, 3:50 AM
This revision was automatically updated to reflect the committed changes.
eastig added a subscriber: eastig.Jan 18 2018, 9:55 AM

Hi,

The patch caused regressions in LNT benchmarks on Cortex-A9:

SingleSource/Benchmarks/McGill/chomp: 12.79%
MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow: 9.00%

There are also ~10% regressions in our benchmarks on Cortex-M4 and Cortex-M33 due to the patch.

I'm analysing them.

Thanks,
Evgeny Astigeevich

In SingleSource/Benchmarks/McGill/chomp the patch causes generation of subs+cmp_with_0 instead of only a subs.

Thanks for the report. I think I can reproduce it on McGill. I'm looking into it now.

Thank you, Joel. The function to look at is make_list.

The example I found is in equal_data, which I assume is being inlined in make_list.

I have a (one-line) fix for that, and I'm trying to run the test suite to see what else it improves. However, I'm not seeing any changes in fourinarow, and I don't immediately see any obvious bad code in it. Do you happen to know where the regression in it was?

The example I found is in equal_data, which I assume is being inlined in make_list.

I have a (one-line) fix for that, and I'm trying to run the test suite to see what else it improves. However, I'm not seeing any changes in fourinarow, and I don't immediately see any obvious bad code in it. Do you happen to know where the regression in it was?

In function 'value,' blocks where:

Before

it eq
orrseq.w r1, r3, r4
bne.n b936 <value+0x1c66>

After

itt eq
orreq.w r1, r3, r4
cmpeq r1, #0
bne.n b946 <value+0x1c76>

Before

ittt eq
ldreq.w r0, [sp, #1712] ; 0x6b0
ldreq.w r1, [sp, #1708] ; 0x6ac
orrseq.w r1, r1, r0
bne.n c7fa <value+0x2b2a>

After

ittt eq
ldreq.w r0, [sp, #1700] ; 0x6a4
orreq r0, r6
cmpeq r0, #0
bne.n c810 <value+0x2b40>

What command-line arguments are you using?

What command-line arguments are you using?

-O3 -DNDEBUG -mcpu=cortex-a9 -mthumb -fomit-frame-pointer

What command-line arguments are you using?

-O3 -DNDEBUG -mcpu=cortex-a9 -mthumb -fomit-frame-pointer

Full command:

/home/llvm-test/aarch32-rL322735-rC322729-rX322703/bin/clang  -DNDEBUG -I/work/llvm-test-suite/MultiSource/Benchmarks/FreeBench/fourinarow -I/home/llvm-test/SANDBOX/test-2018-01-18_21-02-41/MultiSource/Benchmarks/FreeBench/fourinarow  -O3 -DNDEBUG -mcpu=cortex-a9 -fomit-frame-pointer -mthumb   -O3 -DNDEBUG   -w -Werror=date-time -DVERSION=\"1.00\" -DCOMPDATE=\"today\" -DCFLAGS=\"\" -DHOSTNAME=\"thishost\" -o CMakeFiles/fourinarow.dir/fourinarow.c.o   -c /work/llvm-test-suite/MultiSource/Benchmarks/FreeBench/fourinarow/fourinarow.c

Thanks!

I believe D42263 should fix that as well.