This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Combine ADD/SUB instructions when they contain a 24-bit immediate.
AbandonedPublic

Authored by red1bluelost on Dec 31 2021, 5:58 PM.

Details

Summary

This patch combines improves add/sub instructions that have 24-bit
immediates by turning the MOV-MOV-ADD/SUB into ADDI/SUBI-ADDI/SUBI using the
high and low 12-bit portions of the immediate.

For example, the following code:

int addi(int A) { return A + 0x111333; }

results in the assembly:

addi:                        // Without combine
        mov w8, #4915
        mov w8, #17, lsl #16
        add w0, w0, w8
        ret
addi:                        // With combine
        add w8, w0, #273, lsl #12
        add w0, w8, #819
        ret

This was implemented by adding patterns to MachineCombinerPattern and
handling the patterns in AArch64InstrInfo::genAlternativeCodeSequence and
AArch64InstrInfo::getMachineCombinerPatterns. The patterns match for scenarios
where the moved-immediate is in operand 1 or 2 of the ADD/SUB, the immediate can
be negated to produce a 24-bit immediate which will change the ADD to SUB and
SUB to ADD, and where a SUBREG_TO_REG is used to promote the i32 register to
a i64 register.

I originally implemented this combine through a TableGen Pat however this
caused some of the MADD combines to fail. With the ADD/SUB combine residing
in the MachineCombiner, MADD combines can be prioritizes when both patterns
exist.

If this design is accepted, ADDS/SUBS patterns could be added in another patch.

Testing:

Each MachineCombinerPattern is tested aarch64-combine-addsub-24bit-imm.mir,
a new file.

The addsub.ll test file has the typical scenarios in LLVM IR.

Other AArch64 test files had to be updated since this new combine was encountered
in those tests.

I ran ninja check-all on the code.

Diff Detail

Unit TestsFailed

Event Timeline

red1bluelost created this revision.Dec 31 2021, 5:58 PM
red1bluelost requested review of this revision.Dec 31 2021, 5:58 PM
Herald added a project: Restricted Project. · View Herald TranscriptDec 31 2021, 5:58 PM
red1bluelost edited the summary of this revision. (Show Details)Dec 31 2021, 6:01 PM

Oh interesting. This is similar to D111034, but that was reverted again. I'm not sure why, apparently MIPeephole optimizations are all too easy to get wrong.

The problem with doing this is when it is done in a loop. Something like this example: https://godbolt.org/z/e5f4hWGcq, where preferably the loop invariant mov can be hoisted out of the loop, leaving a single add. Can we make sure there is a test case for that example, and try and guard against it?

Also, I think a ADD+MOVi16 might be slightly better than a ADD+ADD. (As in - the MOV is a 16bit imm that can be materialized with a single instruction). The two ADDs might make a longer critical path. It's if the MOV pseudo would need multiple instructions that the add becomes beneficial.

Thanks Dave! I didn't know about all that.

I'll give those a look, add some test cases, and update this patch if it can work out.

Addresses feedback for patch.

A test file, aarch64-combine-addsub-imm-reject-loop.mir, was added to check
if the ADD/SUB combine would affect loop invariants hoisting, showing a regression.
To fix the regression, a CombinerObjective was added for these
MachineCombinerPatterns that they must not exist in a loop. During machine
combining, these patterns will check if the Root is inside a loop, skipping the
combine if that condition holds.

To address the priority to keep 16-bit immediates as MOV-ADD, a few tests were
added to aarch64-combine-addsub-24bit-imm.mir and code modifications. Now, 24-bit
immediates must use bits 23-16 so that it guarentees the combine only reduces
MOV-MOV-ADD to ADDI-ADDI.

Forgot one of the files in the update.

Sorry, I'm getting used to patches and arcanist.

dmgreen added inline comments.
llvm/lib/Target/AArch64/AArch64InstrInfo.cpp
4795

Can we generalize this to "Any MOV that will be > 1 instruction"? It may be possible to use expandMOVImm for that, and check the number of instructions.

The rules for what makes a single instruction are difficult to represent simply, and can be more than a 16bit imm. As in https://godbolt.org/z/KjKhMjb7v.

4838

Should this be a uint64_t?

5559

LLVM tends not to go too heavy on the doxygen comments, especially for obvious parameters like MF/MRI/TII.

llvm/test/CodeGen/AArch64/aarch64-combine-addsub-24bit-imm.mir
3

I don't think this needs -O0

344

I think this might be negative of the correct result.

(But equally I don't think that SUBXrr will be present at this point in the pipeline. It will be speculatively emitted as a SUBSXrr as that may be used as a compare).

llvm/test/CodeGen/AArch64/aarch64-combine-addsub-imm-reject-loop.mir
2 ↗(On Diff #397195)

This one might be better as a .ll file. That way we test the entire backend, and can see obvious regressions when the instruction count increases.

benshi001 added a comment.EditedJan 6 2022, 12:39 AM

Oh interesting. This is similar to D111034, but that was reverted again. I'm not sure why, apparently MIPeephole optimizations are all too easy to get wrong.

The problem with doing this is when it is done in a loop. Something like this example: https://godbolt.org/z/e5f4hWGcq, where preferably the loop invariant mov can be hoisted out of the loop, leaving a single add. Can we make sure there is a test case for that example, and try and guard against it?

Also, I think a ADD+MOVi16 might be slightly better than a ADD+ADD. (As in - the MOV is a 16bit imm that can be materialized with a single instruction). The two ADDs might make a longer critical path. It's if the MOV pseudo would need multiple instructions that the add becomes beneficial.

I am quite sorry for that. Though make check-all passed and everything was good on amd64, the clang built into aarch64 ELF and running on aarch64-linux crashed. And I have no aarch64 host machine to debug that (although I guess it should be a minor bug).

I am quite sorry for that. Though make check-all passed and everything was good on amd64, the clang built into aarch64 ELF and running on aarch64-linux crashed. And I have no aarch64 host machine to debug that (although I guess it should be a minor bug).

Oh I see. Sorry for not seeing that message. I hadn't realized it was reverted.

I have no strong opinion whether this is best done as an AArch64MIPeephole or in MachineCombine. They are probably both fine for this kind of thing. Whichever @red1bluelost thinks will be easiest to extend to SUBS instructions that write nzcv sounds OK to me, if the goal here is to address https://github.com/llvm/llvm-project/issues/51482.

I am quite sorry for that. Though make check-all passed and everything was good on amd64, the clang built into aarch64 ELF and running on aarch64-linux crashed. And I have no aarch64 host machine to debug that (although I guess it should be a minor bug).

Oh I see. Sorry for not seeing that message. I hadn't realized it was reverted.

I have no strong opinion whether this is best done as an AArch64MIPeephole or in MachineCombine. They are probably both fine for this kind of thing. Whichever @red1bluelost thinks will be easiest to extend to SUBS instructions that write nzcv sounds OK to me, if the goal here is to address https://github.com/llvm/llvm-project/issues/51482.

Either method (Peephole or MachineCombine) would be easy to extend for the SUBS instructions. I think the Peephole approach would be better, it seems more encapsulated than my approach.

What is the status of D111034 @benshi001, is it likely to be committed or is that bug still holding it up? The buildbot logs don't appear anymore so I can't see where that bug happened. Could that bug also be present with my MachineCombine patch?

I am quite sorry for that. Though make check-all passed and everything was good on amd64, the clang built into aarch64 ELF and running on aarch64-linux crashed. And I have no aarch64 host machine to debug that (although I guess it should be a minor bug).

Oh I see. Sorry for not seeing that message. I hadn't realized it was reverted.

I have no strong opinion whether this is best done as an AArch64MIPeephole or in MachineCombine. They are probably both fine for this kind of thing. Whichever @red1bluelost thinks will be easiest to extend to SUBS instructions that write nzcv sounds OK to me, if the goal here is to address https://github.com/llvm/llvm-project/issues/51482.

Either method (Peephole or MachineCombine) would be easy to extend for the SUBS instructions. I think the Peephole approach would be better, it seems more encapsulated than my approach.

What is the status of D111034 @benshi001, is it likely to be committed or is that bug still holding it up? The buildbot logs don't appear anymore so I can't see where that bug happened. Could that bug also be present with my MachineCombine patch?

@red1bluelost, you are appreciated to continue this optimization along with the peephole approach, if you are sure it no longer crashes on aarch64-linux host. As I have explained, I have no aarch64-linux host and can not gurantee that. Thanks! I am glad if anybody can add this optmization to the main branch ASAP.

I am quite sorry for that. Though make check-all passed and everything was good on amd64, the clang built into aarch64 ELF and running on aarch64-linux crashed. And I have no aarch64 host machine to debug that (although I guess it should be a minor bug).

Oh I see. Sorry for not seeing that message. I hadn't realized it was reverted.

I have no strong opinion whether this is best done as an AArch64MIPeephole or in MachineCombine. They are probably both fine for this kind of thing. Whichever @red1bluelost thinks will be easiest to extend to SUBS instructions that write nzcv sounds OK to me, if the goal here is to address https://github.com/llvm/llvm-project/issues/51482.

Either method (Peephole or MachineCombine) would be easy to extend for the SUBS instructions. I think the Peephole approach would be better, it seems more encapsulated than my approach.

What is the status of D111034 @benshi001, is it likely to be committed or is that bug still holding it up? The buildbot logs don't appear anymore so I can't see where that bug happened. Could that bug also be present with my MachineCombine patch?

@red1bluelost, you are appreciated to continue this optimization along with the peephole approach, if you are sure it no longer crashes on aarch64-linux host. As I have explained, I have no aarch64-linux host and can not gurantee that. Thanks! I am glad if anybody can add this optmization to the main branch ASAP.

Ok. I will try to see if I can find the bug with the Peephole if possible. I have been testing it this weekend through the Linaro Docker container mentioned in D111034 but using QEMU emulation of AArch64. It seems like it can replicate the aarch64-linux host environment but is just really slow. (I don't have an aarch64 host either).

I'll also see about finishing up the MachineCombine approach and testing it too, incase it would be easier to just finish this approach.

If I manage to fix the Peephole approach, should I edit D111034? Or should I start a new patch review for the Peephole?

benshi001 added a comment.EditedJan 10 2022, 8:01 PM

I am quite sorry for that. Though make check-all passed and everything was good on amd64, the clang built into aarch64 ELF and running on aarch64-linux crashed. And I have no aarch64 host machine to debug that (although I guess it should be a minor bug).

Oh I see. Sorry for not seeing that message. I hadn't realized it was reverted.

I have no strong opinion whether this is best done as an AArch64MIPeephole or in MachineCombine. They are probably both fine for this kind of thing. Whichever @red1bluelost thinks will be easiest to extend to SUBS instructions that write nzcv sounds OK to me, if the goal here is to address https://github.com/llvm/llvm-project/issues/51482.

Either method (Peephole or MachineCombine) would be easy to extend for the SUBS instructions. I think the Peephole approach would be better, it seems more encapsulated than my approach.

What is the status of D111034 @benshi001, is it likely to be committed or is that bug still holding it up? The buildbot logs don't appear anymore so I can't see where that bug happened. Could that bug also be present with my MachineCombine patch?

@red1bluelost, you are appreciated to continue this optimization along with the peephole approach, if you are sure it no longer crashes on aarch64-linux host. As I have explained, I have no aarch64-linux host and can not gurantee that. Thanks! I am glad if anybody can add this optmization to the main branch ASAP.

Ok. I will try to see if I can find the bug with the Peephole if possible. I have been testing it this weekend through the Linaro Docker container mentioned in D111034 but using QEMU emulation of AArch64. It seems like it can replicate the aarch64-linux host environment but is just really slow. (I don't have an aarch64 host either).

I'll also see about finishing up the MachineCombine approach and testing it too, incase it would be easier to just finish this approach.

If I manage to fix the Peephole approach, should I edit D111034? Or should I start a new patch review for the Peephole?

I suggest you start a new patch review request and mention D111034 :)

Since we already have https://reviews.llvm.org/D117429, can we abandon this ?

red1bluelost abandoned this revision.Jan 17 2022, 4:45 AM

Abandoning revision in favor of D117429.