Page MenuHomePhabricator

[AArch64][SelectionDAG] Lower multiplication by a constant to shl+add+shl+add
ClosedPublic

Authored by Allen on Oct 7 2022, 4:44 AM.

Details

Summary

Change the costmodel to lower a = b * C where C = (1 + 2^m) * (1 + 2^n) to

add   w8, w0, w0, lsl #m
add   w0, w8, w8, lsl #n

Note: The latency can vary depending on the shirt amount

Diff Detail

Event Timeline

Allen created this revision.Oct 7 2022, 4:44 AM
Allen requested review of this revision.Oct 7 2022, 4:44 AM
Herald added a project: Restricted Project. · View Herald TranscriptOct 7 2022, 4:44 AM

This is getting into the territory of actually being slower than a "mul", depending on the latency of "mul" and "add-with-shift" on the target CPU... we probably need CPU-specific modeling if you want to go this direction.

Allen added a comment.Oct 7 2022, 6:08 PM

This is getting into the territory of actually being slower than a "mul", depending on the latency of "mul" and "add-with-shift" on the target CPU... we probably need CPU-specific modeling if you want to go this direction.

Thanks. As the Selection DAG doesn't include schedule model, so this should be checked in machine combiner?

MF.getSubtarget().getSchedModel() should work in SelectionDAG.

The tricky things here are:

  • Actually getting the right variant out of the scheduler is a little tricky. You basically have to construct an MCInst. (Note that the latency can vary depending on the shirt amount.)
  • We don't have actually have accurate scheduling models for all the chips we care about. It tends to be something that requires a lot of effort for very little effect, so a lot of CPUs just use the A57 model.

Maybe to start, just try to figure out which targets have "free" shifts, and turn on the optimization for the cases that involve those shifts? (Multiple cores have a small shift optimization, where left shifts of 4 or less don't increase the latency of an add.)

Out of interest, what cases do you have where mul is worse than add+shift + add+shift? From the look of the tv100 scheduling model it would seem to be 3/4 cycles for the mul (depending on whether it is i32 or i64) vs 2+2 for the add+shifts. Are small shifts really free, as in FeatureLSLFast?

Allen added a comment.Oct 12 2022, 4:04 AM

Out of interest, what cases do you have where mul is worse than add+shift + add+shift? From the look of the tv100 scheduling model it would seem to be 3/4 cycles for the mul (depending on whether it is i32 or i64) vs 2+2 for the add+shifts. Are small shifts really free, as in FeatureLSLFast?

I read from the a new spec , which I'm working on, the Latency of add+shifts is 1 when the value of shift small. At the same time, I happen to see a TODO in the upstream code :)
Thanks for your detail suggestion, I'll add the check of FeatureLSLFast or distinguish more shift-related values.

khchen added a subscriber: khchen.Oct 12 2022, 7:39 AM
Allen updated this revision to Diff 468008.Oct 15 2022, 4:55 AM

add condition Subtarget->hasLSLFast() according comment

dmgreen added inline comments.Oct 17 2022, 12:19 AM
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15145

-> TODO, shift.

The documentation for LSLFast says that Shifts <= 3 places are fast, which is the limit for most address offsets. Modern cores usually have free shifts <= 4 places. (They tend to have cheap multiplies too, if they can perform fast shifts).

I was considering putting a LSLFast4 option in when I recently enabled LSLFast for Arm cores, but as far as I understand the LSLFast option current doesn't actually apply to Add instructions like it should at the moment. We can check that ShiftM1 and ShiftN1 are <= 3 here though, and maybe change the subtarget feature for shifts of 4?

Allen updated this revision to Diff 468208.Oct 17 2022, 8:07 AM
Allen edited the summary of this revision. (Show Details)

Add condition ShiftM1 <= 3 && ShiftM1 <= 3 for LSLFast

Allen marked an inline comment as done.Oct 17 2022, 8:08 AM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15145

Apply your comment, thanks

Can you add some tests for multiplying by larger values. Maybe 165 and 297.

llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15145

You can remove the comment now then.

15151

Can this be moved into the if..

Allen updated this revision to Diff 468544.Oct 18 2022, 7:59 AM
Allen marked 2 inline comments as done.

Add 3 new test cases

Allen marked an inline comment as done.Oct 18 2022, 8:00 AM
Allen added inline comments.
llvm/lib/Target/AArch64/AArch64ISelLowering.cpp
15145

Done

15151

Done, thanks

Allen updated this revision to Diff 469122.Oct 19 2022, 11:34 PM
Allen marked an inline comment as done.

Retry the commit as a unexpectedpre-merge checks fail

Allen added a comment.Oct 20 2022, 9:21 AM

any new suggestion about the last update?

dmgreen accepted this revision.Oct 20 2022, 9:24 AM

LGTM, thanks for adding the tests.

This revision is now accepted and ready to land.Oct 20 2022, 9:24 AM