Change the costmodel to lower a = b * C where C = (1 + 2^m) * (1 + 2^n) to
  add w8, w0, w0, lsl #m
  add w0, w8, w8, lsl #n
Note: The latency can vary depending on the shift amount.
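The arithmetic behind the summary above can be checked with a small standalone sketch. It is not the patch's code: `mulByTwoShiftAdds` and `findShiftPair` are hypothetical names chosen for illustration, and the search assumes both shift amounts are at least 1.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <utility>

// Computes b * (1 + 2^m) * (1 + 2^n) with two shift-and-add steps,
// mirroring the two AArch64 `add ..., lsl #k` instructions.
// Unsigned arithmetic wraps modulo 2^32, like the w-register form.
uint32_t mulByTwoShiftAdds(uint32_t b, unsigned m, unsigned n) {
  assert(m < 32 && n < 32 && "shift amounts must fit a 32-bit register");
  uint32_t t = b + (b << m);  // add w8, w0, w0, lsl #m
  return t + (t << n);        // add w0, w8, w8, lsl #n
}

// Hypothetical helper: searches for (m, n), both >= 1, such that
// c == (1 + 2^m) * (1 + 2^n). Returns std::nullopt if no pair exists.
std::optional<std::pair<unsigned, unsigned>> findShiftPair(uint64_t c) {
  for (unsigned m = 1; m < 32; ++m) {
    uint64_t factor = 1 + (uint64_t{1} << m);
    if (c % factor != 0)
      continue;
    uint64_t q = c / factor;  // candidate for 1 + 2^n
    if (q < 3)
      continue;               // would need n < 1
    uint64_t p = q - 1;       // must be a power of two
    if ((p & (p - 1)) != 0)
      continue;
    unsigned n = 0;
    while ((uint64_t{1} << n) != p)
      ++n;
    return std::make_pair(m, n);
  }
  return std::nullopt;
}
```

For example, 45 = (1 + 2^2) * (1 + 2^3), so `findShiftPair(45)` yields the pair (2, 3) and `mulByTwoShiftAdds(b, 2, 3)` matches `b * 45`.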
Differential D135441
[AArch64][SelectionDAG] Lower multiplication by a constant to shl+add+shl+add
Authored by Allen on Oct 7 2022, 4:44 AM.
This is getting into the territory of actually being slower than a "mul", depending on the latency of "mul" and "add-with-shift" on the target CPU... we probably need CPU-specific modeling if you want to go this direction.

Thanks. Since SelectionDAG doesn't include a scheduling model, should this be checked in the machine combiner instead?

MF.getSubtarget().getSchedModel() should work in SelectionDAG. The tricky things here are:
Maybe to start, just try to figure out which targets have "free" shifts, and turn on the optimization for the cases that involve those shifts? (Multiple cores have a small shift optimization, where left shifts of 4 or less don't increase the latency of an add.)

Out of interest, what cases do you have where mul is worse than add+shift + add+shift? From the look of the tv100 scheduling model it would seem to be 3/4 cycles for the mul (depending on whether it is i32 or i64) vs 2+2 for the add+shifts. Are small shifts really free, as in FeatureLSLFast?

According to a new spec that I'm working on, the latency of add+shift is 1 when the shift amount is small. At the same time, I happened to see a TODO in the upstream code :)
-> TODO, shift.
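The latency trade-off discussed in these comments can be sketched as a toy model. Everything here is an assumption for illustration: `CoreLatencies` is a hypothetical struct, not an LLVM scheduling-model type, and the cycle counts used in the usage note are only the figures quoted in the thread.

```cpp
#include <cassert>

// Hypothetical per-core latencies in cycles; not an LLVM data structure.
struct CoreLatencies {
  unsigned mul;            // latency of the integer multiply
  unsigned addShift;       // latency of add with a shifted operand
  unsigned addSmallShift;  // latency when the shift counts as "free"
  unsigned freeShiftMax;   // largest shift amount treated as free
};

// Latency of one add-with-shift for the given shift amount.
unsigned addShiftLatency(const CoreLatencies &c, unsigned shift) {
  assert(c.addSmallShift <= c.addShift && "free shifts cannot be slower");
  return shift <= c.freeShiftMax ? c.addSmallShift : c.addShift;
}

// The second add consumes the result of the first, so latencies add up.
unsigned shiftAddChainLatency(const CoreLatencies &c, unsigned m, unsigned n) {
  return addShiftLatency(c, m) + addShiftLatency(c, n);
}

// True only when the two-add sequence is strictly faster than the mul.
bool shiftAddsBeatMul(const CoreLatencies &c, unsigned m, unsigned n) {
  return shiftAddChainLatency(c, m, n) < c.mul;
}
```

With a 3-cycle mul, 2-cycle add+shift, and 1-cycle adds for shifts of 4 or less, `shiftAddsBeatMul` is true for (m, n) = (2, 3) but false for (5, 6), where the chain costs 2+2 cycles.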
The documentation for LSLFast says that Shifts <= 3 places are fast, which is the limit for most address offsets. Modern cores usually have free shifts <= 4 places. (They tend to have cheap multiplies too, if they can perform fast shifts).
I was considering putting in an LSLFast4 option when I recently enabled LSLFast for Arm cores, but as far as I understand, the LSLFast option currently doesn't apply to Add instructions like it should. We can check that ShiftM1 and ShiftN1 are <= 3 here though, and maybe change the subtarget feature for shifts of 4?
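A minimal sketch of the guard suggested above, assuming a feature that distinguishes fast shifts of up to 3 from fast shifts of up to 4. The `FastShift` enum and both functions are hypothetical stand-ins; they are not the AArch64 subtarget API, and LSLFast4 is only the option floated in this comment.

```cpp
#include <cassert>

// Hypothetical stand-in for the subtarget feature; not LLVM's Subtarget class.
enum class FastShift {
  None,   // no fast-shift feature
  UpTo3,  // models LSLFast as documented: shifts <= 3 are fast
  UpTo4   // models the proposed LSLFast4: shifts <= 4 are fast
};

// Largest shift amount considered fast, or 0 when the feature is absent.
unsigned maxFastShift(FastShift f) {
  switch (f) {
  case FastShift::UpTo3:
    return 3;
  case FastShift::UpTo4:
    return 4;
  case FastShift::None:
    return 0;
  }
  return 0;
}

// Allows the two-add lowering only if both shift amounts are fast.
bool bothShiftsFast(FastShift f, unsigned ShiftM1, unsigned ShiftN1) {
  assert(ShiftM1 > 0 && ShiftN1 > 0 && "expected non-zero shift amounts");
  unsigned limit = maxFastShift(f);
  return ShiftM1 <= limit && ShiftN1 <= limit;
}
```

Under this model a constant such as 5 * 17 (shifts of 2 and 4) would be lowered only when the shift-of-4 variant is enabled.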