This is an archive of the discontinued LLVM Phabricator instance.

[LoongArch] Optimize multiplication with immediates
ClosedPublic

Authored by benshi001 on Mar 31 2023, 12:35 AM.

Details

Summary

Optimize multiplication with some specific immediates to
a pair of alsl.

Diff Detail

Event Timeline

benshi001 created this revision.Mar 31 2023, 12:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 31 2023, 12:35 AM
Herald added a subscriber: hiraditya. · View Herald Transcript
benshi001 requested review of this revision.Mar 31 2023, 12:35 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 31 2023, 12:35 AM
Harbormaster completed remote builds in B222931: Diff 509929.
benshi001 updated this revision to Diff 509931.

Multiplication with some specific immediates can be simplified to a pair of ALSL.

benshi001 updated this revision to Diff 509937.Mar 31 2023, 1:12 AM
xen0n added a comment.EditedMar 31 2023, 2:03 AM

This is good strength reduction overall, thanks for the insight!

Though maybe the exact patterns can be automatically generated with nested loops, for reduced mental burden to prove correctness? I've sketched something and it seems doable:

>>> alsl = lambda j, k, imm: (j << imm) + k
>>> a = [alsl(alsl(1, 1, i), alsl(1, 1, i), j) for j in range(1,5) for i in range(1,5)]
>>> a
[9, 15, 27, 51, 15, 25, 45, 85, 27, 45, 81, 153, 51, 85, 153, 289]
>>> list(sorted(set(a)))
[9, 15, 25, 27, 45, 51, 81, 85, 153, 289]
>>> b = [alsl(alsl(1, 1, i), 1, j) for j in range(1,5) for i in range(1,5)]
>>> b
[7, 11, 19, 35, 13, 21, 37, 69, 25, 41, 73, 137, 49, 81, 145, 273]
>>> ab = set(a).union(set(b))
>>> len(ab)
24
>>> list(sorted(set(a).intersection(set(b))))
[25, 81]

>>> c = [alsl(1, alsl(1, 1, i), j) for j in range(1,5) for i in range(1,5)]
>>> c
[5, 7, 11, 19, 7, 9, 13, 21, 11, 13, 17, 25, 19, 21, 25, 33]
>>> list(sorted(set(c)))
[5, 7, 9, 11, 13, 17, 19, 21, 25, 33]

>>> set(c).difference(ab)
{33, 5, 17}
>>> all = set(c).union(ab).difference({3, 5, 9, 17})
>>> len(all)
24
>>> list(sorted(all))
[7, 11, 13, 15, 19, 21, 25, 27, 33, 35, 37, 41, 45, 49, 51, 69, 73, 81, 85, 137, 145, 153, 273, 289]

So basically, we can strength-reduce a total of 24 different constant-multiplications with two alsl's:

  • case 1: alsl T, X, X, i; alsl Y, T, T, j: 15, 25, 27, 45, 51, 81, 85, 153, 289
  • case 2: alsl T, X, X, i; alsl Y, T, X, j: 7, 11, 13, 19, 21, 25, 35, 37, 41, 49, 69, 73, 81, 137, 145, 273
  • case 3: alsl T, X, X, i; alsl Y, X, T, j: 7, 11, 13, 19, 21, 25, 33

Problem is that there are some overlaps between the 3 possible combinations, and some inside case 1 and 3. If we could somehow avoid producing conflicting rules then probably leveraging TableGen's loop and computation abilities would produce code that's easier to maintain. Otherwise, simplifying the code with some macros could also be beneficial.

llvm/lib/Target/LoongArch/LoongArchInstrInfo.td
865–866

The inner if could be simplified into just return N1C->hasOneUse(). The outer return false could be kept though, for avoiding an overly complex single return expression.

benshi001 updated this revision to Diff 509975.Mar 31 2023, 3:33 AM
benshi001 marked an inline comment as done.

This is good strength reduction overall, thanks for the insight!

Though maybe the exact patterns can be automatically generated with nested loops, for reduced mental burden to prove correctness? I've sketched something and it seems doable:

>>> alsl = lambda j, k, imm: (j << imm) + k
>>> a = [alsl(alsl(1, 1, i), alsl(1, 1, i), j) for j in range(1,5) for i in range(1,5)]
>>> a
[9, 15, 27, 51, 15, 25, 45, 85, 27, 45, 81, 153, 51, 85, 153, 289]
>>> list(sorted(set(a)))
[9, 15, 25, 27, 45, 51, 81, 85, 153, 289]
>>> b = [alsl(alsl(1, 1, i), 1, j) for j in range(1,5) for i in range(1,5)]
>>> b
[7, 11, 19, 35, 13, 21, 37, 69, 25, 41, 73, 137, 49, 81, 145, 273]
>>> ab = set(a).union(set(b))
>>> len(ab)
24
>>> list(sorted(set(a).intersection(set(b))))
[25, 81]

>>> c = [alsl(1, alsl(1, 1, i), j) for j in range(1,5) for i in range(1,5)]
>>> c
[5, 7, 11, 19, 7, 9, 13, 21, 11, 13, 17, 25, 19, 21, 25, 33]
>>> list(sorted(set(c)))
[5, 7, 9, 11, 13, 17, 19, 21, 25, 33]

>>> set(c).difference(ab)
{33, 5, 17}
>>> all = set(c).union(ab).difference({3, 5, 9, 17})
>>> len(all)
24
>>> list(sorted(all))
[7, 11, 13, 15, 19, 21, 25, 27, 33, 35, 37, 41, 45, 49, 51, 69, 73, 81, 85, 137, 145, 153, 273, 289]

So basically, we can strength-reduce a total of 24 different constant-multiplications with two alsl's:

  • case 1: alsl T, X, X, i; alsl Y, T, T, j: 15, 25, 27, 45, 51, 81, 85, 153, 289
  • case 2: alsl T, X, X, i; alsl Y, T, X, j: 7, 11, 13, 19, 21, 25, 35, 37, 41, 49, 69, 73, 81, 137, 145, 273
  • case 3: alsl T, X, X, i; alsl Y, X, T, j: 7, 11, 13, 19, 21, 25, 33

Problem is that there are some overlaps between the 3 possible combinations, and some inside case 1 and 3. If we could somehow avoid producing conflicting rules then probably leveraging TableGen's loop and computation abilities would produce code that's easier to maintain. Otherwise, simplifying the code with some macros could also be beneficial.

Thanks for your suggestion! Using foreach really makes the code more clear!

BTW: ALSL only accepts shift amount 1,2,3,4, the value 5 is not supported.

xen0n added a comment.Mar 31 2023, 3:37 AM

This is good strength reduction overall, thanks for the insight!

Though maybe the exact patterns can be automatically generated with nested loops, for reduced mental burden to prove correctness? I've sketched something and it seems doable:

>>> alsl = lambda j, k, imm: (j << imm) + k
>>> a = [alsl(alsl(1, 1, i), alsl(1, 1, i), j) for j in range(1,5) for i in range(1,5)]
>>> a
[9, 15, 27, 51, 15, 25, 45, 85, 27, 45, 81, 153, 51, 85, 153, 289]
>>> list(sorted(set(a)))
[9, 15, 25, 27, 45, 51, 81, 85, 153, 289]
>>> b = [alsl(alsl(1, 1, i), 1, j) for j in range(1,5) for i in range(1,5)]
>>> b
[7, 11, 19, 35, 13, 21, 37, 69, 25, 41, 73, 137, 49, 81, 145, 273]
>>> ab = set(a).union(set(b))
>>> len(ab)
24
>>> list(sorted(set(a).intersection(set(b))))
[25, 81]

>>> c = [alsl(1, alsl(1, 1, i), j) for j in range(1,5) for i in range(1,5)]
>>> c
[5, 7, 11, 19, 7, 9, 13, 21, 11, 13, 17, 25, 19, 21, 25, 33]
>>> list(sorted(set(c)))
[5, 7, 9, 11, 13, 17, 19, 21, 25, 33]

>>> set(c).difference(ab)
{33, 5, 17}
>>> all = set(c).union(ab).difference({3, 5, 9, 17})
>>> len(all)
24
>>> list(sorted(all))
[7, 11, 13, 15, 19, 21, 25, 27, 33, 35, 37, 41, 45, 49, 51, 69, 73, 81, 85, 137, 145, 153, 273, 289]

So basically, we can strength-reduce a total of 24 different constant-multiplications with two alsl's:

  • case 1: alsl T, X, X, i; alsl Y, T, T, j: 15, 25, 27, 45, 51, 81, 85, 153, 289
  • case 2: alsl T, X, X, i; alsl Y, T, X, j: 7, 11, 13, 19, 21, 25, 35, 37, 41, 49, 69, 73, 81, 137, 145, 273
  • case 3: alsl T, X, X, i; alsl Y, X, T, j: 7, 11, 13, 19, 21, 25, 33

Problem is that there are some overlaps between the 3 possible combinations, and some inside case 1 and 3. If we could somehow avoid producing conflicting rules then probably leveraging TableGen's loop and computation abilities would produce code that's easier to maintain. Otherwise, simplifying the code with some macros could also be beneficial.

Thanks for your suggestion! Using foreach really makes the code more clear!

My pleasure. ;-)

BTW: ALSL only accepts shift amount 1,2,3,4, the value 5 is not supported.

It's just one of the Python idiosyncrasies: range(1, 5) really yields 1, 2, 3, 4. Just like how for (int i = 1; i < 5; i++) is the same in C.

Thanks for the improvements.
So this is only for the case 2 mentioned by @xen0n, right? Seems that the test for 81 is missing.

Will case 1 and case 3 be handled later?

benshi001 updated this revision to Diff 510162.Mar 31 2023, 7:26 PM

Thanks for the improvements.
So this is only for the case 2 mentioned by @xen0n, right? Seems that the test for 81 is missing.

Will case 1 and case 3 be handled later?

Thanks for your comments.

  1. The missing case 81 is added.
  2. I will implement case 1 and case 3 later in another patch. BTW: Total amount of these two cases is small, and they even have redundant values with case 2, so how about implementing them without foreach, just standalone Pat one by one ?
benshi001 updated this revision to Diff 510164.Mar 31 2023, 7:46 PM
xen0n added a comment.Mar 31 2023, 9:42 PM

Thanks for the improvements.
So this is only for the case 2 mentioned by @xen0n, right? Seems that the test for 81 is missing.

Will case 1 and case 3 be handled later?

Thanks for your comments.

  1. The missing case 81 is added.
  2. I will implement case 1 and case 3 later in another patch. BTW: Total amount of these two cases is small, and they even have redundant values with case 2, so how about implementing them without foreach, just standalone Pat one by one ?

As for point 2, fine by me. Case 1 would still have many remaining constants so a macro would go a long way (you could only go over the distinct Imm1 and Imm2 and auto-compute the source constant as you did for Case 2), and only 33 would be left for Case 3 so you may write the pattern straight-forward.

Thanks for the improvements.
So this is only for the case 2 mentioned by @xen0n, right? Seems that the test for 81 is missing.

Will case 1 and case 3 be handled later?

Thanks for your comments.

  1. The missing case 81 is added.
  2. I will implement case 1 and case 3 later in another patch. BTW: Total amount of these two cases is small, and they even have redundant values with case 2, so how about implementing them without foreach, just standalone Pat one by one ?

As for point 2, fine by me. Case 1 would still have many remaining constants so a macro would go a long way (you could only go over the distinct Imm1 and Imm2 and auto-compute the source constant as you did for Case 2), and only 33 would be left for Case 3 so you may write the pattern straight-forward.

Thanks for your suggestion. I will do in my next patch.

benshi001 edited the summary of this revision. (Show Details)Apr 1 2023, 12:53 AM
xen0n accepted this revision.Apr 1 2023, 1:12 AM
This revision is now accepted and ready to land.Apr 1 2023, 1:12 AM

Thanks for the improvements.
So this is only for the case 2 mentioned by @xen0n, right? Seems that the test for 81 is missing.

Will case 1 and case 3 be handled later?

Thanks for your comments.

  1. The missing case 81 is added.
  2. I will implement case 1 and case 3 later in another patch. BTW: Total amount of these two cases is small, and they even have redundant values with case 2, so how about implementing them without foreach, just standalone Pat one by one ?

As for point 2, fine by me. Case 1 would still have many remaining constants so a macro would go a long way (you could only go over the distinct Imm1 and Imm2 and auto-compute the source constant as you did for Case 2), and only 33 would be left for Case 3 so you may write the pattern straight-forward.

Unfortanately the case 3 you mentioned can not be implemented, besides other duplicates to case 2, the remaining immediate x * 33 will be optimized to (x << 5) + x.