Page MenuHomePhabricator

[InstCombine] Eliminate casts to optimize ctlz operation
Needs ReviewPublic

Authored by datta.nagraj on Sun, Jun 6, 11:53 PM.

Details

Summary

If a ctlz operation is performed on higher datatype and then
downcasted, then this can be optimized by doing a ctlz operation
on a lower datatype and adding the difference bitsize to the result
of ctlz to provide the same output:

https://alive2.llvm.org/ce/z/8uup9M

The original problem is shown in
https://llvm.org/PR50173

Diff Detail

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
datta.nagraj requested review of this revision.Sun, Jun 6, 11:53 PM
Herald added a project: Restricted Project. · View Herald TranscriptSun, Jun 6, 11:53 PM
RKSimon added inline comments.
llvm/test/Transforms/InstCombine/zext-ctlz-trunc-to-ctlz-add.ll
11

Drop the ;

13

please can you use descriptive test names - @src0 etc. don't give themselves to searching (or when variant tests get added in the middle of the file...).

@trunc_ctlz_zext_i32 etc. would be better

71

We need vector type coverage as well - at least fixed vectors, scalable vectors as well would be a bonus

declare <2 x i32> @llvm.ctlz.v2i32 (<2 x i32>, i1)

declare <vscale x 2 x i32> @llvm.ctlz.v2i32 (<vscale x 2 x i32>, i1)

RKSimon added inline comments.Mon, Jun 7, 1:18 AM
llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
954

We will need to add multiuse tests as well to check that that the m_OneUse is working correctly

Updated the unit tests

datta.nagraj marked 4 inline comments as done.Mon, Jun 7, 4:08 AM

Added more tests as per review comments.

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
954

Done. Added 3 tests at the end which have multiple uses.

llvm/test/Transforms/InstCombine/zext-ctlz-trunc-to-ctlz-add.ll
71

Done. Added tests with vector and scalable vectors.

71

Done. Added tests with vector and scalable vectors.

datta.nagraj marked 2 inline comments as done.Mon, Jun 7, 4:09 AM

@RKSimon I have addressed the review comments Sir, please have a look.

spatel added inline comments.Mon, Jun 7, 11:55 AM
llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
959

We usually prefer to do something like:

Value *NarrowCtlz = Builder.CreateIntrinsic(...);
return BinaryOperator::CreateAdd(NarrowCtlz, WidthDiff);

The instcombine caller function then handles the replace uses and transfers the name of the existing value to the new value, so you don't have to do that explicitly. You should see a cosmetic (but not functional) difference if you regenerate the CHECK lines in the test files with that change.

Remove replaceAllUses manually, and let LLVM do it

datta.nagraj marked an inline comment as done.Mon, Jun 7, 10:07 PM
datta.nagraj added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
959

Done. Actually, I had refered the visitAdd function to see how to match intrinsic patterns, and there they were using the replaceInstUsesWith, but the approach you suggested looks clean. Made the changes.

datta.nagraj marked an inline comment as done.Mon, Jun 7, 10:08 PM

Clang Format

spatel added inline comments.Tue, Jun 8, 10:41 AM
llvm/test/Transforms/InstCombine/zext-ctlz-trunc-to-ctlz-add.ll
53

There are a lot of tests here that don't provide much extra coverage. These are only changing the types to confirm that we get the add constant correct? I think we can verify that with 2 tests of varying types (including a vector type), but we can do better by using a weird type (for example 3 x i17). We can also get more value by varying the 2nd parameter (make it true half of the time?).

215

Typically, we make the extra use minimal and more plausible by adding a call void @use(i32 %p) instruction.
Also, what happens if the zext has an extra use?

vdsered added a subscriber: vdsered.Tue, Jun 8, 7:28 PM

Updated unit test and added condition for multiple use check for zext

datta.nagraj marked 2 inline comments as done.Tue, Jun 8, 10:19 PM

@spatel Addressed review comments.

llvm/test/Transforms/InstCombine/zext-ctlz-trunc-to-ctlz-add.ll
53

Removed redundant tests. (Yes, they were checking for the add constant)
Added weird type tests.
Varied the 2nd parameter as true for half of them.

215

Made use of the function call. That's a nice idea, didn't think of that.
Added condition check for extra use of zext and added a test case for that as well.

datta.nagraj marked 2 inline comments as done.Tue, Jun 8, 10:19 PM
spatel added inline comments.Wed, Jun 9, 8:41 AM
llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
955

I don't think we need this one-use check on the zext. Usually, we say that if we are not increasing the instruction count, but we are reducing the sequence of computation, it's ok to do the transform.

This will be easier to see if we commit the tests with the baseline CHECKs, so I pushed those to main:
9eef6e39816a

Please rebase and regenerate the CHECK lines (and see if it's ok to remove the 2nd one-use check).

lebedev.ri requested changes to this revision.Wed, Jun 9, 9:31 AM

I think this is being approached from the wrong angle.
You currently transform trunc (ctlz(zext(A))) --> add(ctlz(A), (bitwidth(zext(A))-bitwidth(A)),
but why does that trunc matter?
Wouldn't it make more sense to view this as ctlz(zext(A)) --> add(ctlz(A), (bitwidth(zext(A))-bitwidth(A))?

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
954

Are you not missing a check that types of A and Trunc match?

This revision now requires changes to proceed.Wed, Jun 9, 9:31 AM

@lebedev.ri In this opt: ctlz(zext(A)) --> add(ctlz(A), (bitwidth(zext(A))-bitwidth(A))

I think we require to look at the trunc because, the ctlz in the LHS is operating on higher datatype and the one on RHS is operating on a lower datatype.
If we pass the output of RHS to trunc, that would be wrong, as the datatype of ADD operation and trunc operation would be same. So, we do need to remove the trunc, isn't it.

Am I missing something here, please suggest?

Right, it of course should be https://alive2.llvm.org/ce/z/wuRBBs

Note that this only works if the scalar bit width of original un-extended %x is at least 6: https://alive2.llvm.org/ce/z/5ThC64
so i think we actually want to do https://alive2.llvm.org/ce/z/ZtZm4w

@lebedev.ri Sir, but if we don't take trunc into consideration, then how are we optimizing this?
Earlier it was 2 insts (zext - ctlz), and now its 3 insts(ctlz - add - zext). I agree that the ctlz , add are done on a lower datatype.
If we consider trunc as well, then we can convert 3 insts (zext - ctlz - trunc) to 2 insts (ctlz - add).
Shouldn't we do this opt only when the trunc is present?

@lebedev.ri Sir, but if we don't take trunc into consideration, then how are we optimizing this?
Earlier it was 2 insts (zext - ctlz), and now its 3 insts(ctlz - add - zext). I agree that the ctlz , add are done on a lower datatype.
If we consider trunc as well, then we can convert 3 insts (zext - ctlz - trunc) to 2 insts (ctlz - add).
Shouldn't we do this opt only when the trunc is present?

In general - yes, in instcombine we should not increase the instruction count,
however as it was already hinted by @spatel, this seems like a rare edge case
where we should be okay with that. (unless @spatel disagrees?)

@lebedev.ri Sir, but if we don't take trunc into consideration, then how are we optimizing this?
Earlier it was 2 insts (zext - ctlz), and now its 3 insts(ctlz - add - zext). I agree that the ctlz , add are done on a lower datatype.
If we consider trunc as well, then we can convert 3 insts (zext - ctlz - trunc) to 2 insts (ctlz - add).
Shouldn't we do this opt only when the trunc is present?

In general - yes, in instcombine we should not increase the instruction count,
however as it was already hinted by @spatel, this seems like a rare edge case
where we should be okay with that. (unless @spatel disagrees?)

This could go either way, but I'd prefer that we have this more conventional (don't increase instructions) transform as a first step. If there's evidence that we benefit from the more general (no trunc required) transform, then we can always follow-up.
Definitely agree that we need to check for matching source and destination types to make this patch sound (and add a negative test to verify that).

Check if size of zext src and trunc dst are same and above 5 bits

datta.nagraj marked 2 inline comments as done.Thu, Jun 10, 8:12 AM
datta.nagraj added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
954

Added the check and added a negative test for that as well.

955

Rebased on top of your change. Yes, seems that we don't need the second check. The test looks fine even without the second check.

datta.nagraj marked 2 inline comments as done.Thu, Jun 10, 8:12 AM

@lebedev.ri Sir, but if we don't take trunc into consideration, then how are we optimizing this?
Earlier it was 2 insts (zext - ctlz), and now its 3 insts(ctlz - add - zext). I agree that the ctlz , add are done on a lower datatype.
If we consider trunc as well, then we can convert 3 insts (zext - ctlz - trunc) to 2 insts (ctlz - add).
Shouldn't we do this opt only when the trunc is present?

In general - yes, in instcombine we should not increase the instruction count,
however as it was already hinted by @spatel, this seems like a rare edge case
where we should be okay with that. (unless @spatel disagrees?)

This could go either way, but I'd prefer that we have this more conventional (don't increase instructions) transform as a first step. If there's evidence that we benefit from the more general (no trunc required) transform, then we can always follow-up.
Definitely agree that we need to check for matching source and destination types to make this patch sound (and add a negative test to verify that).

Added a check to see that the source and destination types match. Added the negative test as well to verify that.
Also added a check to do this opt only for width more than 5 as @lebedev.ri Sir, has pointed out that this won't work for bitwidth below 6 and would require an additional zext.

Sorry, to ask this here, but this is my first time here and want to ask a lame question. Who is supposed to mark the review comments as done, the developer or the reviewer (really confused in this.)

If we look at the sequence with trunc, then preconditions are different.
I'm not sure if we have *some* problematic combination of src bitwidth and bitwidth increase,
but the hard cut-off that is currently there is not correct.

If we look at the sequence with trunc, then preconditions are different.
I'm not sure if we have *some* problematic combination of src bitwidth and bitwidth increase,
but the hard cut-off that is currently there is not correct.

I am currently not sure of what to do here Sir, also I have no clue as to why some unit tests are failing, because they don't even contain trunc inst, and don't even hit my changes.

If we look at the sequence with trunc, then preconditions are different.
I'm not sure if we have *some* problematic combination of src bitwidth and bitwidth increase,
but the hard cut-off that is currently there is not correct.

It seems like we could go wrong if the narrow (destination) type can't hold at least the bitwidth of the wide type -- that's what we saw in the Alive2 example of the transform without the trunc -- but I haven't come up with an example where it fails when we have the trunc:
https://alive2.llvm.org/ce/z/89_wJb

I am currently not sure of what to do here Sir, also I have no clue as to why some unit tests are failing, because they don't even contain trunc inst, and don't even hit my changes.

I don't understand this comment. What is failing?

Add a check to see that the source and dst size are same.

Removed the hard cut off check. Please review Sir.

If we look at the sequence with trunc, then preconditions are different.
I'm not sure if we have *some* problematic combination of src bitwidth and bitwidth increase,
but the hard cut-off that is currently there is not correct.

It seems like we could go wrong if the narrow (destination) type can't hold at least the bitwidth of the wide type -- that's what we saw in the Alive2 example of the transform without the trunc -- but I haven't come up with an example where it fails when we have the trunc:
https://alive2.llvm.org/ce/z/89_wJb

I am currently not sure of what to do here Sir, also I have no clue as to why some unit tests are failing, because they don't even contain trunc inst, and don't even hit my changes.

I don't understand this comment. What is failing?

The build had failed with that last commit, it might have been due to some other commit in main. The build is passing now after I did rebase.

Removed the hard cut off check. Please review Sir.

The hard check was definitely not right, but have you proven that a log2-of-bitwidth check is not necessary? The original Alive allowed for that kind of proof based on value type widths.
If we can't prove it, then I would include that check in this patch to be safe.
I don't know what the motivating source code looks like. It seems unlikely that we would ever truncate to a type smaller than log2 of the wide type in real code, but we can't rule it out, so it still deserves a regression test.

Removed the hard cut off check. Please review Sir.

The hard check was definitely not right, but have you proven that a log2-of-bitwidth check is not necessary? The original Alive allowed for that kind of proof based on value type widths.
If we can't prove it, then I would include that check in this patch to be safe.
I don't know what the motivating source code looks like. It seems unlikely that we would ever truncate to a type smaller than log2 of the wide type in real code, but we can't rule it out, so it still deserves a regression test.

Hi @spatel Sir, doesn't https://alive2.llvm.org/ce/z/89_wJb prove that this works for even type smaller than log2 of the wide type ? I can add the same test to the unit test, but I am in doubt whether to add it as a check, since the example here proves that it works for difference of bitwidth of more than log 2 as well. I tested with 63, 2 sizes as well, and that works too. Please suggest if we require to add the log2 check, since these are working fine from the above examples.

Add a test with bitwidth difference of more than log2 between zext src and trunc dst

Hi @spatel Sir, doesn't https://alive2.llvm.org/ce/z/89_wJb prove that this works for even type smaller than log2 of the wide type ? I can add the same test to the unit test, but I am in doubt whether to add it as a check, since the example here proves that it works for difference of bitwidth of more than log 2 as well. I tested with 63, 2 sizes as well, and that works too. Please suggest if we require to add the log2 check, since these are working fine from the above examples.

That test proves that the transform is correct for that pair of types exactly. It does not prove that the transform is correct for all pairs of types. If you can prove *why* truncating the addition constant will always work, then we can proceed without the additional check. If not, please add the extra check and mark the test with a 'TODO' comment that references what we have discussed here.

I'd rather be safe than cause a miscompile on some pair of types that we did not think of (and then potentially have the whole patch reverted). :)

Do the opt only if the difference in bitwidth of zext src and dst is less than log2

Hi @spatel Sir, doesn't https://alive2.llvm.org/ce/z/89_wJb prove that this works for even type smaller than log2 of the wide type ? I can add the same test to the unit test, but I am in doubt whether to add it as a check, since the example here proves that it works for difference of bitwidth of more than log 2 as well. I tested with 63, 2 sizes as well, and that works too. Please suggest if we require to add the log2 check, since these are working fine from the above examples.

That test proves that the transform is correct for that pair of types exactly. It does not prove that the transform is correct for all pairs of types. If you can prove *why* truncating the addition constant will always work, then we can proceed without the additional check. If not, please add the extra check and mark the test with a 'TODO' comment that references what we have discussed here.

I'd rather be safe than cause a miscompile on some pair of types that we did not think of (and then potentially have the whole patch reverted). :)

Done. Agree to all your points Sir. :)

datta.nagraj added inline comments.Sat, Jun 12, 1:25 AM
llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
957

I am not sure if the 32 should be hardcoded here, or there is some other way for creating the APInt out of SrcWidth.
Please guide here sir. @spatel

spatel added inline comments.Sun, Jun 13, 4:59 AM
llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
957

You don't need to create an APInt; see "Log2_32()" in MathExtras.h.

Use Log2_32 function instead of APInt

datta.nagraj marked an inline comment as done.Sun, Jun 13, 6:31 AM
datta.nagraj added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
957

Done.

datta.nagraj marked an inline comment as done.Sun, Jun 13, 6:31 AM
spatel accepted this revision.Sun, Jun 13, 4:12 PM

LGTM, but let's see if there are any more comments or concerns.

spatel added inline comments.Sun, Jun 13, 4:14 PM
llvm/test/Transforms/InstCombine/zext-ctlz-trunc-to-ctlz-add.ll
71

This comment should be removed or updated. The transform was enabled for this case.

Update comment for multiple use of zext test case

datta.nagraj marked an inline comment as done.Sun, Jun 13, 7:01 PM