
Constant folding sometimes creates large constants even though a smaller constant is used multiple times
Needs Revision · Public

Authored by ramred01 on Mar 28 2019, 4:53 AM.

Details

Reviewers
spatel
lebedev.ri
Summary

Sometimes the constant folding pass creates a large constant from a smaller one, even though the smaller constant is used multiple times. In that case it is better to materialize the small constant once and reuse it for all of its uses, rather than creating a new large constant and materializing both constants into two registers.

The simple test case is as follows:

int test(int A, int B) {
  int t = A * B >> 8;
  return ((t <= 4096) ? t : 4096);
}

Here, the right shift by 8 is folded into the 4096 for the compare. Thus the generated code materializes both 4096 and 4096 << 8.
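For reference, the fold is a pure change of constant: with an arithmetic right shift, (x >> 8) <= 4096 is equivalent to x < 4097 << 8 = 1048832, the constant that shows up in the IR quoted later in this thread. A minimal C sanity check of that equivalence (assuming >> on a signed int is an arithmetic shift, which is implementation-defined in C but true on common compilers):

```c
/* Original form: shift, then clamp against 4096. */
static int clamp_shifted(int x) {
    int t = x >> 8;                      /* assumes arithmetic shift */
    return (t <= 4096) ? t : 4096;
}

/* Folded form: the compare uses 1048832 = 4097 << 8 and no longer
   depends on the shift result. */
static int clamp_folded(int x) {
    return (x < 1048832) ? (x >> 8) : 4096;
}
```

Both functions return the same value for every input, including negatives, which is why the middle-end considers the fold safe.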

We add a new pass that analyzes the number of times a given constant is used. If a constant is used more than once, we prevent folding on that constant, as it is better to materialize it once and reuse it.
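As a toy illustration of the heuristic described above (not the actual patch, whose LLVM data structures are omitted here), the decision reduces to a use count over the constants referenced by a function:

```c
#include <stddef.h>

/* Hypothetical, simplified model: the constants referenced by a
   function's instructions, flattened into an array. */
static int count_uses(const int *consts, size_t n, int c) {
    int uses = 0;
    for (size_t i = 0; i < n; ++i)
        if (consts[i] == c)
            ++uses;
    return uses;
}

/* Fold only single-use constants; a multiply-used constant is cheaper
   to materialize once and reuse than to duplicate in folded form. */
static int may_fold(const int *consts, size_t n, int c) {
    return count_uses(consts, n, c) <= 1;
}
```

In the test case above, 4096 appears in both the compare and the select, so `may_fold` would reject folding it into a new constant.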

Diff Detail

Event Timeline

ramred01 created this revision. Mar 28 2019, 4:53 AM

This sounds like a back-end problem.
Have you considered/tried solving it there, instead of limiting the middle-end?


This folding happens in the middle-end itself; the LLVM IR already has the folded value. Hence we thought it best to handle it there, as this should affect multiple architectures where materializing constants can take two or more instructions.


This folding is happening in the middle-end itself. The LLVM IR already has the folded value.

That is precisely my point. Is that folding not optimal for the IR?

Hence we thought it best to handle it there as this should be affecting multiple architectures where materializing constants can take two or more instructions.

Have you considered the case where you *already* had that IR?
Even though the middle-end did not 'degrade' it, it will still not be handled by back-ends 'properly', right?


That is precisely my point. Is that folding not optimal for the IR?

That folding would be optimal for the IR if the constant were not reused. If the constant is reused, most architectures do better materializing one constant in a register and reusing it, rather than materializing two constants. Most architectures require two instructions to materialize a 32-bit constant. Even counting the shift operation that now has to be executed, with just one reuse the unfolded form generates one instruction fewer; with more reuses, each of them folded, the savings grow.


Have you considered the case where you *already* had that IR?
Even though the middle-end did not 'degrade' it, it will still not be handled by back-ends 'properly', right?

Yes, you are right. What if I had that IR itself to deal with in the backend? There are two possibilities. One is when one constant is obviously related to another by a single simple arithmetic operation; those cases might be easier to decipher. But what if a compare canonicalization pass in the interim has changed the constant? Then the relationship is no longer obvious, and we either miss it or have to employ some more complex mechanism to undo the folding.

Instead, isn't it better to prevent the folding in the first place, when we know that for most architectures it will cause more instructions to be generated? There could be benefits for a few architectures in the current situation, but I am not aware of them.


I think you're mixing up canonicalization with optimization (which I've also done a bunch). The goal in instcombine is to produce canonical code, that is, to reduce the infinite patterns that might have existed in the source to a smaller set of easier-to-optimize forms. Canonical code is often identical to optimal code, but sometimes they diverge. That's when things get harder (and we might end up reversing an earlier transform). But we have to deal with it in the backend because, as Roman noted, any solution at this level would be incomplete: the input might already be in the form that you don't want.


I get your point and fully agree with it.

But if a certain canonicalization always results in a form that requires reversing the transform at a later stage, wouldn't we be better off not performing that transform in the first place? A canonicalization that exposes better optimization opportunities across architectures makes more sense, doesn't it?


Is it *always*, though?
If I understand correctly, you want to block this transform, right?
https://godbolt.org/z/IvWYKl

we have essentially replaced

%4 = ashr i32 %3, 8
%5 = icmp slt i32 %4, 4096

with

%5 = icmp slt i32 %3, 1048832

That icmp no longer depends on the ashr, so we have shortened the data dependency chain, and the icmp can execute without waiting for the ashr.

Which, as can be seen from the lowest (llvm-mca) view there, unsurprisingly improves performance (assuming I fixed up the asm for mca correctly).

So i'm not convinced that this hammer approach is the right solution to the problem.

ramred01 added a comment (edited). Mar 29 2019, 5:46 AM


Yes, that is exactly what I am trying to block.

What I have missed out is that this should apply only at -Os and -Oz. When performance is the criterion, the folding is perfectly fine. But when size is the criterion, the problem is the same across architectures, since you cannot materialize a 32-bit constant in fewer than two instructions, the only exceptions being powers of two on certain architectures. So, if we fold a constant that has multiple uses and create a new constant for each such use, each new constant adds two instructions to the size.
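The two-instruction claim can be sketched with a rough RISC-V-style cost model (lui/addi materialization; the function and thresholds here are illustrative, not taken from the patch). Under this model the original constant 4096 fits a single lui, while the folded 1048832 (0x100100, nonzero low 12 bits) needs lui plus addi:

```c
#include <stdint.h>

/* Rough RISC-V-flavoured materialization cost for a 32-bit constant:
   1 instruction if it fits a sign-extended 12-bit addi immediate or a
   bare lui (low 12 bits zero), otherwise lui + addi = 2 instructions. */
static int materialization_cost(int32_t v) {
    if (v >= -2048 && v <= 2047)
        return 1;               /* addi rd, x0, imm */
    if ((v & 0xFFF) == 0)
        return 1;               /* lui rd, imm20 */
    return 2;                   /* lui + addi */
}
```

So at -Os, reusing the already-materialized 4096 is never worse than introducing a second, larger constant, and is usually strictly smaller.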

In fact, we can make the entire analysis and decision making conditional on -Os and -Oz, so that it does not affect the performance-oriented optimization path.

ramred01 updated this revision to Diff 193855. Apr 5 2019, 7:28 AM

Updated the patch with checks that the optimization level is -Os or -Oz.

Now, if the compilation is not optimizing for size, the code added by this patch is not executed at all.

lebedev.ri requested changes to this revision. Apr 6 2019, 8:04 AM

This still feels incredibly intrusive.
At this stage, you don't know whether you will have issues with the immediates you would get.
It is also extremely pessimistic, as it bans the creation of *all* new constants.
It does not attempt to model whether the new immediate would or would not be a problem.

TL;DR: I believe this needs a cost/benefit analysis (a test-suite code-size and performance change report) plus an RFC on llvm-dev.

This revision now requires changes to proceed. Apr 6 2019, 8:04 AM