This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Prevent memcpy generation for small data size
AcceptedPublic

Authored by hiraditya on Jul 5 2017, 2:22 PM.

Details

Summary

The InstCombine pass lowers @llvm.memcpy either to a call to memcpy or to a load/store pair, depending on whether the data size exceeds 8 bytes.
However, a fixed 8-byte threshold is not right for every target.
Hence, we query the target datalayout to determine the threshold.
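The thresholding the summary describes can be sketched as a standalone model (this is not the actual InstCombine code; the function names and the 8-byte fallback for a missing datalayout are illustrative, based on the discussion below):

```cpp
#include <cstdint>

// Derive the single load/store expansion limit from the target's largest
// legal integer type, instead of a hardcoded 8-byte constant.
static uint64_t maxExpansionBytes(unsigned LargestLegalIntBits) {
  // When no datalayout is available, getLargestLegalIntTypeSizeInBits()
  // returns 0; fall back to the pre-patch default of 8 bytes.
  if (LargestLegalIntBits == 0)
    return 8;
  return LargestLegalIntBits / 8; // bits -> bytes
}

// Expand @llvm.memcpy to a load/store pair only when the copy fits in a
// single legal integer register.
static bool shouldExpandToLoadStore(uint64_t SizeBytes,
                                    unsigned LargestLegalIntBits) {
  return SizeBytes <= maxExpansionBytes(LargestLegalIntBits);
}
```

For example, a 16-byte copy would be expanded on a target whose largest legal integer is i128, but left as a memcpy call on a 64-bit-integer target.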

Worked in collaboration with Aditya Kumar

Diff Detail

Event Timeline

DIVYA created this revision.Jul 5 2017, 2:22 PM
DIVYA updated this revision to Diff 105339.Jul 5 2017, 2:29 PM

Updated array2.ll testcase file

spatel added inline comments.Jul 6 2017, 11:29 AM
lib/Transforms/InstCombine/InstCombineCalls.cpp
195–204

This code comment isn't correct any more.

199–200

Why 32? I'm not sure what it means if there are no legal types in the DL. Please add a code comment to explain.

test/Transforms/InstCombine/builtin_memcpy_pattern.ll
1 ↗(On Diff #105339)
  1. The test should be limited to -instcombine, not -O2.
  2. The tests here have way too much extraneous stuff. Please minimize to just the IR necessary to show the transform.
  3. You can check the tests in ahead of this patch to show the current IR, then this patch will just show the diffs from the existing code.
  4. Please use the script at utils/update_test_checks.py to auto-generate the complete CHECK lines.
hiraditya added inline comments.
lib/Transforms/InstCombine/InstCombineCalls.cpp
199–200

Why 32?

32 because we want to default to 8, same as the previous behavior before this patch.

I'm not sure what it means if there are no legal types in the DL. Please add a code comment to explain.

For example, in test cases when the target datalayout is not specified, DL.getLargestLegalIntTypeSizeInBits() returns 0.

DIVYA updated this revision to Diff 105643.EditedJul 7 2017, 7:36 AM
  • removed some comments
  • added testcase with different datalayout
DIVYA marked 2 inline comments as done.Jul 25 2017, 11:31 AM
spatel added inline comments.Jul 31 2017, 2:16 PM
lib/Transforms/InstCombine/InstCombineCalls.cpp
184–185

This comment is wrong now? I'm still confused about the current behavior and what this patch is changing.

Did you see my earlier comment to check in minimal tests to trunk before this patch so we have a baseline view of the current behavior? Do the tests really need loops, attributes, globals, etc?

If there are no legal types in the DL, why wouldn't we just bail out rather than assuming that 32-bit and smaller is safe/desirable to transform?

DIVYA updated this revision to Diff 109978.Aug 7 2017, 6:48 AM
  • Added a diff for the output before and after applying the patch
  • Removed some comments
DIVYA updated this revision to Diff 109979.Aug 7 2017, 6:51 AM
DIVYA marked an inline comment as done.Aug 7 2017, 6:56 AM
DIVYA added inline comments.
lib/Transforms/InstCombine/InstCombineCalls.cpp
184–185

The test contains loops to show that a builtin memcpy inside a loop will also be converted to either a memcpy call or load/store operations, depending on the maximum allowed stores.

DIVYA marked an inline comment as done.Aug 7 2017, 6:57 AM
DIVYA updated this revision to Diff 109989.Aug 7 2017, 7:27 AM
DIVYA added a comment.EditedAug 9 2017, 1:29 PM

Diff for test/Transforms/InstCombine/builtin_memcpy_patterns.ll

TESTCASE1
Before applying the patch

define void @foo(i8* %a, i8* %b) local_unnamed_addr #0 {
entry:
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* %a, i8* %b, i64 16, i32 1, i1 false)
  ret void
}

After applying the patch

define void @foo(i8* %a, i8* %b) local_unnamed_addr #0 {
entry:
  %0 = bitcast i8* %b to i128*
  %1 = bitcast i8* %a to i128*
  %2 = load i128, i128* %0, align 1
  store i128 %2, i128* %1, align 1
  ret void
}

TESTCASE2

Before applying the patch

%0 = bitcast i32* %add.ptr to i8*
%add.ptr3 = getelementptr inbounds [16 x i32], [16 x i32]* @b, i64 0, i64 %idx.ext
%1 = bitcast i32* %add.ptr3 to i8*
call void @llvm.memcpy.p0i8.p0i8.i64(i8* %0, i8* %1, i64 16, i32 4, i1 false)
%inc = add nsw i32 %i.0, 1

After applying the patch

%0 = bitcast i32* %add.ptr3 to i128*
%1 = bitcast i32* %add.ptr to i128*
%2 = load i128, i128* %0, align 16
store i128 %2, i128* %1, align 4
DIVYA updated this revision to Diff 110467.Aug 9 2017, 1:33 PM
spatel edited edge metadata.Aug 10 2017, 8:51 AM

This patch as uploaded makes even less sense to me than before:

  1. There are no tests to show the diff.
  2. You've changed the code rather than an assertion in an unrelated test to show a functional difference?

I've checked in some extra tests where I think your patch will fire here:
https://reviews.llvm.org/rL310611

Please update this patch to include that and then explain the intent of each diff.

As I mentioned before, you can use a script to update the check lines (please don't hand-edit them):
$ path/to/update_test_checks.py --opt=path/to/build/bin/opt memcpy-to-load.ll

DIVYA updated this revision to Diff 110787.Aug 11 2017, 12:32 PM
  • Updated patch rL310611 and added the diff
  • added a new testcase
spatel added inline comments.Aug 11 2017, 1:36 PM
test/Transforms/InstCombine/builtin_memcpy_patterns.ll
3–6

I still don't understand what value these tests provide over the existing ones. These are the same llvm.memcpy calls except they are in loops? The transform doesn't depend on loops from what I can tell, so why do we need to test this?

test/Transforms/InstCombine/memcpy-to-load.ll
81

That is the existing behavior, but why is that valid now that we're using the datalayout?

93

You're mixing bytes and bits here. It makes sense to me that we would use the datalayout to drive this transform, but why are we then ignoring the datalayout and producing load + store of an illegal type (i128)?

joerg added a subscriber: joerg.Aug 12 2017, 2:18 AM

What is the advantage of expanding the memcpy intrinsic in InstCombine vs doing it later in the target-specific code?

What is the advantage of expanding the memcpy intrinsic in InstCombine vs doing it later in the target-specific code?

I can't answer for 'memcpy' directly, but probably related - I'm looking at cases where late 'memcmp' expansion causes us to miss optimizations in:
https://bugs.llvm.org/show_bug.cgi?id=34032#c13

instcombine is definitely not the IR pass for 'memcmp' expansion. (It's complicated enough to be its own pass, but currently it's a pass-within-a-pass of CGP.)

We could make the argument that the post-IR backend should handle those cases, but I think it would require the collective powers of gvn, cse, instcombine, and simplifycfg.

joerg added a comment.Aug 12 2017, 7:18 AM

@spatel: I don't see a reason why we can't (or shouldn't) try to do common-prefix elimination for the memcmp intrinsic. It certainly seems to be better to me to preserve the intrinsics in your case as they should be easier to reason about. That's kind of my question for here too -- why does the expansion allow better code?

@spatel: I don't see a reason why we can't (or shouldn't) try to do common-prefix elimination for the memcmp intrinsic. It certainly seems to be better to me to preserve the intrinsics in your case as they should be easier to reason about. That's kind of my question for here too -- why does the expansion allow better code?

Thanks, @joerg . I can't see any reason not to do prefix elimination either. I'm not sure if that will solve everything for that case, but it should make it better. I'll give that a try.

DIVYA updated this revision to Diff 110982.Aug 14 2017, 8:42 AM
DIVYA marked an inline comment as done.
DIVYA added inline comments.Aug 14 2017, 8:44 AM
test/Transforms/InstCombine/memcpy-to-load.ll
81

In test cases where the target datalayout is not specified, DL.getLargestLegalIntTypeSizeInBits() returns 0. For those cases it will keep the previous behavior from before the patch.

DIVYA updated this revision to Diff 111029.Aug 14 2017, 10:27 AM
craig.topper added inline comments.Sep 16 2018, 6:00 PM
lib/Transforms/InstCombine/InstCombineCalls.cpp
195–204

Is there an extra blank line here?

199–200

But the previous behavior allowed 8 bytes so shouldn't it be 64?

hiraditya added inline comments.Sep 16 2018, 11:54 PM
lib/Transforms/InstCombine/InstCombineCalls.cpp
195–204

We'll fix this.

199–200

It's been a while when we submitted the patch but I think you're right unless @DIVYA has some comments. We'll make the changes.

I would quickly like to check the status of this patch: do you have plans to continue this work? If not, I would like to pick it up.

I think this patch is good to go, I can push this if someone accepts. I'll fix the comments.

I think this patch is good to go, I can push this if someone accepts. I'll fix the comments.

I still don't understand the (LargestInt == 0) hack. I think everyone agrees that the existing code is wrong, so why preserve the existing behavior if we don't have a valid datalayout? Ie, there's no risk for real targets because they always have a non-zero getLargestLegalIntTypeSizeInBits()?

I think this patch is good to go, I can push this if someone accepts. I'll fix the comments.

I still don't understand the (LargestInt == 0) hack. I think everyone agrees that the existing code is wrong, so why preserve the existing behavior if we don't have a valid datalayout? Ie, there's no risk for real targets because they always have a non-zero getLargestLegalIntTypeSizeInBits()?

Agreed, I'll remove that part then. That was only to make other testcases happy by preserving existing behavior. Thanks

I think this patch is good to go, I can push this if someone accepts. I'll fix the comments.

I still don't understand the (LargestInt == 0) hack. I think everyone agrees that the existing code is wrong, so why preserve the existing behavior if we don't have a valid datalayout? Ie, there's no risk for real targets because they always have a non-zero getLargestLegalIntTypeSizeInBits()?

Agreed, I'll remove that part then. That was only to make other testcases happy by preserving existing behavior. Thanks

Presumably those testcases should simply be updated to contain the expected datalayout?

I think this patch is good to go, I can push this if someone accepts. I'll fix the comments.

I still don't understand the (LargestInt == 0) hack. I think everyone agrees that the existing code is wrong, so why preserve the existing behavior if we don't have a valid datalayout? Ie, there's no risk for real targets because they always have a non-zero getLargestLegalIntTypeSizeInBits()?

Agreed, I'll remove that part then. That was only to make other testcases happy by preserving existing behavior. Thanks

Presumably those testcases should simply be updated to contain the expected datalayout?

I think so...

hiraditya commandeered this revision.Sep 30 2018, 10:15 AM
hiraditya added a reviewer: DIVYA.

To fix tests and comments from reviewers.

Fix some unit-tests, a few more remaining.

lebedev.ri added inline comments.Sep 30 2018, 10:19 AM
lib/Transforms/InstCombine/InstCombineCalls.cpp
133

You can check that the Size is power of two here.

139–140

I think we decided that the tests should be updated instead?

142

You overrode LargestInt already, it can't be 0.
This should only be Size > LargestInt.
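The guard structure these review comments converge on can be modeled as follows (an illustrative sketch, not the patch's actual code; function names are made up, and the "bail out when no datalayout" branch reflects the later decision to stop inlining memcpy without a datalayout):

```cpp
#include <cstdint>

// Equivalent of !isPowerOf2_64(Size) for this standalone model.
static bool isPowerOfTwo(uint64_t X) { return X != 0 && (X & (X - 1)) == 0; }

// Check the cheap size conditions first, then compare against the
// datalayout-derived limit; no special LargestInt == 0 override remains.
static bool bailOut(uint64_t SizeBytes, uint64_t LargestIntBytes) {
  if (!isPowerOfTwo(SizeBytes))
    return true;                 // odd sizes stay as memcpy
  if (LargestIntBytes == 0)
    return true;                 // no datalayout: don't expand
  return SizeBytes > LargestIntBytes;
}
```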

Fixed all testcases, and removed memcpy inlining when target datalayout is not present.

Moved the check of Size before checking the LargestInt.

hiraditya marked 3 inline comments as done.Sep 30 2018, 12:28 PM
hiraditya added reviewers: sebpop, SirishP.

Code looks good.

test/Transforms/InstCombine/element-atomic-memintrins.ll
101 ↗(On Diff #167660)

Please just use utils/update_test_checks.py, as stated in the first line of the test.

test/Transforms/InstCombine/memcpy-to-load.ll
82

Will running the utils/update_test_checks.py preserve this 'inline' comments?

lebedev.ri added inline comments.Sep 30 2018, 1:45 PM
lib/Transforms/InstCombine/InstCombineCalls.cpp
133

Actually, please use !isPowerOf2_64(Size).

hiraditya updated this revision to Diff 167673.Oct 1 2018, 12:00 AM

Updated testcases with utils/update_test_checks.py
use isPowerOf2_64

hiraditya marked 2 inline comments as done.Oct 1 2018, 12:01 AM
dmgreen added inline comments.Oct 1 2018, 6:26 AM
lib/Transforms/InstCombine/InstCombineCalls.cpp
140

Are the units of these the same? Or is one bits and the other bytes?

jfb added a comment.Oct 1 2018, 10:44 AM

I don't understand why it makes sense to tie this to the larger integer type. I understand that targets might want what you're doing, but tying it to the largest integer type instead of having a separate target-specific value seems odd.
How does this interact with memcpyopt?
Which targets are affected by this change?
Can you please provide size and performance numbers for relevant targets?

lib/Transforms/InstCombine/InstCombineCalls.cpp
139

Why does this make sense? I don't understand why we'd want to tie this to the largest legal integer type, and not have it be its own target parameter.

test/DebugInfo/X86/array2.ll
20

Why change this?

test/Transforms/InstCombine/2007-10-10-EliminateMemCpy.ll
3 ↗(On Diff #167673)

Why change this?

test/Transforms/InstCombine/alloca.ll
147 ↗(On Diff #167673)

This test checks that the inalloca remains. Change the test to still test the same thing, don't just delete the CHECK.
https://bugs.llvm.org/show_bug.cgi?id=19569

jfb added a subscriber: MatzeB.Oct 1 2018, 10:44 AM
hiraditya added inline comments.Oct 2 2018, 7:17 AM
lib/Transforms/InstCombine/InstCombineCalls.cpp
139

We used 'largest legal integer size' because that will fit in a register for sure. I think making a target specific parameter seems reasonable, or maybe using TargetLowering.h:getMaxStoresPerMemcpy() which is already available.

140

Thank you!

spatel added inline comments.Oct 2 2018, 8:44 AM
lib/Transforms/InstCombine/InstCombineCalls.cpp
139

I'm not understanding this discussion...
This patch is trying to apply some target-based constraint to a questionable (reverse) IR canonicalization that currently just pulls a number out of the air.
The ultimate goal would be to simply always canonicalize to memcpy and not expand it ever in instcombine as mentioned in D52081.
But we don't do that (yet) because we're afraid of missed optimizations that can't be replicated in the backend.
Expanding memcpy for performance using target parameters belongs in a late, target-aware pass (and as mentioned, it already exists), not early in generic instcombine.

jfb added inline comments.Oct 2 2018, 9:24 AM
lib/Transforms/InstCombine/InstCombineCalls.cpp
139

Let me try to expand a bit, and let me know if that makes sense:

IMO LargestLegalIntTypeSizeInBits doesn't make sense to use here. On ARM I might want to use paired integer load / store for memcpy, or paired Q registers. These have nothing to do with the largest legal integer type. It doesn't matter what the patch is trying to do or what is currently being done: the patch is adding something weird. It shouldn't.

lebedev.ri added inline comments.Oct 2 2018, 9:45 AM
lib/Transforms/InstCombine/InstCombineCalls.cpp
139

It's replacing magic hardcoded number with a number pulled out of target datalayout.
Should be a new field in target datalayout?
I'm not sure if we can get TargetLowering info from the backend in the middle-end?

spatel added inline comments.Oct 2 2018, 9:50 AM
lib/Transforms/InstCombine/InstCombineCalls.cpp
139

I agree that this is a strange transform for instcombine. We're trying to safely replace an obviously bogus magic number constraint (8) with something based on target reality.

So this is an intermediate step as I see it. The goal of this patch (and this may be counter to the original goal...) is to not expand memcpy(x, y, 8) on a 32-bit system because that could take >1 load/store pair. The final goal is to not expand memcpy at all.

IIUC, you'd like to make this transform more aggressive rather than more conservative though? Ie, a pair of registers or Q regs are always bigger than LargestLegalIntType (let me know if I'm wrong).

But that's moving us further away from the target-independent canonicalization goal of instcombine. The kind of expansion where we ask if pairs or vector regs are available should be happening in memcpyopt (if it's not, then that should be enhanced in that pass/lowering).

jfb added inline comments.Oct 2 2018, 9:55 AM
lib/Transforms/InstCombine/InstCombineCalls.cpp
139

All I'm saying is that replacing MagicNumber == 8 with ErroneousUsage == LargestLegalIntTypeSizeInBits isn't what we should be doing.

hiraditya added inline comments.Oct 2 2018, 10:14 AM
lib/Transforms/InstCombine/InstCombineCalls.cpp
139

Totally agreed with @spatel; ideally we shouldn't be inlining memcpy this early on, and we should try to stay independent of target-specific behavior in instcombine. It would be ideal to just not inline memcpy here and do it all in memcpyopt with a better cost model.

The ultimate goal would be to simply always canonicalize to memcpy and not expand it ever in instcombine as mentioned in D52081.

Looks like we all agree on this now.

But we don't do that (yet) because we're afraid of missed optimizations that can't be replicated in the backend.

Perhaps the impact is negligible, non-existent, and we worry about this for nothing. As also suggested earlier, I will try to get some numbers on the table for ARM and AArch64 if we strip out the lowering here, if that is helpful for this discussion, but probably need a day or two to get them.

The ultimate goal would be to simply always canonicalize to memcpy and not expand it ever in instcombine as mentioned in D52081.

Looks like we all agree on this now.

But we don't do that (yet) because we're afraid of missed optimizations that can't be replicated in the backend.

Perhaps the impact is negligible, non-existent, and we worry about this for nothing. As also suggested earlier, I will try to get some numbers on the table for ARM and AArch64 if we strip out the lowering here, if that is helpful for this discussion, but probably need a day or two to get them.

If you could provide some numbers, I can go ahead and remove the inlining of memcpy altogether provided the reviewers agree with it, or we can merge this patch which is trying to improve on previously hardcoded numbers.

Perhaps the impact is negligible, non-existent, and we worry about this for nothing. As also suggested earlier, I will try to get some numbers on the table for ARM and AArch64 if we strip out the lowering here, if that is helpful for this discussion, but probably need a day or two to get them.

If you could provide some numbers, I can go ahead and remove the inlining of memcpy altogether provided the reviewers agree with it, or we can merge this patch which is trying to improve on previously hardcoded numbers.

Yes, I support removing the expansion entirely, but I don't think we can commit that change without doing some advance perf testing.
And yes, in the best case, we'll discover that there are no regressions because all of the other analyses and lowering will do the transform as intended when it's profitable.

If that doesn't work though, using LargestLegalIntTypeSizeInBits still seems like a good compromise to me. We want to conservatively limit the expansion to a size/type that the target tells us is ok (can be performed with a single load/store), and that's the value that most closely matches what we have today, so we avoid regressions as we work to the goal. It's not the ideal change, but there's precedence for this sort of datalayout use in instcombine (see InstCombiner::shouldChangeType()). Adding a new specifier to the datalayout to account for things like pair ops or vectors doesn't make sense to me - that moves us away from the goal of improving the other passes and removing the expansion in instcombine.

Perhaps the impact is negligible, non-existent, and we worry about this for nothing. As also suggested earlier, I will try to get some numbers on the table for ARM and AArch64 if we strip out the lowering here, if that is helpful for this discussion, but probably need a day or two to get them.

If you could provide some numbers, I can go ahead and remove the inlining of memcpy altogether provided the reviewers agree with it, or we can merge this patch which is trying to improve on previously hardcoded numbers.

Yes, I support removing the expansion entirely, but I don't think we can commit that change without doing some advance perf testing.
And yes, in the best case, we'll discover that there are no regressions because all of the other analyses and lowering will do the transform as intended when it's profitable.

Just to drop a nit, here is a somewhat idiomatic pattern, that is recommended to avoid breaking strict aliasing: https://godbolt.org/z/_JNwOp
If we don't do this memcpy expansion here, will we have the memcpy until the backend?
Surely this is not good.

spatel added a comment.Oct 2 2018, 1:38 PM

Perhaps the impact is negligible, non-existent, and we worry about this for nothing. As also suggested earlier, I will try to get some numbers on the table for ARM and AArch64 if we strip out the lowering here, if that is helpful for this discussion, but probably need a day or two to get them.

If you could provide some numbers, I can go ahead and remove the inlining of memcpy altogether provided the reviewers agree with it, or we can merge this patch which is trying to improve on previously hardcoded numbers.

Yes, I support removing the expansion entirely, but I don't think we can commit that change without doing some advance perf testing.
And yes, in the best case, we'll discover that there are no regressions because all of the other analyses and lowering will do the transform as intended when it's profitable.

Just to drop a nit, here is a somewhat idiomatic pattern, that is recommended to avoid breaking strict aliasing: https://godbolt.org/z/_JNwOp
If we don't do this memcpy expansion here, will we have the memcpy until the backend?
Surely this is not good.

That case (and a couple of similar tests that I tried) are handled by -sroa, so they probably never made it to instcombine in the 1st place. I don't know anything about SROA, but hopefully, it's making that transform using some principled logic. :)

jfb added a comment.Oct 2 2018, 1:50 PM

Perhaps the impact is negligible, non-existent, and we worry about this for nothing. As also suggested earlier, I will try to get some numbers on the table for ARM and AArch64 if we strip out the lowering here, if that is helpful for this discussion, but probably need a day or two to get them.

If you could provide some numbers, I can go ahead and remove the inlining of memcpy altogether provided the reviewers agree with it, or we can merge this patch which is trying to improve on previously hardcoded numbers.

Yes, I support removing the expansion entirely, but I don't think we can commit that change without doing some advance perf testing.
And yes, in the best case, we'll discover that there are no regressions because all of the other analyses and lowering will do the transform as intended when it's profitable.

Just to drop a nit, here is a somewhat idiomatic pattern, that is recommended to avoid breaking strict aliasing: https://godbolt.org/z/_JNwOp
If we don't do this memcpy expansion here, will we have the memcpy until the backend?
Surely this is not good.

That case (and a couple of similar tests that I tried) are handled by -sroa, so they probably never made it to instcombine in the 1st place. I don't know anything about SROA, but hopefully, it's making that transform using some principled logic. :)

clang also performs some of this work, even at O0.

Perhaps the impact is negligible, non-existent, and we worry about this for nothing. As also suggested earlier, I will try to get some numbers on the table for ARM and AArch64 if we strip out the lowering here, if that is helpful for this discussion, but probably need a day or two to get them.

If you could provide some numbers, I can go ahead and remove the inlining of memcpy altogether provided the reviewers agree with it, or we can merge this patch which is trying to improve on previously hardcoded numbers.

Yes, I support removing the expansion entirely, but I don't think we can commit that change without doing some advance perf testing.

Makes sense.

And yes, in the best case, we'll discover that there are no regressions because all of the other analyses and lowering will do the transform as intended when it's profitable.

If that doesn't work though, using LargestLegalIntTypeSizeInBits still seems like a good compromise to me. We want to conservatively limit the expansion to a size/type that the target tells us is ok (can be performed with a single load/store), and that's the value that most closely matches what we have today, so we avoid regressions as we work to the goal. It's not the ideal change, but there's precedence for this sort of datalayout use in instcombine (see InstCombiner::shouldChangeType()). Adding a new specifier to the datalayout to account for things like pair ops or vectors doesn't make sense to me - that moves us away from the goal of improving the other passes and removing the expansion in instcombine.

Do you think getting this patch in is good for now? After some performance analysis, if we find that we don't need to inline memcpy here, we can remove it entirely at a later stage.

spatel added a comment.Oct 3 2018, 2:32 PM

Do you think getting this patch in is good for now? After some performance analysis, if we find that we don't need to inline memcpy here, we can remove it entirely at a later stage.

We should wait for the perf results from @SjoerdMeijer (and anyone else that is in a position to collect benchmark results?).
If the full solution (don't expand at all) is too ambitious, then I support continuing with this patch, but we need to make sure that we've answered @jfb's concerns. Ie, if the datalayout query is too distasteful, what is the alternative?

I am doing some experiments with a hack that simply comments out the call to InstCombiner::SimplifyAnyMemTransfer. I've run 3 smaller benchmarks. 2 didn't show any difference. The 3rd shows 1 tiny regression and 1 tiny improvement in a test case, and will probably not even show a difference in the geomean. I am doing this as a background task; tomorrow I will run bigger benchmarks on different platforms. But I guess we need some numbers for non-Arm platforms too.

Oops, sorry, forgot to reply, and also got distracted by a few other things. I ran one bigger benchmark, and didn't see anything worth mentioning. A first preliminary conclusion: looks like we don't miss much by not doing this lowering here. Disclaimer: I've tested only on Arm targets, definitely not on all interesting architecture combinations, and a handful of benchmarks.

lebedev.ri requested changes to this revision.Oct 16 2018, 8:27 AM
lebedev.ri added inline comments.
lib/Transforms/InstCombine/InstCombineCalls.cpp
140

I think this is still broken? (and thus, any and all benchmarks thus far are incorrect.)
As it was previously pointed out by @dmgreen, Size is bytes, while LargestInt is clearly in bits.
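The unit mismatch flagged here can be illustrated with a tiny model (hypothetical names; the point is only that one operand is in bytes and the other in bits):

```cpp
#include <cstdint>

// Size comes from the memcpy length argument in *bytes*, while
// getLargestLegalIntTypeSizeInBits() is, as the name says, in *bits*.
// Comparing them directly (SizeBytes > LargestIntBits) silently approves
// copies 8x larger than intended; convert one side before comparing.
static bool fitsInLargestLegalInt(uint64_t SizeBytes, uint64_t LargestIntBits) {
  return SizeBytes * 8 <= LargestIntBits;
}
```

With the buggy comparison, a 16-byte copy on a 64-bit-integer target would pass (16 > 64 is false) even though it needs two registers; the converted comparison correctly rejects it (128 <= 64 is false).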

This revision now requires changes to proceed.Oct 16 2018, 8:27 AM

Fix testcases and address comments

hiraditya marked 2 inline comments as done.Oct 21 2018, 12:37 PM
lebedev.ri resigned from this revision.Oct 30 2018, 3:23 PM

Not sure what the general consensus is wrt this patch, but I guess it now consistently uses bytes.

lib/Transforms/InstCombine/InstCombineCalls.cpp
199

Nit: spaces around /.

Not sure what the general consensus is wrt this patch, but I guess it now consistently uses bytes.

Agree - the code change looks like what I expected now.

So:

  1. Is @jfb objecting to this as an intermediate improvement?
  2. If not, there are unanswered comments about the test diffs.
  3. There was also an unanswered request for the targets and size/perf improvements for this change (presumably some 32-bit target shows wins).
  4. Are there any updates on the perf for the ideal change (removing this expansion completely)?
jfb added a comment.Nov 1 2018, 10:44 AM

Not sure what the general consensus is wrt this patch, but I guess it now consistently uses bytes.

Agree - the code change looks like what I expected now.

So:

  1. Is @jfb objecting to this as an intermediate improvement?

My concern is here: https://reviews.llvm.org/D35035#inline-464768
This uses an unrelated constant to drive optimization decisions. Create a new per-target constant.

  1. If not, there are unanswered comments about the test diffs.
  2. There was also an unanswered request for the targets and size/perf improvements for this change (presumably some 32-bit target shows wins).
  3. Are there any updates on the perf for the ideal change (removing this expansion completely)?

I do want to see size and perf results.

In D35035#1284164, @jfb wrote:

Not sure what the general consensus is wrt this patch, but I guess it now consistently uses bytes.

Agree - the code change looks like what I expected now.

So:

  1. Is @jfb objecting to this as an intermediate improvement?

My concern is here: https://reviews.llvm.org/D35035#inline-464768
This uses an unrelated constant to drive optimization decisions. Create a new per-target constant.

I think the intent of this patch is to make a best-effort improvement without relying heavily on target-specific constants. Would it help to have a per-target constant only for mem* functions? Could that constant be used to drive other optimizations?

  1. If not, there are unanswered comments about the test diffs.
  2. There was also an unanswered request for the targets and size/perf improvements for this change (presumably some 32-bit target shows wins).
  3. Are there any updates on the perf for the ideal change (removing this expansion completely)?

I do want to see size and perf results.

jfb added a comment.Nov 13 2018, 9:57 AM
In D35035#1284164, @jfb wrote:

Not sure what the general consensus is wrt this patch, but I guess it now consistently uses bytes.

Agree - the code change looks like what I expected now.

So:

  1. Is @jfb objecting to this as an intermediate improvement?

My concern is here: https://reviews.llvm.org/D35035#inline-464768
This uses an unrelated constant to drive optimization decisions. Create a new per-target constant.

I think the intent of this patch is to make a best-effort improvement without relying heavily on target-specific constants. Would it help to have a per-target constant only for mem* functions? Could that constant be used to drive other optimizations?

Yes, please add a constant which informs this optimization. Don't reuse an unrelated constant for this purpose. Don't use this new constant to drive other unrelated optimizations.

  1. If not, there are unanswered comments about the test diffs.
  2. There was also an unanswered request for the targets and size/perf improvements for this change (presumably some 32-bit target shows wins).
  3. Are there any updates on the perf for the ideal change (removing this expansion completely)?

I do want to see size and perf results.

That case (and a couple of similar tests that I tried) are handled by -sroa, so they probably never made it to instcombine in the 1st place. I don't know anything about SROA, but hopefully, it's making that transform using some principled logic. :)

FYI: @chandlerc mentioned to me (in the context of https://bugs.llvm.org/show_bug.cgi?id=39780 ) that SROA expects instcombine to handle memcpy and promote them to registers.

lib/Transforms/InstCombine/InstCombineCalls.cpp
139

All I'm saying is that replacing MagicNumber == 8 with ErroneousUsage == LargestLegalIntTypeSizeInBits isn't what we should be doing.

@jfb can you elaborate why we shouldn't do that?

I'm not sure I understand why you are qualifying this as ErroneousUsage.
I can see that it would be suboptimal in some cases (and ideally it would be a TTI hook), but that is not enough to make it erroneous. I see a rationale for assuming that if I have a legal integer, I can likely load/store it with a single pair of instructions. So it seems to me like a strict improvement over the existing "magic number".

That case (and a couple of similar tests that I tried) are handled by -sroa, so they probably never made it to instcombine in the 1st place. I don't know anything about SROA, but hopefully, it's making that transform using some principled logic. :)

FYI: @chandlerc mentioned to me (in the context of https://bugs.llvm.org/show_bug.cgi?id=39780 ) that SROA expects instcombine to handle memcpy and promote them to registers.

Is it possible to improve SROA to understand llvm.memcpy instead? IIUC, the memcpy intrinsic is considered canonical form, and the consensus in this review was that we would not expand memcpy in instcombine if that did not cause regressions.

D54887 proposes a change similar to this one, but more liberal. It would allow expansion to load/store of weird types (non-power-of-2) as long as they fit in a single integer register. That seems dangerous given that the backend isn't very good at handling those illegal types. Example:

target datalayout = "e-p:64:64:64-p1:32:32:32-p2:16:16:16-n8:16:32:64"
declare void @llvm.memcpy.p0i8.p0i8.i32(i8* nocapture, i8* nocapture, i32, i1) nounwind

define void @copy_3_bytes(i8* %d, i8* %s) {
  call void @llvm.memcpy.p0i8.p0i8.i32(i8* %d, i8* %s, i32 3, i1 false)
  ret void
}
$ opt -instcombine -S memcpy.ll
...
define void @copy_3_bytes(i8* %d, i8* %s) {
  %1 = bitcast i8* %s to i24*
  %2 = bitcast i8* %d to i24*
  %3 = load i24, i24* %1, align 1
  store i24 %3, i24* %2, align 1
  ret void
}
jfb added inline comments.Nov 26 2018, 10:49 AM
lib/Transforms/InstCombine/InstCombineCalls.cpp
139

Because we should be able to tune this magic number over time, per target, and using LargestLegalIntTypeSizeInBits doesn't allow us to do so. It's trivial to add this new magic number, and default it to LargestLegalIntTypeSizeInBits in the generic target setup. Let's just do that, and then the code is even self-documenting.
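The separate, tunable knob jfb asks for here could take roughly this shape (a hypothetical sketch; the struct, field, and function names are invented, not LLVM API):

```cpp
#include <cstdint>

// A dedicated per-target limit for expanding @llvm.memcpy in InstCombine.
// It merely *defaults* to the largest legal integer width, so targets can
// tune it independently later without touching the datalayout.
struct TargetMemTransferInfo {
  unsigned MaxExpansionBits; // max copy size to expand to a load/store pair
};

static TargetMemTransferInfo makeDefaultInfo(unsigned LargestLegalIntBits) {
  // Generic target setup: seed the knob from the legal-integer width,
  // keeping the historical 64-bit limit when no datalayout is present.
  return {LargestLegalIntBits != 0 ? LargestLegalIntBits : 64};
}
```

The point of the indirection is self-documentation: the optimization reads its own named constant, and the fact that it happens to equal LargestLegalIntTypeSizeInBits is a default, not a semantic dependency.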

alnemr55 accepted this revision.Jan 4 2019, 2:38 AM
This revision is now accepted and ready to land.Jan 4 2019, 2:38 AM
lebedev.ri requested changes to this revision.Jan 4 2019, 3:05 AM

Data layout question wasn't resolved. (just signalling, i don't have a preference one way or another)

This revision now requires changes to proceed.Jan 4 2019, 3:05 AM

Thanks Roman.

@alnemr55 - Why did you mark this as accepted? Maybe it was an accidental click? If not, please don't mark things as accepted unless you've been actively participating in the review and the review has genuinely concluded.

ychen added a subscriber: ychen.Jun 20 2019, 4:05 PM
Herald added a project: Restricted Project. · View Herald TranscriptJun 20 2019, 4:05 PM
lebedev.ri resigned from this revision.Jan 12 2023, 4:41 PM

This review seems to be stuck/dead, consider abandoning if no longer relevant.

This revision is now accepted and ready to land.Jan 12 2023, 4:41 PM
Herald added a project: Restricted Project. · View Herald TranscriptJan 12 2023, 4:41 PM
Herald added a subscriber: StephenFan. · View Herald Transcript