This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] try to narrow a truncated load
ClosedPublic

Authored by spatel on Jul 9 2019, 11:29 AM.

Details

Summary

trunc (load X) --> load (bitcast X to narrow type)

We have this transform in DAGCombiner::ReduceLoadWidth(), but the truncated load pattern can interfere with other instcombine transforms, so I'd like to allow the fold sooner.

Example:
https://bugs.llvm.org/show_bug.cgi?id=16739
...in that report, we have bitcasts bracketing these ops, so those could get eliminated too.

We've generally ruled out widening of loads early in IR ( LoadCombine - http://lists.llvm.org/pipermail/llvm-dev/2016-September/105291.html ), but I'm not sure if that reasoning applies to narrowing.

There's another request for narrowing in IR here, but it's a different pattern:
https://bugs.llvm.org/show_bug.cgi?id=42424

Diff Detail

Event Timeline

spatel created this revision.Jul 9 2019, 11:29 AM
Herald added a project: Restricted Project. · View Herald TranscriptJul 9 2019, 11:29 AM

I'm not sure that doing this at the IR level is the best idea. The problem is that when we narrow, we loose the dereferenceable fact about part of the memory access. This can in turn limit other transforms which would have been profitable. As an example:
a = load <2 x i8>* p
b = load <2 x i8>* (p+1)
sum = a[0] + a[1] + b[1]

Narrowing the b load to i8 looses the fact that the memory location corresponding to b[0] is dereferenceable, which would prevent transforms such as:
a = load <4 x i8>* p
a[2] = 0;
sum = horizontal_sum(a);

(Note: I'm not saying this alternate transform is always profitable. I'm just making a point about lost opportunity.)

I'm not sure that doing this at the IR level is the best idea. The problem is that when we narrow, we loose the dereferenceable fact about part of the memory access. This can in turn limit other transforms which would have been profitable. As an example:
a = load <2 x i8>* p
b = load <2 x i8>* (p+1)
sum = a[0] + a[1] + b[1]

Narrowing the b load to i8 looses the fact that the memory location corresponding to b[0] is dereferenceable, which would prevent transforms such as:
a = load <4 x i8>* p
a[2] = 0;
sum = horizontal_sum(a);

(Note: I'm not saying this alternate transform is always profitable. I'm just making a point about lost opportunity.)

Yes, I agree that we can lose information by narrowing. I was hoping to avoid that conflict with D64258, but we need to refine our definition of 'dereferenceable'. Potentially, we could have instcombine preserve the dereferenceable range in this transform?

I don't know if it makes a difference, but my intent is to not allow narrowing for vectors in this patch by using the data layout legality check. (We could make that vector bailout explicit.) So I don't think the given example with a vector type is at risk.

define i8 @narrowload(<2 x i8>* %p) {
  %a = load <2 x i8>, <2 x i8>* %p
  %p1 = getelementptr <2 x i8>, <2 x i8>* %p, i64 1
  %b = load <2 x i8>, <2 x i8>* %p1
  %a0 = extractelement <2 x i8> %a, i64 0
  %a1 = extractelement <2 x i8> %a, i64 1
  %b1 = extractelement <2 x i8> %b, i64 1
  %add1 = add i8 %a0, %a1
  %add2 = add i8 %add1, %b1
  ret i8 %add2
}
;sum = a[0] + a[1] + b[1]
jdoerfert added a comment.EditedJul 14 2019, 11:18 AM

I'm not sure that doing this at the IR level is the best idea. The problem is that when we narrow, we loose the dereferenceable fact about part of the memory access. This can in turn limit other transforms which would have been profitable. As an example:
a = load <2 x i8>* p
b = load <2 x i8>* (p+1)
sum = a[0] + a[1] + b[1]

Narrowing the b load to i8 looses the fact that the memory location corresponding to b[0] is dereferenceable, which would prevent transforms such as:
a = load <4 x i8>* p
a[2] = 0;
sum = horizontal_sum(a);

(Note: I'm not saying this alternate transform is always profitable. I'm just making a point about lost opportunity.)

Could we check here if the base pointer has dereferenceable annotation and use that as a condition for this transformation? (It's more complicated to be completely lossless but this seems to be an easy to test starting point).

Could we check here if the base pointer has dereferenceable annotation and use that as a condition for this transformation? (It's more complicated to be completely lossless but this seems to be an easy to test starting point).

That seems like a reasonable way to overcome the potential loss, so I'll update this patch with that change.
I'm expecting that will significantly reduce the opportunities for the transform though...until Attributor is working at full strength.

spatel updated this revision to Diff 210151.Jul 16 2019, 12:41 PM

Patch updated:

  1. Add limitation based on dereferenceable attribute to prevent information loss.
  2. Add/adjust tests to include dereferenceable attributes.

I think you are missing the negative test without dereferenceable.

I'm expecting that will significantly reduce the opportunities for the transform though...until Attributor is working at full strength.

Agreed. Though, we already have dereferenceabl arguments. @uenoku Is now working on dereferenceable deduction for the Attributor :)

spatel updated this revision to Diff 210342.Jul 17 2019, 8:45 AM

Patch updated:
Add a test with no 'dereferenceable' attribute on the pointer argument.

I think you are missing the negative test without dereferenceable.

Yes - updated.

I'm expecting that will significantly reduce the opportunities for the transform though...until Attributor is working at full strength.

Agreed. Though, we already have dereferenceabl arguments. @uenoku Is now working on dereferenceable deduction for the Attributor :)

Great!

Looks like you guys came to a completely reasonable short term solution with the dereferenceable limitation. I agree we need a better answer here long term, but being able to make progress on short term items without blocking is good to see.

Amusingly, I just stumbled across an example locally where narrowing a vector load was the right answer performance wise. :)

lebedev.ri accepted this revision.Jul 24 2019, 5:07 AM
lebedev.ri added a subscriber: lebedev.ri.

LGTM with dereferenceable() restriction, but maybe wait for one more review.

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
714

This one is deprecated, let's use the proper one

LoadInst *NarrowLoad = new LoadInst(PtrTy, Bitcast);
llvm/test/Transforms/InstCombine/trunc-load.ll
8

Please precommit the tests.

This revision is now accepted and ready to land.Jul 24 2019, 5:07 AM
spatel marked an inline comment as done.Jul 24 2019, 5:13 AM
spatel added inline comments.
llvm/test/Transforms/InstCombine/trunc-load.ll
8

Yes - the tests were evolving with the code changes here, but now that we are settled on the constraints, I'll push those with baseline results as a preliminary step.

Thanks for reviewing!

spatel updated this revision to Diff 211500.Jul 24 2019, 7:21 AM

Patch updated:

  1. Use new API for creating load that takes element type and pointer.
  2. Rebased after adding baseline tests (rL366901).
spatel marked an inline comment as done.Jul 24 2019, 7:23 AM
spatel added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
714

This isn't obvious (because there are apparently no header comments for any of these constructors), but the type specified by that 1st parameter is the element type of the loaded value, not the pointer type.

jdoerfert accepted this revision.Jul 24 2019, 12:57 PM

One final request, otherwise LGTM.

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
715

I think you need to copy over more information (probably there is code for that somewhere):

  • !nontemporal !<index>
  • !invariant.load !<index>
  • !invariant.group !<index>
spatel marked an inline comment as done.Jul 24 2019, 3:50 PM
spatel added inline comments.
llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
715

The closest match that I see is here:
rL366949

This revision was automatically updated to reflect the committed changes.