This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine] Transform extractelement-trunc -> bitcast-extractelement
ClosedPublic

Authored by dsprenkels on Mar 28 2020, 4:53 AM.

Details

Summary

Canonicalize the case when a scalar extracted from a vector is
truncated. Transform such cases to bitcast-then-extractelement.
This will enable erasing the truncate operation.

This commit fixes PR45314.

Diff Detail

Event Timeline

dsprenkels created this revision.Mar 28 2020, 4:53 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 28 2020, 4:53 AM

It was a bit unclear to me how involved the tests should be for this patch. At this time, I kept them pretty minimal, but I could add more if that's desired.

It was a bit unclear to me how involved the tests should be for this patch. At this time, I kept them pretty minimal, but I could add more if that's desired.

I didn't look at the logic closely, but seems to be on the right track from the tests (feel free to include Alive2 links in the review if you tested any/all of these).
I see at least 2 variations where we need more code logic (and more tests):

; In IR, crazy types are allowed.
define i13 @shrinkExtractElt_i67_to_i13_2(<3 x i67> %x) {
  %e = extractelement <3 x i67> %x, i459 2
  %t = trunc i67 %e to i13
  ret i13 %t
}

; We generally don't want to canonicalize to a form that increases the instruction count.
declare void @use(i64)
define i16 @shrinkExtractElt_i64_to_i16_2_extra_use(<3 x i64> %x) {
  %e = extractelement <3 x i64> %x, i64 2
  call void @use(i64 %e)
  %t = trunc i64 %e to i16
  ret i16 %t
}

I didn't run those through 'opt', so check for typos.

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
855

m_ExtractElement() for matching/capturing the operands ?

llvm/test/Transforms/InstCombine/pr45314_be.ll
4

Only need the first "E" here (and that makes it clear that the transform depends on endian and not something else in that string).

Should we be checking anything about legality of the types here?

Do not canonicalize if the extractvector operand has other users.
Do not canonicalize if it would result in an invalid bitcast instruction.

dsprenkels marked 2 inline comments as done.Mar 28 2020, 9:49 AM

I didn't look at the logic closely, but seems to be on the right track from the tests (feel free to include Alive2 links in the review if you tested any/all of these).
I see at least 2 variations where we need more code logic (and more tests):

I added these cases to the patch.

For the case of the crazy types I added a check that requires the respective sizes of the vector and the truncated value to be compatible. This should prevent against an invalid bitcast being created. However, I still kinda allow really crazy types. Would it be better to just disable this canonicalization for types that don't look nice ? (Are "nice" types somehow even defined?)

Some examples that I checked with Alive2:

Little endian
  - <3 x i64> to i32: http://volta.cs.utah.edu:8080/z/tt-qpe
  - <3 x i64> to i32: http://volta.cs.utah.edu:8080/z/KAkFBh (idx=1)
  - <3 x i64> to i16: http://volta.cs.utah.edu:8080/z/EqzJSY (idx=2)

Big endian:
  - <3 x i64> to i32: http://volta.cs.utah.edu:8080/z/FRbHAA
  - <3 x i64> to i32: http://volta.cs.utah.edu:8080/z/oFvAdt (idx=1)
  - <3 x i64> to i16: http://volta.cs.utah.edu:8080/z/yB7Hng (idx=2)

See inline for a few code nits, and I'd make some test changes:

  1. Combine the 2 separate test files into 1 to reduce duplication and make the endian diffs easier to see. Take a look at llvm/test/Transforms/InstCombine/extractelement.ll to see an example.
  2. Name the file based on the transform rather than a bug name.
  3. Add a test that corresponds to the larger pattern from the bug report, so we know that the sequence of transforms within instcombine works on the motivating bug:
define <4 x i64> @PR45314(<4 x i64> %x) {
  %e = extractelement <4 x i64> %x, i32 0
  %t = trunc i64 %e to i32
  %i = insertelement <8 x i32> undef, i32 %t, i32 0
  %s = shufflevector <8 x i32> %i, <8 x i32> undef, <8 x i32> zeroinitializer
  %b = bitcast <8 x i32> %s to <4 x i64>
  ret <4 x i64> %b
}
  1. Generate the baseline CHECK lines without this code patch in place and push the tests to master as a preliminary NFC patch. Then apply this code patch, so we see the code diffs resulting from this patch in this review. That way, the tests will safely remain even if this code patch has to be reverted for some reason.
llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
847

I don't think we use the triple-slash documentation comment within a function. Either change to the standard "//" or make a helper function with the doxygen comment (similar to "foldVecTruncToExtElt" just above here).

855

Don't initialize this to nullptr. If the match fails, the variable should not be used, so initializing hides a potential compile-time warning for an unintended use of that variable. The same should be true of the existing variables, so if you want to fix them 1st with an NFC patch, that would be ok.

864

The VecNumElts factor doesn't change the modulo constraint?

865

Would be slightly easier to read if we made a local name for the common factor like:

unsigned TruncRatio = VecOpScalarSize / DestScalarSize;

However, I still kinda allow really crazy types. Would it be better to just disable this canonicalization for types that don't look nice ? (Are "nice" types somehow even defined?)

That's similar to @lebedev.ri 's question about legality. We have helpers that look at the data-layout to decide if a type is target-friendly/legal:

bool InstCombiner::shouldChangeType(unsigned FromWidth, unsigned ToWidth)
bool InstCombiner::shouldChangeType(Type *From, Type *To);

But that doesn't work with vector types (...because the data-layout doesn't specify vector types AFAIK). In this transform, we are not creating any new type except for a bitcast of a vector type. So I don't think we want to limit the transform. Targets/codegen should be able to deal with bitcasts of vector types.

dsprenkels marked an inline comment as done.Mar 29 2020, 7:29 AM

@spatel Thanks for the review. I will soon look into it!

Generate the baseline CHECK lines without this code patch in place and push the tests to master as a preliminary NFC patch. Then apply this code patch, so we see the code diffs resulting from this patch in this review. That way, the tests will safely remain even if this code patch has to be reverted for some reason.

I don't think I have the permissions do push directly to master. Is this easily fixed? Or should I create a separate differential for this?

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
864

This is intentional. In this check, I need the bit-width of the whole vector. Consider this example:

http://volta.cs.utah.edu:8080/z/-GpGHX

target datalayout = "e"

define i30 @src(<3 x i40> %x) {
  %e = extractelement <3 x i40> %x, i32 0
  %t = trunc i40 %e to i30
  ret i30 %t
}

define i30 @tgt(<3 x i40> %x) {
  %1 = bitcast <3 x i40> %x to <4 x i30>
  %t = extractelement <4 x i30> %1, i30 0
  ret i30 %t
}

This describes a valid case to be canonicalized by this patch, however VecOpScalarSize % DestScalarSize == 40 % 30 != 0. That is why I check if (VecNumElts * VecOpScalarSize) % DestScalarSize == (3 * 40) % 30 == 0.

Should I add a comment here to clarify this?

@spatel Thanks for the review. I will soon look into it!

Generate the baseline CHECK lines without this code patch in place and push the tests to master as a preliminary NFC patch. Then apply this code patch, so we see the code diffs resulting from this patch in this review. That way, the tests will safely remain even if this code patch has to be reverted for some reason.

I don't think I have the permissions do push directly to master. Is this easily fixed? Or should I create a separate differential for this?

My git lingo might be off here. Are you saying you don't have commit permissions for LLVM yet?
https://llvm.org/docs/DeveloperPolicy.html#obtaining-commit-access

If not, then yes please create a separate Phab review and someone will push that for you.

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
864

Hmm...how does that example translate for big-endian?
A code comment is good; another regression test is even better.

The new preliminary tests have been submitted in https://reviews.llvm.org/D77024. I removed the test for now and will submit another update when the other diff has been committed.

I also updated the diff in line with the inline comments.

dsprenkels marked 5 inline comments as done.Mar 29 2020, 1:03 PM

My git lingo might be off here. Are you saying you don't have commit permissions for LLVM yet?

Yup. Should I ask for it? (I don't know what the etiquette is; i.e. how many patches before one should request commit access.)

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
855

I updated every variable that this patch touches. I can probably fix the others later in an NFC patch.

864

You are right, this case would lead to an invalid transform. I fixed this, and added a test in https://reviews.llvm.org/D77024 that checks that these cases should not be changed.

My git lingo might be off here. Are you saying you don't have commit permissions for LLVM yet?

Yup. Should I ask for it? (I don't know what the etiquette is; i.e. how many patches before one should request commit access.)

I don't see it stated explicitly anywhere, but I think 2-3 functional patches usually qualifies as a good "track record". If you have that already, then please request. If not, no problem - I can commit on your behalf.

dsprenkels marked 2 inline comments as done.

Update the new trunc-extractelement.ll test file.

I think we're pretty close now.
Added another test, so please rebase/update:
rGbc60cdcc3f8

llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
869–872

The index math can overflow:

define i8 @src(<1073741824 x i32> %x) {
  %e = extractelement <1073741824 x i32> %x, i32 1073741823
  %t = trunc i32 %e to i8
  ret i8 %t
}

To be safe(r), use uint64_t for these variables. Normally, we want to have a regression test for a known problem like that, but I'm going to suggest not adding that because it could cost a lot of execution time for a test case that is probably not going to occur in the real-world before LLVM is long gone.

  • Updated the test.
  • Changed uint32_ts to uint64_t and added an assert to catch overflows.
dsprenkels marked 2 inline comments as done.Mar 30 2020, 9:17 AM
dsprenkels marked an inline comment as done.
spatel accepted this revision.Mar 30 2020, 9:59 AM

LGTM

This revision is now accepted and ready to land.Mar 30 2020, 9:59 AM

Let me look at the failed test in a couple of hours.

Btw. I have commit permissions now, so I will be able to commit this myself now. :)

Let me look at the failed test in a couple of hours.

Btw. I have commit permissions now, so I will be able to commit this myself now. :)

Great! A good first commit / NFC patch will update this old test file:
https://github.com/llvm/llvm-project/blob/master/llvm/test/Transforms/InstCombine/ExtractCast.ll
by using utils/update_test_checks.py to auto-generate the CHECK lines. I suspect that we're just picking up an existing test with this transform (not a bug).

  • Updated ExtractCast.ll test.

Should I explicitly add the data layout in ExtractCast.ll, because the added assertions are only valid on little endian platforms?

  • Updated ExtractCast.ll test.

Should I explicitly add the data layout in ExtractCast.ll, because the added assertions are only valid on little endian platforms?

I don’t see a need for that. That test is practically identical to the tests we added explicitly for this transform.

This revision was automatically updated to reflect the committed changes.

This patch triggers a regression on our side:

For the following code:

define dso_local <4 x i16> @truncate_v_v(<4 x i32> %lhs) #0 {
entry:
  %vecext = extractelement <4 x i32> %lhs, i32 0
  %conv = trunc i32 %vecext to i16
  %vecinit = insertelement <4 x i16> undef, i16 %conv, i32 0
  %vecext1 = extractelement <4 x i32> %lhs, i32 1
  %conv2 = trunc i32 %vecext1 to i16
  %vecinit3 = insertelement <4 x i16> %vecinit, i16 %conv2, i32 1
  %vecext4 = extractelement <4 x i32> %lhs, i32 2
  %conv5 = trunc i32 %vecext4 to i16
  %vecinit6 = insertelement <4 x i16> %vecinit3, i16 %conv5, i32 2
  %vecext7 = extractelement <4 x i32> %lhs, i32 3
  %conv8 = trunc i32 %vecext7 to i16
  %vecinit9 = insertelement <4 x i16> %vecinit6, i16 %conv8, i32 3
  ret <4 x i16> %vecinit9
}

The tests expects to see:

define dso_local <4 x i16> @truncate_v_v(<4 x i32> %lhs) local_unnamed_addr #0 {
entry:
  %0 = trunc <4 x i32> %lhs to <4 x i16>
  ret <4 x i16> %0
}

which, in machine instructions, is mapped onto a vector trunc instruction.

But now, we see:

define dso_local <4 x i16> @truncate_v_v(<4 x i32> %lhs) local_unnamed_addr #0 {
entry:
  %0 = bitcast <4 x i32> %lhs to <8 x i16>
  %vecinit9 = shufflevector <8 x i16> %0, <8 x i16> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  ret <4 x i16> %vecinit9
}

which is expanded into a large sequence of code going through the stack.

This patch triggers a regression on our side:

<...>

The tests expects to see:

define dso_local <4 x i16> @truncate_v_v(<4 x i32> %lhs) local_unnamed_addr #0 {
entry:
  %0 = trunc <4 x i32> %lhs to <4 x i16>
  ret <4 x i16> %0
}

which, in machine instructions, is mapped onto a vector trunc instruction.

But now, we see:

define dso_local <4 x i16> @truncate_v_v(<4 x i32> %lhs) local_unnamed_addr #0 {
entry:
  %0 = bitcast <4 x i32> %lhs to <8 x i16>
  %vecinit9 = shufflevector <8 x i16> %0, <8 x i16> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  ret <4 x i16> %vecinit9
}

which is expanded into a large sequence of code going through the stack.

This looks like a simple missed transform to me, not a miscompile

spatel added a comment.Apr 1 2020, 6:38 AM

This patch triggers a regression on our side:

<...>

The tests expects to see:

define dso_local <4 x i16> @truncate_v_v(<4 x i32> %lhs) local_unnamed_addr #0 {
entry:
  %0 = trunc <4 x i32> %lhs to <4 x i16>
  ret <4 x i16> %0
}

which, in machine instructions, is mapped onto a vector trunc instruction.

But now, we see:

define dso_local <4 x i16> @truncate_v_v(<4 x i32> %lhs) local_unnamed_addr #0 {
entry:
  %0 = bitcast <4 x i32> %lhs to <8 x i16>
  %vecinit9 = shufflevector <8 x i16> %0, <8 x i16> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  ret <4 x i16> %vecinit9
}

which is expanded into a large sequence of code going through the stack.

This looks like a simple missed transform to me, not a miscompile

I agree. We hit a phase ordering difference - SLP can reduce the chain of insert/extract to a vector trunc, but it doesn't handle the shuffle-of-bitcast. The open question is where to implement that transform. We're on the edge of instcombine vs. vector-combine if we want to do this in IR. Ie, is there consensus that forming a size-changing vector cast from a shuffle is canonical?
Alternatively, we could defer to the backend, but that could still be viewed as a regression in IR since we have more instructions now.

This patch triggers a regression on our side:

<...>

The tests expects to see:

define dso_local <4 x i16> @truncate_v_v(<4 x i32> %lhs) local_unnamed_addr #0 {
entry:
  %0 = trunc <4 x i32> %lhs to <4 x i16>
  ret <4 x i16> %0
}

which, in machine instructions, is mapped onto a vector trunc instruction.

But now, we see:

define dso_local <4 x i16> @truncate_v_v(<4 x i32> %lhs) local_unnamed_addr #0 {
entry:
  %0 = bitcast <4 x i32> %lhs to <8 x i16>
  %vecinit9 = shufflevector <8 x i16> %0, <8 x i16> undef, <4 x i32> <i32 0, i32 2, i32 4, i32 6>
  ret <4 x i16> %vecinit9
}

which is expanded into a large sequence of code going through the stack.

This looks like a simple missed transform to me, not a miscompile

I agree. We hit a phase ordering difference - SLP can reduce the chain of insert/extract to a vector trunc,
but it doesn't handle the shuffle-of-bitcast. The open question is where to implement that transform.
We're on the edge of instcombine vs. vector-combine if we want to do this in IR.

Ie, is there consensus that forming a size-changing vector cast from a shuffle is canonical?

I would have guessed it is, yes.

Alternatively, we could defer to the backend, but that could still be viewed as a regression in IR since we have more instructions now.

spatel added a comment.Apr 1 2020, 1:56 PM

Ie, is there consensus that forming a size-changing vector cast from a shuffle is canonical?

I would have guessed it is, yes.

Agree - the trunc is better for analysis, and a quick check of various backends says we do worse at codegen of the shuffle than the trunc, so that's more likely to be the expected form.
And not sure if it counts for anything, but the trunc is the more human-readable form (vs. translating shuffle indexes that depend on endian).
I'll draft a patch.

I'll draft a patch.

thanks !

spatel added a comment.Apr 2 2020, 6:33 AM

I'll draft a patch.

thanks !

D77299