This is an archive of the discontinued LLVM Phabricator instance.

[InstCombine, ARM, AArch64] Convert table lookup to shuffle vector
ClosedPublic

Authored by labrinea on Apr 26 2018, 9:51 AM.

Details

Summary

Turning a table lookup intrinsic into a shuffle vector instruction can be beneficial. If the mask used for the lookup is the constant vector {7,6,5,4,3,2,1,0}, then the back-end generates rev64 instructions instead.

Diff Detail

Event Timeline

labrinea created this revision.Apr 26 2018, 9:51 AM

The constant folding of the vld1 intrinsic might worth moving in lib/Analysis/ConstantFolding.cpp as a separate patch, but I wasn't sure whether we always want to fold vld1, even when it's not used as a table lookup mask. I've asked about the matter in llvm-dev.

lebedev.ri edited reviewers, added: spatel; removed: llvm-commits.Apr 26 2018, 9:58 AM
efriedma added inline comments.
lib/Transforms/InstCombine/InstCombineCalls.cpp
951

Please use llvm::ConstantFoldLoadFromConstPtr.

974

Why does the mask pattern matter?

javed.absar added inline comments.Apr 26 2018, 1:31 PM
lib/Transforms/InstCombine/InstCombineCalls.cpp
938

Should we assert here or simply return nullptr?

test/Transforms/InstCombine/AArch64/table-lookup.ll
2 ↗(On Diff #144141)

It would be nice to add a comment here to explain the purpose of this test.

labrinea added inline comments.Apr 27 2018, 1:55 AM
lib/Transforms/InstCombine/InstCombineCalls.cpp
938

I followed the same practice as other conversion functions of this file. We call simplifyTableLookup only when handling the Arm/AArch64 tbl1 intrinsics, so we certainly know that NumElts is 8. The assertion makes sure we don't call this routine from another context.

951

This won't work. ConstantFoldLoadFromConstPtr first tries ConstantFoldLoadThroughGEPConstantExpr. This returns just the first element of the [8xi8] array, since vld1 expects i8*, which is obtained by i8* getelementptr inbounds ([8 x i8], [8 x i8]* @big_endian_mask, i32 0, i32 0) in my reproducer.

974

Turning a table lookup with constant mask into a shuffle vector is not always beneficial. This particular pattern allows the back-end to generate byte reverse instructions instead of a table lookup, which is better. Generalising the transformation for every pattern probably doesn't hurt but it's not beneficial either. However, applying the transformation for larger table lookups (tbl2,tbl3,tbl4) results in worse codegen from the back-end.

test/Transforms/InstCombine/AArch64/table-lookup.ll
2 ↗(On Diff #144141)

Sure, thanks.

efriedma added inline comments.Apr 27 2018, 12:30 PM
lib/Transforms/InstCombine/InstCombineCalls.cpp
951

ConstantFoldLoadThroughGEPConstantExpr(ConstantExpr::getBitCast(C, VecTy.getPointerTo()) or something like that should work.

974

Sure, it's not always beneficial, but it's likely we can perform some useful transform on the resulting shuffle, and worst-case the backend just turns the shuffle back into a vtbl. And I don't really want to have code here to try to figure out whether a given shuffle is legal; that gets really complicated in general.

@efriedma I am going to create a new revision for converting a vld1 into an llvm load. I'll then update this revision to keep only the tbl1~>shufflevector conversion. I'll also make the patch accept any constant mask pattern.

labrinea updated this revision to Diff 144699.May 1 2018, 5:22 AM

I've moved the constant folding to a new revision: https://reviews.llvm.org/D46273. I've also added comments to the tests explaining the reason of this transformation.

Please add a testcase for llvm.aarch64.neon.tbl1.v16i8.

Good catch. The current patch hits the assertion when handling the llvm.aarch64.neon.tbl1.v16i8 inrinsic, because NumElts is 16. Does it make sense to perform the transformation in this case? I could get rid of the assert and bail the optimization if NumElts neither 8 nor 16 (or just 8).

labrinea updated this revision to Diff 144867.May 2 2018, 6:13 AM
efriedma added inline comments.May 18 2018, 1:40 PM
lib/Transforms/InstCombine/InstCombineCalls.cpp
1494

I guess you could also check that the vector element type is i8? It should be for code generated by clang, but the verifier doesn't actually enforce it, and this code can crash if it isn't.

1510

getLimitedValue() instead of getZExtValue()? Should be the same thing in valid cases, and avoids a potential crash on invalid input.

1511

You need special handling if Index is greater than or equal to NumElts*2. (vtbl produces 0, shufflevector produces undef.)

1514

You could simplify this code a little by using ConstantDataVector::get instead of ConstantVector::get.

test/Transforms/InstCombine/AArch64/tbl1.ll
24

Please add some testcases where the transform bails out.

labrinea updated this revision to Diff 148008.May 22 2018, 8:27 AM
  • Restrict the optimization to table lookups returning a <8 x i8> vector type. It's only beneficial for the mask {7,6,5,4,3,2,1,0} anyway.
  • Bail out if the constant mask contains an index out of range. The second argument of the new shufflevector is always going to be <undef> so the range is [0 ~ NumElts-1], not [0 ~ 2*NumElts-1], like @efriedma noticed.
  • Replaced getZExtValue() with getLimitedValue() for getting the value of a ConstantInt.
  • Simplified the retrieval of mask indices by using ConstantDataVector::get instead of ConstantVector::get.
  • Added testcases where the transform bails out.
  • Autogenerated the Filecheck patterns using the script utils/update_test_checks.py
This revision is now accepted and ready to land.May 23 2018, 11:51 AM
This revision was automatically updated to reflect the committed changes.