Turning a table lookup intrinsic into a shuffle vector instruction can be beneficial. If the mask used for the lookup is the constant vector {7,6,5,4,3,2,1,0}, then the back-end generates rev64 instructions instead.
Details
Diff Detail
Event Timeline
The constant folding of the vld1 intrinsic might worth moving in lib/Analysis/ConstantFolding.cpp as a separate patch, but I wasn't sure whether we always want to fold vld1, even when it's not used as a table lookup mask. I've asked about the matter in llvm-dev.
lib/Transforms/InstCombine/InstCombineCalls.cpp | ||
---|---|---|
938 | I followed the same practice as other conversion functions of this file. We call simplifyTableLookup only when handling the Arm/AArch64 tbl1 intrinsics, so we certainly know that NumElts is 8. The assertion makes sure we don't call this routine from another context. | |
951 | This won't work. ConstantFoldLoadFromConstPtr first tries ConstantFoldLoadThroughGEPConstantExpr. This returns just the first element of the [8xi8] array, since vld1 expects i8*, which is obtained by i8* getelementptr inbounds ([8 x i8], [8 x i8]* @big_endian_mask, i32 0, i32 0) in my reproducer. | |
974 | Turning a table lookup with constant mask into a shuffle vector is not always beneficial. This particular pattern allows the back-end to generate byte reverse instructions instead of a table lookup, which is better. Generalising the transformation for every pattern probably doesn't hurt but it's not beneficial either. However, applying the transformation for larger table lookups (tbl2,tbl3,tbl4) results in worse codegen from the back-end. | |
test/Transforms/InstCombine/AArch64/table-lookup.ll | ||
2 ↗ | (On Diff #144141) | Sure, thanks. |
lib/Transforms/InstCombine/InstCombineCalls.cpp | ||
---|---|---|
951 | ConstantFoldLoadThroughGEPConstantExpr(ConstantExpr::getBitCast(C, VecTy.getPointerTo()) or something like that should work. | |
974 | Sure, it's not always beneficial, but it's likely we can perform some useful transform on the resulting shuffle, and worst-case the backend just turns the shuffle back into a vtbl. And I don't really want to have code here to try to figure out whether a given shuffle is legal; that gets really complicated in general. |
@efriedma I am going to create a new revision for converting a vld1 into an llvm load. I'll then update this revision to keep only the tbl1~>shufflevector conversion. I'll also make the patch accept any constant mask pattern.
I've moved the constant folding to a new revision: https://reviews.llvm.org/D46273. I've also added comments to the tests explaining the reason of this transformation.
Good catch. The current patch hits the assertion when handling the llvm.aarch64.neon.tbl1.v16i8 inrinsic, because NumElts is 16. Does it make sense to perform the transformation in this case? I could get rid of the assert and bail the optimization if NumElts neither 8 nor 16 (or just 8).
lib/Transforms/InstCombine/InstCombineCalls.cpp | ||
---|---|---|
1494 | I guess you could also check that the vector element type is i8? It should be for code generated by clang, but the verifier doesn't actually enforce it, and this code can crash if it isn't. | |
1510 | getLimitedValue() instead of getZExtValue()? Should be the same thing in valid cases, and avoids a potential crash on invalid input. | |
1511 | You need special handling if Index is greater than or equal to NumElts*2. (vtbl produces 0, shufflevector produces undef.) | |
1514 | You could simplify this code a little by using ConstantDataVector::get instead of ConstantVector::get. | |
test/Transforms/InstCombine/AArch64/tbl1.ll | ||
24 | Please add some testcases where the transform bails out. |
- Restrict the optimization to table lookups returning a <8 x i8> vector type. It's only beneficial for the mask {7,6,5,4,3,2,1,0} anyway.
- Bail out if the constant mask contains an index out of range. The second argument of the new shufflevector is always going to be <undef> so the range is [0 ~ NumElts-1], not [0 ~ 2*NumElts-1], like @efriedma noticed.
- Replaced getZExtValue() with getLimitedValue() for getting the value of a ConstantInt.
- Simplified the retrieval of mask indices by using ConstantDataVector::get instead of ConstantVector::get.
- Added testcases where the transform bails out.
- Autogenerated the Filecheck patterns using the script utils/update_test_checks.py
Should we assert here or simply return nullptr?