This is an archive of the discontinued LLVM Phabricator instance.

[X86][SSE] Detect AVG pattern during instruction combine for SSE2/AVX2/AVX512BW.
ClosedPublic

Authored by congh on Nov 17 2015, 2:59 PM.

Details

Summary

This patch detects the AVG pattern in vectorized code, which is simply c = (a + b + 1) / 2, where a, b, and c all have the same type: a vector whose elements are either unsigned i8 or unsigned i16. In the IR, i8/i16 values are promoted to i32 before any arithmetic operations. The following IR shows such an example:

%1 = zext <N x i8> %a to <N x i32>
%2 = zext <N x i8> %b to <N x i32>
%3 = add nuw nsw <N x i32> %1, <i32 1 x N>
%4 = add nuw nsw <N x i32> %3, %2
%5 = lshr <N x i32> %4, <i32 1 x N>
%6 = trunc <N x i32> %5 to <N x i8>

and with this patch it will be converted to an X86ISD::AVG instruction.
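For concreteness, here is the same pattern instantiated for <16 x i8> (the function name is made up for illustration); with this combine it should lower to a single pavgb (vpavgb with AVX):

define <16 x i8> @avg_v16i8(<16 x i8> %a, <16 x i8> %b) {
  %1 = zext <16 x i8> %a to <16 x i32>
  %2 = zext <16 x i8> %b to <16 x i32>
  %3 = add nuw nsw <16 x i32> %1, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %4 = add nuw nsw <16 x i32> %3, %2
  %5 = lshr <16 x i32> %4, <i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1, i32 1>
  %6 = trunc <16 x i32> %5 to <16 x i8>
  ret <16 x i8> %6
}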

The pattern recognition is done when combining instructions just before type legalization during instruction selection. We do it there because after type legalization it is much more difficult to recognize the pattern among the many instructions that perform type conversions. The consequence is that, for target-specific nodes like X86ISD::AVG, we have to take care of type legalization ourselves. However, since X86ISD::AVG behaves much like ISD::ADD, I wonder whether there is a way to legalize the operand and result types of X86ISD::AVG together with ISD::ADD. The current design does not seem to support this.

Tests are added for SSE2, AVX2, and AVX512BW, covering both i8 and i16 element types across a variety of vector sizes.

Event Timeline

congh updated this revision to Diff 40440.Nov 17 2015, 2:59 PM
congh retitled this revision from to [X86][SSE] Detect AVG pattern during instruction combine for SSE2/AVX2/AVX512BW..
congh updated this object.
congh added reviewers: hfinkel, RKSimon, davidxl.
congh added a subscriber: llvm-commits.
RKSimon edited edge metadata.Nov 21 2015, 7:21 AM

Out of curiosity - how well does this work if InstCombiner::visitCallInst is used to convert _mm_avg_epu16 (etc.) calls to general IR? It should constant fold if possible - but could the lowering work if only one input is constant?

test/CodeGen/X86/avg.ll
4

AVX2/AVX512BW can share an additional AVX prefix - reduce test duplication:

FileCheck %s --check-prefix=AVX --check-prefix=AVX2
FileCheck %s --check-prefix=AVX --check-prefix=AVX512BW
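For reference, the full RUN lines might look something like this (the triple and -mattr flags here are an assumption, not taken from the review):

; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+sse2 | FileCheck %s --check-prefix=SSE2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx2 | FileCheck %s --check-prefix=AVX --check-prefix=AVX2
; RUN: llc < %s -mtriple=x86_64-unknown-unknown -mattr=+avx512bw | FileCheck %s --check-prefix=AVX --check-prefix=AVX512BW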

5

What does the code look like if we load the args instead of passing them in registers? Non-legal types in these cases often make the test cases less clear - in this case with all the pand/packuswb calls.

congh added a comment.Nov 22 2015, 1:26 PM

Out of curiosity - how well does this work if InstCombiner::visitCallInst is used to convert _mm_avg_epu16 (etc.) calls to general IR? It should constant fold if possible - but could the lowering work if only one input is constant?

I didn't consider the case where one input is constant. In that case we are detecting (a + C) / 2, where C is an 8-bit constant greater than zero; we could then perform PAVGW on a and C-1. I will update this patch to take care of this case. Thanks!
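As an illustrative sketch (the vector width, function name, and the constant 42 are made up; shown here with the byte variant, PAVGB): since PAVG computes (x + y + 1) >> 1, the IR below, which computes (a + 42) >> 1, is equivalent to PAVGB of a and 41.

define <4 x i8> @avg_const(<4 x i8> %a) {
  %1 = zext <4 x i8> %a to <4 x i32>
  %2 = add nuw nsw <4 x i32> %1, <i32 42, i32 42, i32 42, i32 42>
  %3 = lshr <4 x i32> %2, <i32 1, i32 1, i32 1, i32 1>
  %4 = trunc <4 x i32> %3 to <4 x i8>
  ret <4 x i8> %4
}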

test/CodeGen/X86/avg.ll
4

I don't get it here: I didn't use the AVX prefix at all. Should I test all SSE versions?

5

If we load v4i8 from memory, those packing instructions will be gone. I will update the test cases.
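A sketch of what such a memory-based test case might look like (the function and argument names are illustrative):

define void @avg_v4i8_mem(<4 x i8>* %pa, <4 x i8>* %pb, <4 x i8>* %pc) {
  %a = load <4 x i8>, <4 x i8>* %pa
  %b = load <4 x i8>, <4 x i8>* %pb
  %1 = zext <4 x i8> %a to <4 x i32>
  %2 = zext <4 x i8> %b to <4 x i32>
  %3 = add nuw nsw <4 x i32> %1, <i32 1, i32 1, i32 1, i32 1>
  %4 = add nuw nsw <4 x i32> %3, %2
  %5 = lshr <4 x i32> %4, <i32 1, i32 1, i32 1, i32 1>
  %6 = trunc <4 x i32> %5 to <4 x i8>
  store <4 x i8> %6, <4 x i8>* %pc
  ret void
}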

congh added inline comments.Nov 22 2015, 3:44 PM
test/CodeGen/X86/avg.ll
4

Now I know what you mean: for some test cases AVX2 and AVX512BW generate the same code. Thanks for the suggestion!

congh updated this revision to Diff 40892.Nov 22 2015, 4:58 PM
congh edited edge metadata.

Update the patch. Now we can detect the AVG pattern with one constant operand.

Couple of minor comments.

lib/Target/X86/X86ISelLowering.cpp
25268

Do the extended vector element types have to be i32? My understanding is that they could be anything wider than the source.

AMD APM v4 description for PAVGB:

An average is computed by adding pairs of operands, adding 1 to a 9-bit temporary sum, and right-shifting the temporary sum by one bit position.

25291

If this is supposed to be FIXME comment please mark it as such.

25303

Standard Style:

for (unsigned i = 0, e = V.getNumOperands(); i != e; ++i)

RKSimon added inline comments.Nov 23 2015, 7:44 AM
lib/Target/X86/X86ISelLowering.cpp
25329

Shouldn't the upper limit be 65536 for pavgw?

congh updated this revision to Diff 40956.Nov 23 2015, 11:28 AM
congh marked 2 inline comments as done.

Update the patch according to Simon's comment.

lib/Target/X86/X86ISelLowering.cpp
25268

You are right. I have updated this part so that this type only has to be wider than i8/i16, in case we do type demotion on intermediate types in the future.

25291

This is fixed already in this patch.

25329

Good catch! Corrected. Thanks!

RKSimon accepted this revision.Nov 23 2015, 2:09 PM
RKSimon edited edge metadata.

LGTM - one minor query, and please could you alter a couple of test cases so the zext extends to something other than vXi32 (vXi16 and vXi64, I guess)?

lib/Target/X86/X86ISelLowering.cpp
25623–25624

Just to be certain - this can only be called for AVX512 truncate stores?

This revision is now accepted and ready to land.Nov 23 2015, 2:09 PM
congh added a comment.Nov 23 2015, 9:26 PM

LGTM - one minor query, and please could you alter a couple of test cases so the zext extends to something other than vXi32 (vXi16 and vXi64, I guess)?

I have replaced many types in the test file with i16/i64.
Thank you very much for the review!

lib/Target/X86/X86ISelLowering.cpp
25623–25624

I am not sure whether a TruncatingStore can be generated on other targets, but I think this code is still valid for targets other than AVX512. Let me know if I am wrong.

This revision was automatically updated to reflect the committed changes.