This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] Masked merge: enhance handling of 'andn' with immediates
ClosedPublic

Authored by lebedev.ri on May 5 2018, 4:29 AM.

Details

Summary

Split off from D46031.

The previous patch, D46493, completely disabled unfolding in case of immediates.
But we can do better:

https://rise4fun.com/Alive/xJS
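For context, the unfolding discussed in this patch series is built around the classic masked-merge identity. The following is a quick numeric sanity check of that identity (a sketch, not the exact DAG-level fold proven in the Alive link; the constant `C = 42` stands in for the immediate operand):

```python
# The classic masked-merge identity:
#   (x & m) | (C & ~m)  ==  ((x ^ C) & m) ^ C
# checked here for 32-bit values with C a constant, as in this patch.
import random

M32 = 0xFFFFFFFF
C = 42  # arbitrary immediate

def unfolded(x, m):
    # Select bits of x where the mask is set, bits of C where it is clear.
    return ((x & m) | (C & ~m & M32)) & M32

def folded(x, m):
    # The canonical xor-and-xor form of the same select.
    return (((x ^ C) & m) ^ C) & M32

for _ in range(10000):
    x, m = random.getrandbits(32), random.getrandbits(32)
    assert folded(x, m) == unfolded(x, m)
```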

Diff Detail

Repository
rL LLVM

Event Timeline

spatel added inline comments.May 6 2018, 9:43 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5415–5416 ↗(On Diff #145365)

I'd just call these 'N0' and 'N1' to avoid confusion with the matched values.

5437 ↗(On Diff #145365)

typo: canonicalize

5439 ↗(On Diff #145365)

How did this become an 'or'?
https://rise4fun.com/Alive/UMY

We should have a comment/formula to describe the overall transform and why it makes sense:

// If we can't form an 'andn' with Y (because it's a constant), 
// swap the final xor operand, so we can still use 'andn' to invert the mask:
// ((X ^ C) & M) ^ C -->  ((X ^ C) & NotM') ^ X
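The swapped-operand form in that suggested comment does check out bitwise, reading NotM' as the bitwise complement of the mask. A quick sanity check (helper names are mine, not from the patch):

```python
# Verify: ((X ^ C) & M) ^ C  ==  ((X ^ C) & ~M) ^ X  for 32-bit values.
# Per bit: where M is set, the left side gives (X ^ C) ^ C = X and the
# right side gives 0 ^ X = X; where M is clear, the left side gives
# 0 ^ C = C and the right side gives (X ^ C) ^ X = C.
import random

M32 = 0xFFFFFFFF
C = 42  # arbitrary immediate

def lhs(x, m):
    return (((x ^ C) & m) ^ C) & M32

def rhs(x, m):
    return (((x ^ C) & (~m & M32)) ^ x) & M32

for _ in range(10000):
    x, m = random.getrandbits(32), random.getrandbits(32)
    assert lhs(x, m) == rhs(x, m)
```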
lebedev.ri added inline comments.May 6 2018, 9:53 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5439 ↗(On Diff #145365)

o_O
Great catch!

I was initially experimenting with still proceeding to unfold the pattern here, not just de-canonicalize it, and clearly I did not fully undo all the changes...

That means I need to re-do the mca analysis for this.

lebedev.ri marked 4 inline comments as done.
lebedev.ri edited the summary of this revision. (Show Details)

Ok, I really messed that up :D
This should be better: https://rise4fun.com/Alive/xJS

Fixed with the correct fold; updated mca diffs in the differential's description.

lebedev.ri updated this revision to Diff 145412.May 6 2018, 1:14 PM

Reduce diff noise.

spatel added a subscriber: andreadb.May 7 2018, 7:38 AM

> Fixed with the correct fold; updated mca diffs in the differential's description.

We need to be careful here (and maybe there's a way for mca to show/warn about this, cc @andreadb).

When you simulate these instructions:

andnl %edx, %edi, %eax
orl $42, %edx
andnl %edx, %eax, %eax

Notice that the output of the sequence (%eax) is not used again by any instruction in the sequence. So this is measuring ideal throughput in a vacuum - each simulated iteration proceeds independently. Maybe that's what you intended, but the original sequence that you're comparing does not have that property:

andl	%edx, %edi
notl	%edx
andl	$42, %edx
orl	%edi, %edx    <--- output fed back as the input to first instruction

Each iteration depends on the previous one, so it's not fair to compare the stats for the 2 sequences as they're shown in the attached diff.
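Setting the throughput question aside, the two sequences do compute the same value. Here is a Python model of both (a sketch; the register roles are an assumption read off the asm, with %edi holding the live input x, %edx the mask m, and 42 the immediate; BMI1 ANDN in AT&T syntax, `andnl src, a, dst`, computes `dst = ~a & src`):

```python
# Model the two x86 sequences from the diff and check they agree on
# all 32-bit inputs (spot-checked at random).
import random

M32 = 0xFFFFFFFF

def original(x, m):
    # andl %edx, %edi ; notl %edx ; andl $42, %edx ; orl %edi, %edx
    edi = x & m
    edx = ~m & M32
    edx &= 42
    return (edx | edi) & M32

def unfolded(x, m):
    # andnl %edx, %edi, %eax   -> eax = ~edi & edx
    # orl $42, %edx
    # andnl %edx, %eax, %eax   -> eax = ~eax & edx
    eax = ~x & m & M32
    edx = (m | 42) & M32
    return ~eax & edx & M32

for _ in range(10000):
    x, m = random.getrandbits(32), random.getrandbits(32)
    assert original(x, m) == unfolded(x, m)
```

The equivalence holds by absorption: expanding the unfolded form gives `(x & m) | (x & 42) | (42 & ~m)`, and every bit of `x & 42` is already covered by one of the other two terms.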

test/CodeGen/X86/unfold-masked-merge-scalar-variablemask.ll
707–710 ↗(On Diff #145412)

I'd still prefer to put a comment on tests like this where we're testing a non-canonical form, so we have a reminder that we're testing this for completeness, but it's not expected. Ie, instcombine would improve this sequence to remove the inverted mask.

> We need to be careful here (and maybe there's a way for mca to show/warn about this, cc @andreadb).

At the moment, mca doesn't warn you about cross-iteration dependencies.
When micro-benchmarking, it is up to the user to make sure that it doesn't negatively impact the analysis. The timeline view makes it easier to catch these situations.


I agree with Sanjay.

I also noticed that you often test for Ryzen processors. It would also be interesting to see what happens on processors with a smaller issue width.
You can manually change the second ANDN in the optimized sequence so that it updates %edx. That would make the two code snippets "sort-of" comparable in terms of data dependencies.

With that change, I get that the IPC is almost the same. The ANDNL sequence is slightly better and consumes fewer cycles, mainly because there is one instruction less to execute every iteration. Overall, on btver2, we go from an IPC of 1.33 to 1.50.
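Rough arithmetic behind those figures (a sketch; the per-iteration cycle counts are an assumption consistent with the quoted IPC numbers, not something measured in this thread):

```python
# IPC = instructions retired per iteration / cycles per iteration.
# 4 instructions in ~3 cycles vs. 3 instructions in ~2 cycles on btver2
# reproduces the quoted 1.33 -> 1.50 improvement.
orig_ipc = 4 / 3
new_ipc = 3 / 2
print(f"original IPC ~{orig_ipc:.2f}, unfolded IPC ~{new_ipc:.2f}")
```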

The resource pressure distribution was already optimal in the original case.
The main advantage is that ANDN uses a VEX prefix and therefore allows encoding three register operands. That gives a bit more flexibility to the register allocator: the compiler can remove a register dependency at the cost of an extra register use (and a few more bytes in the instruction encoding).

Speaking of instruction encoding: ANDN is 5 bytes (instead of the 2-byte AND), so we go from a total of 9 bytes to a total of 13 bytes for the full sequence. With SimonP, we were thinking about adding instruction encoding information to the llvm-mca output.
If the code is optimized for minsize, then it may be worthwhile to generate the original sequence with two-address instructions only. It may be less optimal for data dependencies, but it uses shorter encodings. Not sure if it matters that much, though.


> I agree with Sanjay.

Right, thank you :)

> I also noticed that you often test for Ryzen processors. It would also be interesting to see what happens on processors with a smaller issue width.

Noted.

> Speaking of instruction encoding: ANDN is 5 bytes (instead of the 2-byte AND), so we go from a total of 9 bytes to a total of 13 bytes for the full sequence. With SimonP, we were thinking about adding instruction encoding information to the llvm-mca output.
> If the code is optimized for minsize, then it may be worthwhile to generate the original sequence with two-address instructions only. It may be less optimal for data dependencies, but it uses shorter encodings. Not sure if it matters that much, though.

Yeah, I was wondering about -Os, and was thinking that maybe we don't want to unfold at all there.
But since I don't really know anything about that (besides the obvious: minimize output binary size), I did not bring it up.

lebedev.ri updated this revision to Diff 145493.May 7 2018, 9:48 AM
lebedev.ri marked an inline comment as done.

Rebased.

spatel accepted this revision.May 7 2018, 9:56 AM

LGTM.

This revision is now accepted and ready to land.May 7 2018, 9:56 AM

> LGTM.

Thank you for the review!

(Do note that it depends on D46493, which actually updates the hasAndNot()/hasAndNotCompare() TLI hooks for X86.)

This revision was automatically updated to reflect the committed changes.