This is an archive of the discontinued LLVM Phabricator instance.

[DAGCombiner] Masked merge: enhance handling of 'andn' with immediates
ClosedPublic

Authored by lebedev.ri on May 5 2018, 4:29 AM.

Details

Summary

Split off from D46031.

The previous patch, D46493, completely disabled unfolding in case of immediates.
But we can do better:

https://rise4fun.com/Alive/xJS

Diff Detail

Repository
rL LLVM

Event Timeline

spatel added inline comments.May 6 2018, 9:43 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5418–5422

I'd just call these 'N0' and 'N1' to avoid confusion with the matched values.

5440

typo: canonicalize

5442

How did this become an 'or'?
https://rise4fun.com/Alive/UMY

We should have a comment/formula to describe the overall transform and why it makes sense:

// If we can't form an 'andn' with Y (because it's a constant), 
// swap the final xor operand, so we can still use 'andn' to invert the mask:
// ((X ^ C) & M) ^ C -->  ((X ^ C) & NotM') ^ X
lebedev.ri added inline comments.May 6 2018, 9:53 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5442

o_O
Great catch!

I was initially experimenting with still proceeding to unfold the pattern here, not just de-canonicalize it,
and clearly i did not fully undo all the changes..

That means i need to re-do mca analysis for this.

lebedev.ri marked 4 inline comments as done.
lebedev.ri edited the summary of this revision. (Show Details)

Ok, i really messed that up :D
This should be better: https://rise4fun.com/Alive/xJS

Fixed with the correct fold, updated mca diffs in the differential's description:

lebedev.ri updated this revision to Diff 145412.May 6 2018, 1:14 PM

Reduce diff noise.

spatel added a subscriber: andreadb.May 7 2018, 7:38 AM

Fixed with the correct fold, updated mca diffs in the differential's description:

We need to be careful here (and maybe there's a way for mca to show/warn about this, cc @andreadb).

When you simulate these instructions:

andnl %edx, %edi, %eax
orl $42, %edx
andnl %edx, %eax, %eax

Notice that the output of the sequence (%eax) is not used again by any instruction in the sequence. So this is measuring ideal throughput in a vacuum - each simulated iteration proceeds independently. Maybe that's what you intended, but the original sequence that you're comparing does not have that property:

andl	%edx, %edi
notl	%edx
andl	$42, %edx
orl	%edi, %edx    <--- output fed back as the input to first instruction

Each iteration depends on the previous one, so it's not fair to compare the stats for the 2 sequences as they're shown in the attached diff.

test/CodeGen/X86/unfold-masked-merge-scalar-variablemask.ll
707–710

I'd still prefer to put a comment on tests like this where we're testing a non-canonical form, so we have a reminder that we're testing this for completeness, but it's not expected. Ie, instcombine would improve this sequence to remove the inverted mask.

Fixed with the correct fold, updated mca diffs in the differential's description:

We need to be careful here (and maybe there's a way for mca to show/warn about this, cc @andreadb).

At the moment, mca doesn't warn you about cross-iteration dependencies.
When micro-benchmarking, it is up to the user to make sure that it doesn't negatively impact the analysis. The timeline view makes it easier to catch these situations.

When you simulate these instructions:

andnl %edx, %edi, %eax
orl $42, %edx
andnl %edx, %eax, %eax

Notice that the output of the sequence (%eax) is not used again by any instruction in the sequence. So this is measuring ideal throughput in a vacuum - each simulated iteration proceeds independently. Maybe that's what you intended, but the original sequence that you're comparing does not have that property:

andl	%edx, %edi
notl	%edx
andl	$42, %edx
orl	%edi, %edx    <--- output fed back as the input to first instruction

Each iteration depends on the previous one, so it's not fair to compare the stats for the 2 sequences as they're shown in the attached diff.

I agree with Sanjay.

I also noticed that you often test for ryzen processors. It is also interesting to see what happens on processors with a smaller issue width.
You can manually change the second ANDN from the optimized sequence so that it updates %edx. That would make the two code snippets "sort-of" comparable in term of data dependencies.

With that change, I get that IPC is almost the same. The ANDNL sequence is slightly better, and consumes less cycles mainly because there is one instruction less to execute every iteration. Overall, on btver2, we go from IPC 1.33, to IPC 1.50.

The resource pressure distribution was already optimal in the original case.
The main advantage is that ANDN uses a VEX prefix, and therefore it allows encoding three register operands. That gives a bit more flexibility to the register allocator: the compiler can remove a register dependency at the cost of an extra register use (and a few more bytes in the instruction encoding). Speaking about instruction encoding: the ANDN is 5 bytes (instead of 2 bytes AND), so we go from a total of 9 bytes to a total of 13 bytes for the full sequence. With SimonP, we were thinking about adding instruction encoding information to the llvm-mca output.
If the code is optimized for minsize, then it may be worthy to generate the original sequence with two-address instructions only. It may be less optimal for data-dependencies, but it uses shorter encodings. Not sure if it matters that much though.

Fixed with the correct fold, updated mca diffs in the differential's description:

We need to be careful here (and maybe there's a way for mca to show/warn about this, cc @andreadb).

At the moment, mca doesn't warn you about cross-iteration dependencies.
When micro-benchmarking, it is up to the user to make sure that it doesn't negatively impact the analysis. The timeline view makes it easier to catch these situations.

When you simulate these instructions:

andnl %edx, %edi, %eax
orl $42, %edx
andnl %edx, %eax, %eax

Notice that the output of the sequence (%eax) is not used again by any instruction in the sequence. So this is measuring ideal throughput in a vacuum - each simulated iteration proceeds independently. Maybe that's what you intended, but the original sequence that you're comparing does not have that property:

andl	%edx, %edi
notl	%edx
andl	$42, %edx
orl	%edi, %edx    <--- output fed back as the input to first instruction

Each iteration depends on the previous one, so it's not fair to compare the stats for the 2 sequences as they're shown in the attached diff.

I agree with Sanjay.

Right, thank you :)

I also noticed that you often test for ryzen processors. It is also interesting to see what happens on processors with a smaller issue width.

Noted.

You can manually change the second ANDN from the optimized sequence so that it updates %edx. That would make the two code snippets "sort-of" comparable in term of data dependencies.

With that change, I get that IPC is almost the same. The ANDNL sequence is slightly better, and consumes less cycles mainly because there is one instruction less to execute every iteration. Overall, on btver2, we go from IPC 1.33, to IPC 1.50.

The resource pressure distribution was already optimal in the original case.
The main advantage is that ANDN uses a VEX prefix, and therefore it allows encoding three register operands. That gives a bit more flexibility to the register allocator: the compiler can remove a register dependency at the cost of an extra register use (and a few more bytes in the instruction encoding).

Speaking about instruction encoding: the ANDN is 5 bytes (instead of 2 bytes AND), so we go from a total of 9 bytes to a total of 13 bytes for the full sequence. With SimonP, we were thinking about adding instruction encoding information to the llvm-mca output.
If the code is optimized for minsize, then it may be worthy to generate the original sequence with two-address instructions only. It may be less optimal for data-dependencies, but it uses shorter encodings. Not sure if it matters that much though.

Yeah, i was wondering about -Os, and was thinking that maybe we want to not unfold at all.
But since i don't really know anything (besides the obvious - minimize output binary size) about that, i did not bring it up.

lebedev.ri updated this revision to Diff 145493.May 7 2018, 9:48 AM
lebedev.ri marked an inline comment as done.

Rebased.

spatel accepted this revision.May 7 2018, 9:56 AM

LGTM.

This revision is now accepted and ready to land.May 7 2018, 9:56 AM

LGTM.

Thank you for the review!

(Do note that it depends on the D46493 which actually updates the hasAndNot()/hasAndNotCompare() TLI hooks for X86.)

This revision was automatically updated to reflect the committed changes.