Details
Diff Detail
- Repository
- rL LLVM
Event Timeline
lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
5415–5416 ↗ | (On Diff #145365) | I'd just call these 'N0' and 'N1' to avoid confusion with the matched values. |
5437 ↗ | (On Diff #145365) | typo: canonicalize |
5439 ↗ | (On Diff #145365) | How did this become an 'or'? We should have a comment/formula to describe the overall transform and why it makes sense: // If we can't form an 'andn' with Y (because it's a constant), // swap the final xor operand, so we can still use 'andn' to invert the mask: // ((X ^ C) & M) ^ C --> ((X ^ C) & NotM') ^ X |
lib/CodeGen/SelectionDAG/DAGCombiner.cpp | ||
---|---|---|
5439 ↗ | (On Diff #145365) | o_O I was initially experimenting with still proceeding to unfold the pattern here, not just de-canonicalize it, That means i need to re-do mca analysis for this. |
Ok, i really messed that up :D
This should be better: https://rise4fun.com/Alive/xJS
Fixed with the correct fold, updated mca diffs in the differential's description:
We need to be careful here (and maybe there's a way for mca to show/warn about this, cc @andreadb).
When you simulate these instructions:
andnl %edx, %edi, %eax orl $42, %edx andnl %edx, %eax, %eax
Notice that the output of the sequence (%eax) is not used again by any instruction in the sequence. So this is measuring ideal throughput in a vacuum - each simulated iteration proceeds independently. Maybe that's what you intended, but the original sequence that you're comparing does not have that property:
andl %edx, %edi notl %edx andl $42, %edx orl %edi, %edx <--- output fed back as the input to first instruction
Each iteration depends on the previous one, so it's not fair to compare the stats for the 2 sequences as they're shown in the attached diff.
test/CodeGen/X86/unfold-masked-merge-scalar-variablemask.ll | ||
---|---|---|
707–710 ↗ | (On Diff #145412) | I'd still prefer to put a comment on tests like this where we're testing a non-canonical form, so we have a reminder that we're testing this for completeness, but it's not expected. Ie, instcombine would improve this sequence to remove the inverted mask. |
At the moment, mca doesn't warn you about cross-iteration dependencies.
When micro-benchmarking, it is up to the user to make sure that it doesn't negatively impact the analysis. The timeline view makes it easier to catch these situations.
When you simulate these instructions:
andnl %edx, %edi, %eax orl $42, %edx andnl %edx, %eax, %eaxNotice that the output of the sequence (%eax) is not used again by any instruction in the sequence. So this is measuring ideal throughput in a vacuum - each simulated iteration proceeds independently. Maybe that's what you intended, but the original sequence that you're comparing does not have that property:
andl %edx, %edi notl %edx andl $42, %edx orl %edi, %edx <--- output fed back as the input to first instructionEach iteration depends on the previous one, so it's not fair to compare the stats for the 2 sequences as they're shown in the attached diff.
I agree with Sanjay.
I also noticed that you often test for ryzen processors. It is also interesting to see what happens on processors with a smaller issue width.
You can manually change the second ANDN from the optimized sequence so that it updates %edx. That would make the two code snippets "sort-of" comparable in term of data dependencies.
With that change, I get that IPC is almost the same. The ANDNL sequence is slightly better, and consumes less cycles mainly because there is one instruction less to execute every iteration. Overall, on btver2, we go from IPC 1.33, to IPC 1.50.
The resource pressure distribution was already optimal in the original case.
The main advantage is that ANDN uses a VEX prefix, and therefore it allows encoding three register operands. That gives a bit more flexibility to the register allocator: the compiler can remove a register dependency at the cost of an extra register use (and a few more bytes in the instruction encoding). Speaking about instruction encoding: the ANDN is 5 bytes (instead of 2 bytes AND), so we go from a total of 9 bytes to a total of 13 bytes for the full sequence. With SimonP, we were thinking about adding instruction encoding information to the llvm-mca output.
If the code is optimized for minsize, then it may be worthy to generate the original sequence with two-address instructions only. It may be less optimal for data-dependencies, but it uses shorter encodings. Not sure if it matters that much though.
Right, thank you :)
I also noticed that you often test for ryzen processors. It is also interesting to see what happens on processors with a smaller issue width.
Noted.
You can manually change the second ANDN from the optimized sequence so that it updates %edx. That would make the two code snippets "sort-of" comparable in term of data dependencies.
With that change, I get that IPC is almost the same. The ANDNL sequence is slightly better, and consumes less cycles mainly because there is one instruction less to execute every iteration. Overall, on btver2, we go from IPC 1.33, to IPC 1.50.
The resource pressure distribution was already optimal in the original case.
The main advantage is that ANDN uses a VEX prefix, and therefore it allows encoding three register operands. That gives a bit more flexibility to the register allocator: the compiler can remove a register dependency at the cost of an extra register use (and a few more bytes in the instruction encoding).
Speaking about instruction encoding: the ANDN is 5 bytes (instead of 2 bytes AND), so we go from a total of 9 bytes to a total of 13 bytes for the full sequence. With SimonP, we were thinking about adding instruction encoding information to the llvm-mca output.
If the code is optimized for minsize, then it may be worthy to generate the original sequence with two-address instructions only. It may be less optimal for data-dependencies, but it uses shorter encodings. Not sure if it matters that much though.
Yeah, i was wondering about -Os, and was thinking that maybe we want to not unfold at all.
But since i don't really know anything (besides the obvious - minimize output binary size) about that, i did not bring it up.
Thank you for the review!
(Do note that it depends on the D46493 which actually updates the hasAndNot()/hasAndNotCompare() TLI hooks for X86.)