This is an archive of the discontinued LLVM Phabricator instance.

[SelectionDAG] Fold redundant masking operations of shifted value
Abandoned (Public)

Authored by dnsampaio on Jun 18 2018, 5:33 AM.

Details

Summary

Allow folding of redundant shift masks.
For example:
x1 = x & 0xAB00
x2 = (x >> 8) & 0xAB

can be reduced to:
x1 = x & 0xAB00
x2 = x1 >> 8
Folding is only performed when the masks and shift amounts are constants.
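
As a standalone illustration (not part of the patch), the fold is sound because the narrow mask equals the wide mask shifted by the same amount, so (x >> 8) & 0xAB is exactly (x & 0xAB00) >> 8. A minimal C++ sketch that checks the equivalence:

#include <cassert>
#include <cstdint>

int main() {
  for (uint32_t x = 0; x <= 0xFFFFu; ++x) {
    uint32_t x1 = x & 0xAB00u;      // wide mask applied to the value
    uint32_t x2 = (x >> 8) & 0xABu; // shifted value, narrow mask
    assert(x2 == (x1 >> 8));        // folded form: reuse x1, drop one mask
  }
  return 0;
}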

Diff Detail

Event Timeline

dnsampaio created this revision.Jun 18 2018, 5:33 AM

Hi Diogo,

Some nitpick comments. Please add negative tests for shifts with multiple uses and for cases where the shift amount and mask aren't constants.

cheers,
sam

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4115

Why not check that you have a supported shift here and potentially exit early?

4154

Remove or uncomment.

4180

use LLVM_FALLTHROUGH

4189

remove

4203

I think you should be adding NewShift instead.

test/CodeGen/ARM/2018_05_29_FoldRedundantMask.ll
119 (On Diff #151700)

Would you mind updating the target triple to something more modern, like v7 and v7m, to check that the code generation is better here? This extra load concerns me a bit.

dnsampaio updated this revision to Diff 154125.Jul 4 2018, 9:00 AM
dnsampaio marked 6 inline comments as done.
dnsampaio retitled this revision from [SelectionDAG]Reduce masked data movement chains and memory access widths pt2 to [SelectionDAG] Fold redundant masking operations of shifted value.
dnsampaio edited the summary of this revision. (Show Details)

Replaced the tests so that they do not depend on the load width reduction.
Added one positive test per case and two negative tests: one where a mask is not a constant and one where the shift amount is not a constant.

samparker accepted this revision.Jul 5 2018, 1:24 AM

LGTM. For future reference, and before committing, arm and aarch64 tests live in different codegen directories so please separate the test into the two sub-directories. Thanks!

This revision is now accepted and ready to land.Jul 5 2018, 1:24 AM
dnsampaio updated this revision to Diff 154217.Jul 5 2018, 6:15 AM

Moved the AArch64 tests into the AArch64 folder.

dnsampaio updated this revision to Diff 154219.Jul 5 2018, 6:16 AM

Give patch more context.

spatel requested changes to this revision.Jul 8 2018, 8:44 AM
spatel added a subscriber: spatel.

This patch was reverted with rL336453 because it caused:
https://bugs.llvm.org/show_bug.cgi?id=38084

Beyond that, I don't understand the motivation. The patch increases the latency of a computation. Why is that beneficial? The x86 diff doesn't look like a win to me.

I don't know what the ARM/AArch64 output looked like before this patch. Always check in the baseline tests before making a code change, so we have that as a reference (and in case the patch is reverted, we won't lose the test coverage that motivated the code patch).

This revision now requires changes to proceed.Jul 8 2018, 8:44 AM

Like @spatel I'm not clear on what you're really trying to accomplish here - has the arm/arm64 codegen improved?

test/CodeGen/AArch64/FoldRedundantShiftedMasking.ll
2

There's no need to use a check-prefix; remove it and use CHECK.

89

Confusing - please move the shl_nogood checks before the shl_nogood2 define

test/CodeGen/ARM/FoldRedundantShiftedMasking.ll
5

There's no need to use a check-prefix; remove it and use CHECK.

90

Confusing - please move the shl_nogood checks before the shl_nogood2 define

dnsampaio marked 4 inline comments as done.EditedJul 9 2018, 3:36 AM

This patch was reverted with rL336453 because it caused:
https://bugs.llvm.org/show_bug.cgi?id=38084

Sorry about that.

Beyond that, I don't understand the motivation. The patch increases the latency of a computation. Why is that beneficial? The x86 diff doesn't look like a win to me.

It reduces the number of computation operations, from 3 to 2, and the number of constants kept for performing the masking, from 2 to 1.
I don't see how it increases the latency if you are going to perform the masking and the shift anyway.

I don't know what the ARM/AArch output looked like before this patch. Always check in the baseline tests before making a code change, so we have that as a reference (and in case the patch is reverted, we won't lose the test coverage that motivated the code patch).

See the inline comments for the code changes in x86-64, AArch64 and ARM.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
4115

All checks are already at the top of the function (except for the ISD::ROTL case). Making a single big if statement just makes it harder to understand and harder for someone to add an exception to the rule. Technically speaking, having one big check or several checks one after the other is the same thing.

4180

Added LLVM_FALLTHROUGH. But GCC 8.1 still gives a warning, so I'm also keeping the /* fall-through */ comment:

/home/diosam01/LLVM/local/src/lib/CodeGen/SelectionDAG/DAGCombiner.cpp: In member function ‘llvm::SDValue {anonymous}::DAGCombiner::foldRedundantShiftedMasks(llvm::SDNode*)’:
/home/diosam01/LLVM/local/src/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:4180:18: warning: this statement may fall through [-Wimplicit-fallthrough=]

N0Opcode = ISD::SRL;
~~~~~~~~~^~~~~

/home/diosam01/LLVM/local/src/lib/CodeGen/SelectionDAG/DAGCombiner.cpp:4182:5: note: here

case ISD::SRL:
^~~~
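
For reference, a rough standalone C++17 sketch (not the DAGCombiner code) of the annotation under discussion: LLVM_FALLTHROUGH expands to the [[fallthrough]] attribute where the compiler supports it, and some GCC -Wimplicit-fallthrough levels also accept a /* fall-through */ style comment instead.

#include <cstdio>

static int remap(int op) {
  int result = 0;
  switch (op) {
  case 0:
    result += 1;     // adjust, then intentionally continue into case 1
    [[fallthrough]]; // what LLVM_FALLTHROUGH is meant to produce
  case 1:
    result += 2;
    break;
  default:
    result = -1;
    break;
  }
  return result;
}

int main() {
  std::printf("%d %d\n", remap(0), remap(1)); // prints "3 2"
  return 0;
}
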
4203

No, NewShift and its users are added to the worklist automatically at DAGCombiner.cpp lines 1483 and 1484. In the log you can see that a transformed node is always visited again in the same combining phase. I want to add the other user of this value, as it might be the sole user left.

test/CodeGen/AArch64/FoldRedundantShiftedMasking.ll
4–13
  • Before patch:

mov w9, #15
mov w8, #3855
movk w9, #3840, lsl #16
and w8, w0, w8
and w9, w9, w0, ror #8
orr w0, w9, w8

  • After patch:

mov w8, #3855
and w8, w0, w8
orr w0, w8, w8, ror #8

19–27
  • Before patch:

sxth w8, w0
mov w9, #172
mov w10, #44032
and w9, w8, w9
and w8, w10, w8, lsl #8
orr w0, w9, w8

  • After patch:

mov w8, #172
and w8, w0, w8
orr w0, w8, w8, lsl #8

19–27
  • In x86-64:
  • Before patch:

movswl %di, %eax
movl %eax, %ecx
andl $172, %ecx
shll $8, %eax
andl $44032, %eax # imm = 0xAC00
orl %ecx, %eax

  • After patch:

andl $172, %edi
movl %edi, %eax
shll $8, %eax
leal (%rax,%rdi), %eax

33–41
  • Before patch:

sxth w8, w0
mov w9, #44032
mov w10, #172
and w9, w8, w9
and w8, w10, w8, lsr #8
orr w0, w9, w8

  • After patch:

mov w8, #44032
and w8, w0, w8
orr w0, w8, w8, lsr #8

33–41
  • In x86-64:
  • Before patch:

movswl %di, %eax
movl %eax, %ecx
andl $44032, %ecx # imm = 0xAC00
shrl $8, %eax
andl $172, %eax
orl %ecx, %eax

  • After patch:

andl $44032, %edi # imm = 0xAC00
movl %edi, %eax
shrl $8, %eax
leal (%rax,%rdi), %eax

47–55
  • Before patch:

sxth w8, w0
mov w9, #44032
mov w10, #172
and w9, w8, w9
and w8, w10, w8, lsr #8
orr w0, w9, w8

  • After patch:

mov w8, #44032
and w8, w0, w8
orr w0, w8, w8, lsr #8

47–55
  • In x86-64:
  • Before patch:

movswl %di, %eax
movl %eax, %ecx
andl $44032, %ecx # imm = 0xAC00
shrl $8, %eax
andl $172, %eax
orl %ecx, %eax

  • After patch:

andl $44032, %edi # imm = 0xAC00
movl %edi, %eax
shrl $8, %eax
leal (%rax,%rdi), %eax

test/CodeGen/ARM/FoldRedundantShiftedMasking.ll
23
  • Before patch:

mov r1, #15
mov r2, #15
orr r1, r1, #3840
orr r2, r2, #251658240
and r1, r0, r1
and r0, r2, r0, ror #8
orr r0, r0, r1

  • After patch:

mov r1, #15
orr r1, r1, #3840
and r0, r0, r1
orr r0, r0, r0, ror #8

24–35
  • Before patch:

lsl r0, r0, #16
mov r1, #172
and r0, r1, r0, asr #16
orr r0, r0, r0, lsl #8

  • After patch:

and r0, r0, #172
orr r0, r0, r0, lsl #8

37–45
  • Before patch:

lsl r0, r0, #16
mov r1, #44032
and r1, r1, r0, asr #16
asr r0, r0, #16
mov r2, #172
and r0, r2, r0, lsr #8
orr r0, r1, r0

  • After patch:

and r0, r0, #44032
orr r0, r0, r0, lsr #8

50–58
  • Before patch:

lsl r0, r0, #16
mov r1, #44032
and r1, r1, r0, asr #16
asr r0, r0, #16
mov r2, #172
and r0, r2, r0, lsr #8
orr r0, r1, r0

  • After patch:

and r0, r0, #44032
orr r0, r0, r0, lsr #8

dnsampaio marked 14 inline comments as done.Jul 9 2018, 3:47 AM

Like @spatel I'm not clear on what you're really trying to accomplish here - has the arm/arm64 codegen improved?

In these examples it reduces the required ARM instructions and constants roughly by half, as the OR and shift operations can be combined.

spatel added a comment.Jul 9 2018, 7:05 AM

It reduces the number of computation operations, from 3 to 2, and the number of constants kept for performing the masking, from 2 to 1.
I don't see how it increases the latency if you are going to perform the masking and the shift anyway.

Ah, I see that now. But I'm not convinced this is the right approach. Why are we waiting to optimize this in the backend? This is a universally good optimization, so it should be in IR:
https://rise4fun.com/Alive/O04

I'm not sure exactly where that optimization belongs. Ie, is it EarlyCSE, GVN, somewhere else, or is it its own pass? But I don't see any benefit in waiting to do this in the DAG.

spatel added a comment.Jul 9 2018, 7:15 AM

This also raises a question that has come up in another review recently - D41233. If we reverse the canonicalization of shl+and, we would solve the most basic case that I showed above:

define i32 @shl_first(i32 %a) {
  %t2 = shl i32 %a, 8
  %t3 = and i32 %t2, 44032
  ret i32 %t3
}

define i32 @mask_first(i32 %a) {
  %a2 = and i32 %a, 172
  %a3 = shl i32 %a2, 8
  ret i32 %a3
}
dnsampaio updated this revision to Diff 154598.Jul 9 2018, 7:21 AM

Added checks for the shift opcodes so as to exit early if they are not found.
Validate mask widths (although it would be an error in the code if they were different).

dnsampaio added a comment.EditedJul 9 2018, 7:24 AM

Ah, I see that now. But I'm not convinced this is the right approach. Why are we waiting to optimize this in the backend? This is a universally good optimization, so it should be in IR:

Agree, I also intend to implement this transformation in the IR. But there are cases where this is only seen after some instructions have been combined in the DAG, so why not here also? And indeed, it is a requirement for a future patch that detects opportunities to reduce load and store widths.

Ah, I see that now. But I'm not convinced this is the right approach. Why are we waiting to optimize this in the backend? This is a universally good optimization, so it should be in IR:

Agree, I also intend to implement this transformation in the IR. But there are cases where this is only seen after some instructions have been combined in the DAG, so why not here also? And indeed, it is a requirement for a future patch that detects opportunities to reduce load and store widths.

IMO, this is backwards: we optimize first in IR (because the sooner we can fold something like this, the more it helps later transforms). Then, only if there's reason to create redundancy (because the patterns emerge late) do we repeat a fold in the backend. Can you post a motivating example that would not be solved by IR transforms, so we can see why this is necessary for the DAG?

efriedma added inline comments.Jul 10 2018, 12:05 PM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5986

Walking the uses of an SDNode can get expensive... is this O(N^2) in the number of uses of MaskedValue? If not, please add a comment explaining why it isn't.

This also raises a question that has come up in another review recently - D41233. If we reverse the canonicalization of shl+and, we would solve the most basic case that I showed above:

define i32 @shl_first(i32 %a) {
  %t2 = shl i32 %a, 8
  %t3 = and i32 %t2, 44032
  ret i32 %t3
}

define i32 @mask_first(i32 %a) {
  %a2 = and i32 %a, 172
  %a3 = shl i32 %a2, 8
  ret i32 %a3
}

This "canonicalization" won't help to prevent even basic duplicated masked values when using a lshr:

%0 = sext i16 %a to i32
%1 = lshr i32 %0, 8
%2 = and i32 %1, 172
%3 = and i32 %0, 44032

And here is the simplest case, which is already in the test case, that won't be handled at the IR level:
define i32 @ror(i32 %a) {
entry:
  %m2 = and i32 %a, 3855
  %shl = shl i32 %a, 24
  %shr = lshr i32 %a, 8
  %or = or i32 %shl, %shr
  %m1 = and i32 %or, 251658255
  %or2 = or i32 %m1, %m2
  ret i32 %or2
}

The shl and lshr instructions combine into a ror, so the same value ends up being masked twice.

lib/CodeGen/SelectionDAG/DAGCombiner.cpp
5986

OK, I will change it so as to work only when MaskedValue is used by a shift and an AND operation.

dnsampaio updated this revision to Diff 155375.Jul 13 2018, 6:59 AM
dnsampaio marked 2 inline comments as done.

It now only accepts instructions with 2 uses (the AND and the shift operation), so that looping through the uses is not expensive, and we avoid it in most cases.
Removed the recursive bit-value computations (computeKnownBits).

efriedma added inline comments.Jul 13 2018, 11:22 AM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6104

hasNUsesOfValue(). (use_size is linear in the number of uses.)
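
In other words, an "exactly N uses" query can bail out as soon as the count exceeds N, whereas counting every use first is always linear. A plain C++ sketch of the idea (not the actual SDNode API):

#include <forward_list>

// Works on any forward range whose size is not known up front,
// analogous to walking a use list.
template <typename Range>
static bool hasExactlyNUses(const Range &Uses, unsigned N) {
  unsigned Seen = 0;
  for (auto It = Uses.begin(), End = Uses.end(); It != End; ++It)
    if (++Seen > N)
      return false; // early exit once the count exceeds N
  return Seen == N;
}

int main() {
  std::forward_list<int> Uses = {1, 2, 3};
  return hasExactlyNUses(Uses, 3) ? 0 : 1;
}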

dnsampaio updated this revision to Diff 155571.Jul 14 2018, 2:23 PM
dnsampaio marked an inline comment as done.

Replaced num_uses with !hasNUsesOfValue as requested.

dnsampaio added inline comments.Jul 14 2018, 2:24 PM
lib/CodeGen/SelectionDAG/DAGCombiner.cpp
6104

Fixed.

dnsampaio marked an inline comment as done.Jul 16 2018, 3:10 AM

Please can you regenerate the diff with context?

Added context.

dnsampaio abandoned this revision.Dec 28 2018, 9:27 AM